scardac

Waveform archive data availability collector.

Description

scardac scans an SDS waveform archive, e.g., created by slarchive or scart, for available miniSEED data. It collects information about

  • DataExtents – the earliest and latest times data is available for a particular channel,

  • DataAttributeExtents – the earliest and latest times data is available for a particular channel, quality and sampling rate combination,

  • DataSegments – continuous data segments sharing the same quality and sampling rate attributes.

scardac is intended to be executed periodically, e.g., as a cronjob.

The availability data information is stored in the SeisComP database under the root element DataAvailability. Access to the availability data is provided by the fdsnws module via the services:

  • /fdsnws/station (extent information only, see matchtimeseries and includeavailability request parameters).

  • /fdsnws/ext/availability (extent and segment information provided in different formats)

Non-SDS archives

scardac can be extended by plugins to scan non-SDS archives. For example, the daccaps plugin provided by CAPS [3] allows scanning archives generated by a CAPS server. Plugins are added to the global module configuration, e.g.:

plugins = ${plugins}, daccaps

Definitions

  • Record – continuous waveform data of the same sampling rate and quality, bounded by a start and an end time. scardac will only read the record’s metadata, not the actual samples.

  • Chunk – container for records, e.g., a miniSEED file, with the following properties:

    • overall, theoretical time range of records it may contain

    • contains at least one record, otherwise it must be absent

    • each record of a chunk must fulfill the following conditions:

      • chunk start <= record start < chunk end

      • chunk start < record end < next chunk end

    • chunks do not overlap; the end time of the current chunk equals the start time of the successive chunk, otherwise a chunk gap is declared

    • records may occur unordered within a chunk or across chunk boundaries, resulting in DataSegments marked as outOfOrder
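The chunk conditions above can be expressed as a small check. The helper name and the numeric time representation are illustrative only, not part of scardac:

```python
def record_fits_chunk(rec_start, rec_end, chunk_start, chunk_end, next_chunk_end):
    # A record belongs to a chunk if it starts inside the chunk and
    # ends before the end of the successive chunk (see conditions above).
    return (chunk_start <= rec_start < chunk_end
            and chunk_start < rec_end < next_chunk_end)
```

For a day-file archive, chunk_start and chunk_end would be consecutive day boundaries.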

  • Jitter – maximum allowed deviation between the end time of the current record and the start time of the next record, in multiples of the current record’s sample time. E.g., a sampling rate of 100 Hz and a jitter of 0.5 allow a maximum end-to-start time difference of 5 ms. If it is exceeded, a new DataSegment is created.
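As a sketch of the jitter rule (the function and its signature are illustrative, not the scardac API):

```python
def same_segment(prev_end, next_start, sampling_rate, jitter=0.5):
    # Tolerance is jitter multiples of the sample time,
    # e.g. 0.5 sample periods at 100 Hz = 5 ms.
    tolerance = jitter / sampling_rate
    return abs(next_start - prev_end) <= tolerance
```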

  • Mtime – time the content of a chunk was last modified. It is used to

    • decide whether a chunk needs to be read in a secondary application run

    • calculate the updated time stamp of a DataSegment, DataAttributeExtent and DataExtent

  • Scan window – time window limiting the synchronization of the archive with the database, configured via filter.time.start and filter.time.end or the command-line options --start and --end, respectively. The scan window is useful to

    • reduce the scan time of larger archives. Depending on the size and storage type of the archive it may take some time to just list available chunks and their mtime.

    • prevent deletion of availability information even though parts of the archive have been deleted or moved to a different location

  • Modification window – the mtime of a chunk is compared with this time window to decide whether it needs to be read or not. It is configured via mtime.start and mtime.end, respectively --modified-since and --modified-until. If no lower bound is defined, the lastScan time stored in the DataExtent is used instead. The mtime check may be disabled using mtime.ignore or --deep-scan. Note: Chunks directly before or after a chunk gap are read in any case, regardless of the mtime settings.
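The read decision can be sketched as follows; the parameter names are illustrative, not the scardac internals:

```python
def needs_read(chunk_mtime, mod_start=None, mod_end=None,
               near_gap=False, deep_scan=False):
    # Chunks adjacent to a chunk gap, or a deep scan, force a read
    # regardless of the modification window.
    if deep_scan or near_gap:
        return True
    if mod_start is not None and chunk_mtime < mod_start:
        return False
    if mod_end is not None and chunk_mtime > mod_end:
        return False
    return True
```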

Workflow

  1. Read existing DataExtents from database.

  2. Collect a list of available stream IDs either by

    • scanning the archive for available IDs or

    • reading an ID file defined by nslcFile.

  3. Identify extents to add, update or remove respecting scan window, filter.nslc.include and filter.nslc.exclude.

  4. Subsequently process the DataExtents using the number of parallel threads configured by threads. For each DataExtent:

    1. Collect all available chunks within scan window.

    2. If the DataExtent is new (no database entry yet), store a new and empty DataExtent to database, else query existing DataSegments from the database:

      • count segments outside scan window

      • create a database iterator for extents within scan window

    3. Create two in-memory segment lists which collect segments to remove and segments to add/update

    4. For each chunk

      • determine the chunk window and mtime

      • decide whether the chunk needs to be read depending on the mtime and a possible chunk gap. If necessary, read the chunk and

        • create chunk segments by analyzing the chunk records for gaps/overlaps defined by jitter, sampling rate or quality changes

        • merge chunk segments with database segments and update the in-memory segment lists.

        If not necessary, advance the database segment iterator to the end of the chunk window.

    5. Remove and then add/update the collected segments.

    6. Merge segment information into DataAttributeExtents

    7. Merge DataAttributeExtents into overall DataExtent
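The segmentation performed while reading chunks (step 4) can be sketched with a toy function that groups records into continuous segments; the record representation and all names are illustrative, not the scardac internals:

```python
def build_segments(records, jitter=0.5):
    # records: (start, end, sampling_rate, quality) tuples, times in seconds
    segments = []
    for start, end, rate, quality in records:
        if segments:
            seg = segments[-1]
            tolerance = jitter / seg["rate"]
            if (seg["quality"] == quality and seg["rate"] == rate
                    and abs(start - seg["end"]) <= tolerance):
                seg["end"] = end  # record continues the current segment
                continue
        # quality/rate change or gap beyond jitter: start a new segment
        segments.append({"start": start, "end": end,
                         "rate": rate, "quality": quality})
    return segments
```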

Examples

  1. Get command line help or execute scardac with default parameters and informative debug output:

    scardac -h
    scardac --debug
    
  2. Synchronize the availability of waveform data files existing in the standard SDS archive with the SeisComP database and create an XML file using scxmldump:

    scardac -d mysql://sysop:sysop@localhost/seiscomp -a $SEISCOMP_ROOT/var/lib/archive --debug
    scxmldump -Yf -d mysql://sysop:sysop@localhost/seiscomp -o availability.xml
    
  3. Synchronize the availability of waveform data files existing in the standard SDS archive with the SeisComP database. Use fdsnws to fetch a flat file containing a list of periods of available data from stations of the CX network sharing the same quality and sampling rate attributes:

    scardac -d mysql://sysop:sysop@localhost/seiscomp -a $SEISCOMP_ROOT/var/lib/archive
    wget -O availability.txt 'http://localhost:8080/fdsnws/ext/availability/1/query?network=CX'
    

    Note

    The SeisComP module fdsnws must be running for executing this example.

Module Configuration

etc/defaults/global.cfg
etc/defaults/scardac.cfg
etc/global.cfg
etc/scardac.cfg
~/.seiscomp/global.cfg
~/.seiscomp/scardac.cfg

scardac inherits global options.

archive

Default: @SEISCOMP_ROOT@/var/lib/archive

Type: string

The URL to the waveform archive where all data is stored.

Format: [service://]location[#type]

"service": The type of the archive. If not given, "sds://" is implied assuming an SDS archive. The SDS archive structure is defined as YEAR/NET/STA/CHA.TYPE/NET.STA.LOC.CHA.TYPE.YEAR.DAYOFYEAR, e.g., 2018/GE/APE/BHZ.D/GE.APE..BHZ.D.2018.125

Other archive types may be considered by plugins.
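For an SDS archive the chunk paths follow the layout described above; a minimal sketch of the path construction:

```python
from datetime import date

def sds_path(net, sta, loc, cha, day, dtype="D"):
    # YEAR/NET/STA/CHA.TYPE/NET.STA.LOC.CHA.TYPE.YEAR.DAYOFYEAR
    doy = day.timetuple().tm_yday
    return (f"{day.year}/{net}/{sta}/{cha}.{dtype}/"
            f"{net}.{sta}.{loc}.{cha}.{dtype}.{day.year}.{doy:03d}")

print(sds_path("GE", "APE", "", "BHZ", date(2018, 5, 5)))
# 2018/GE/APE/BHZ.D/GE.APE..BHZ.D.2018.125
```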

threads

Default: 1

Type: int

Number of threads scanning the archive in parallel.

jitter

Default: 0.5

Type: float

Acceptable deviation between the end time of one record and the start time of the successive record, in multiples of sample time.

maxSegments

Default: 1000000

Type: int

Maximum number of segments per stream. If the limit is reached no more segments are added to the database and the corresponding extent is flagged as too fragmented. Set this parameter to 0 to disable any limits.

nslcFile

Type: string

Line-based text file of form NET.STA.LOC.CHA defining available stream IDs. Depending on the archive type, size and storage media used this file may offer a significant performance improvement compared to collecting the available streams on each startup. Filters defined under filter.nslc still apply.
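A hypothetical nslcFile could look like this (one stream ID per line; the IDs are examples only):

```
GE.APE..BHZ
GE.APE..BHN
CX.PB01..HHZ
```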

Note

filter.* Parameters of this section limit the data processing either to reduce the scan time of larger archives or to prevent deletion of availability information even though parts of the archive have been deleted or moved to a different location.

Note

filter.time.* Limit the processing by record time.

filter.time.start

Type: string

Start of data availability check given as date string or as number of days before now.

filter.time.end

Type: string

End of data availability check given as date string or as number of days before now.
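For example, a hypothetical scardac.cfg snippet restricting the scan window to the last 30 days up to yesterday:

```
filter.time.start = 30
filter.time.end = 1
```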

Note

filter.nslc.* Limit the processing by stream IDs.

filter.nslc.include

Type: list:string

Comma-separated list of stream IDs to process. If empty all streams are accepted unless an exclude filter is defined. The following wildcards are supported: ‘*’ and ‘?’.

filter.nslc.exclude

Type: list:string

Comma-separated list of stream IDs to exclude from processing. Excludes take precedence over includes. The following wildcards are supported: ‘*’ and ‘?’.
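A hypothetical filter configuration processing all CX broadband streams except station PB02:

```
filter.nslc.include = CX.*.*.BH?
filter.nslc.exclude = CX.PB02.*.*
```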

Note

mtime.* Parameters of this section control the rescan of data chunks. By default the last update time of the extent is compared with the record file modification time to read only files modified since the last run.

mtime.ignore

Default: false

Type: boolean

If set to true all data chunks are read independent of their mtime.

mtime.start

Type: string

Only read chunks modified after a specific date given as date string or as number of days before now.

mtime.end

Type: string

Only read chunks modified before a specific date given as date string or as number of days before now.

Command-Line Options

scardac [OPTION]...

Generic

-h, --help

Show help message.

-V, --version

Show version information.

--config-file arg

Use alternative configuration file. When this option is used the loading of all stages is disabled. Only the given configuration file is parsed and used. To use another name for the configuration, create a symbolic link of the application or copy it. Example: scautopick -> scautopick2.

--plugins arg

Load given plugins.

Verbosity

--verbosity arg

Verbosity level [0..4]. 0:quiet, 1:error, 2:warning, 3:info, 4:debug.

-v, --v

Increase verbosity level (may be repeated, e.g., -vv).

-q, --quiet

Quiet mode: no logging output.

--print-component arg

For each log entry print the component right after the log level. By default the component output is enabled for file output but disabled for console output.

--component arg

Limit the logging to a certain component. This option can be given more than once.

-s, --syslog

Use syslog logging backend. The output usually goes to /var/log/messages.

-l, --lockfile arg

Path to lock file.

--console arg

Send log output to stdout.

--debug

Execute in debug mode. Equivalent to --verbosity=4 --console=1.

--trace

Execute in trace mode. Equivalent to --verbosity=4 --console=1 --print-component=1 --print-context=1.

--log-file arg

Use alternative log file.

Collector

-a, --archive arg

Overrides configuration parameter archive.

--threads arg

Overrides configuration parameter threads.

-j, --jitter arg

Overrides configuration parameter jitter.

--nslc arg

Overrides configuration parameter nslcFile.

--start arg

Overrides configuration parameter filter.time.start.

--end arg

Overrides configuration parameter filter.time.end.

--include arg

Overrides configuration parameter filter.nslc.include.

--exclude arg

Overrides configuration parameter filter.nslc.exclude.

--deep-scan

Overrides configuration parameter mtime.ignore.

--modified-since arg

Overrides configuration parameter mtime.start.

--modified-until arg

Overrides configuration parameter mtime.end.

--generate-test-data arg

Do not scan the archive but generate test data for each stream in the inventory. Format: days,gaps,gapslen,overlaps,overlaplen. E.g., the following parameter list would generate test data for 100 days (starting from now()-100) which includes 150 gaps with a length of 2.5s followed by 50 overlaps with an overlap of 5s: --generate-test-data=100,150,2.5,50,5