STAC builder for Tuulituhohaukka

Devcontainer

This application uses VS Code devcontainers to normalize the development environment. The project can also be developed and built without devcontainers, but using the devcontainer guarantees that every developer works in the same development and build environment.

You can read more about devcontainers in the VS Code documentation.

Run STAC builder

How does it work?

STAC builder builds a SpatioTemporal Asset Catalog (STAC) that follows version 1.0.0 of the STAC specification. The builder reads URLs from the Amazon S3 service, extracts the necessary information and builds item objects. Then, based on the existing items, the dataset-time and dataset collections are created.

The items are written locally to your computer. The item builder (item_builder.py) reads the item.template field from the configuration file and uses it to complete the items. If you want to add information to the item objects, you can modify this field in the configuration file. A new item is created whenever an asset with a previously unseen timestamp is found. An existing item is updated when new assets or bands belonging to it are found. Existing items are never rewritten in any other way than by adding assets or bands to them. Note: This means that when you run the script, it only processes items and assets that do not exist yet. It also means that if you change the item template, you need to delete all existing items for the change to take effect in them. The tool can also process metadata from .dim files and include it in the resulting items.
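The add-only update behaviour described above can be illustrated with a minimal sketch. This is not the code in item_builder.py: the function name `merge_asset`, the in-memory `items` dictionary and the example URLs are all hypothetical, and the real tool works on item files on disk.

```python
def merge_asset(items, item_id, asset_key, asset):
    """Add an asset to an item, creating the item first if it is missing.

    Returns True if something was added, False if the asset already existed.
    (Hypothetical helper, for illustration only.)
    """
    item = items.setdefault(item_id, {"id": item_id, "assets": {}})
    if asset_key in item["assets"]:
        return False  # existing content is left untouched
    item["assets"][asset_key] = asset
    return True


items = {}
merge_asset(items, "S1_20200101", "VV", {"href": "https://example.invalid/a_VV.tif"})
merge_asset(items, "S1_20200101", "VH", {"href": "https://example.invalid/a_VH.tif"})

# Seeing the same asset again changes nothing, even if the data differs.
changed = merge_asset(items, "S1_20200101", "VV", {"href": "https://example.invalid/other.tif"})
```

Because an existing entry is never overwritten, a changed template cannot reach items that were written earlier, which is why those items have to be deleted first.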

The dataset and dataset-time collections are rewritten every time you run the script. The catalog builder (catalog_builder.py) reads the dataset.template and dataset-time.template fields from the configuration file and uses them to complete the collections. If you want to add information to the collection objects, you can modify these fields in the configuration file.

How do you use it?

You can build the SpatioTemporal Asset Catalog (STAC) by running the Python script stac_builder.py.

Run the script with the location of the dataset configuration file and the location of the S3 configuration file as arguments:

$ python stac_builder.py "dataset_config_file_location" "s3_config_file_location"

For example:

$ python stac_builder.py configuration_files/conf_Korjuukelpoisuus.json configuration_files/s3-config.json

Run S3 connection test

You can check if your S3 connection works by running s3_connection_test.py.

Run the script with the location of the dataset configuration file and the location of the S3 configuration file as arguments:

$ python s3_connection_test.py "dataset_config_file_location" "s3_config_file_location"

For example:

$ python s3_connection_test.py configuration_files/conf_Korjuukelpoisuus.json configuration_files/s3-config.json

Configuration files

S3 config

S3 connection information is passed to stac_builder in a separate configuration file. The file is a JSON document with the following fields:

Field	Description
aws_access_key_id	S3 access key (generated by AWS/CEPH or other S3 compatible platform)
aws_secret_access_key	The secret part of the S3 access key
endpoint_url	S3 endpoint URL to use when connecting to S3
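As an illustration of the file layout, the sketch below parses an S3 configuration and checks that the three fields from the table are present. The helper name `parse_s3_config` and all values are placeholders, not part of the tool. The field names match keyword arguments accepted by boto3's client constructor, but whether stac_builder passes them on in that way is an assumption.

```python
import json

REQUIRED_S3_FIELDS = ("aws_access_key_id", "aws_secret_access_key", "endpoint_url")


def parse_s3_config(text):
    """Parse S3 configuration JSON and check that the documented fields exist."""
    config = json.loads(text)
    missing = [field for field in REQUIRED_S3_FIELDS if field not in config]
    if missing:
        raise ValueError("S3 config is missing: " + ", ".join(missing))
    return config


# Placeholder values only; real keys come from your S3 provider.
example = """
{
  "aws_access_key_id": "EXAMPLE_KEY_ID",
  "aws_secret_access_key": "EXAMPLE_SECRET",
  "endpoint_url": "https://s3.example.invalid"
}
"""
s3_config = parse_s3_config(example)
```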

Dataset configuration

Field	Datatype	Description
datasetId	string	Unique identifier for the dataset. Preferably short, readable and without special characters or whitespace
source.s3Bucket	string	Name of the S3 bucket where assets are stored
source.s3Prefixes	[string]	A list of S3 prefixes where assets are stored
source.publicUrlPrefix	string	Public URL endpoint where assets are accessible via HTTP(s). Used for asset links
source.gdalUrlPrefix	string (optional)	Optional URL prefix used to access the assets during the cataloguing process. Useful when for example networking conditions prohibit using the public facing URL on a server.
destination.localItemPath	string	Local filesystem path where STAC items are written and read from
destination.localCatalogPath	string	Local filesystem path where STAC catalogues are written and read from
destination.catalogBaseUrl	string	Public URL prefix where catalog files will be accessible from. Used to produce links between STAC documents.
destination.itemBaseUrl	string	Public URL prefix where item files will be accessible from. Used to produce links between STAC documents.
blacklist	[string]	List of assets that should not be processed when producing this STAC catalog. URLs formatted: publicUrlPrefix + S3 key
dataset.template	object	STAC catalogue template that is used to produce the JSON files for dataset catalogues
dataset-time.template	object	STAC catalogue template that is used to produce the JSON files for dataset-time catalogues
dataset-time.timeFrame	"week", "month" or "year"	Whether dataset-time catalogues are produced for every week, month or year
item.fileNamingConvention	string (python regex)	Regex rule that is used to 1) determine which assets should be read into the catalogue, 2) determine information about that asset. More about this below under "Item file naming convention"
item.idTemplate	string	String template for producing unique identifiers for assets. More about this below under "Item ID template"
item.roles	"data" or .. ?	The value for the "role" attribute in asset links in STAC items
item.template	object	STAC item template that is used to produce the JSON files
item.metadata	object (optional)	Optional object for item metadata extraction from .dim files. More about this below under "Item metadata extraction"
item.mosaicDuration	object (optional)	Optional object that supplies expected minimum and maximum durations for items that should belong in this dataset. More about this under "Item mosaic duration"
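The dotted field names in the table suggest nested JSON objects (for example, source.s3Bucket would be the key s3Bucket inside the object source). Under that assumption, a small lookup helper could look like the sketch below. The helper `get_field` and every value in `dataset_config` are hypothetical.

```python
def get_field(config, dotted_name, default=None):
    """Look up a dotted field name such as 'source.s3Bucket' in a nested dict."""
    node = config
    for part in dotted_name.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Made-up fragment of a dataset configuration, assuming nested objects.
dataset_config = {
    "datasetId": "example_dataset",
    "source": {"s3Bucket": "example-bucket", "s3Prefixes": ["example/prefix/"]},
    "dataset-time": {"timeFrame": "month"},
}

bucket = get_field(dataset_config, "source.s3Bucket")
time_frame = get_field(dataset_config, "dataset-time.timeFrame")
gdal_prefix = get_field(dataset_config, "source.gdalUrlPrefix")  # optional, absent here
```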

Item file naming convention

File naming convention is supplied as a regex pattern that should match the names of asset files (not including paths). This pattern contains named groups that are used to derive information about the assets.

group name	meaning
startdate	the starting date of the time the asset covers
starttime	the wall clock time (hours, minutes) of the asset start time (optional)
enddate	the end date of the time the asset covers (optional)
endtime	the wall clock time (hours, minutes) of the asset end time (optional)
band	which band this asset contains
tile	id of tile (optional, only makes sense for tiled datasets)
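To show how named groups carry information out of a file name, here is a minimal sketch using Python's re module. The pattern, the file names and the YYYYMMDD date format are assumptions made for this example, not the convention of any real dataset, and `parse_asset_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Hypothetical naming convention: example_<startdate>_<enddate>_<band>.tif
pattern = re.compile(
    r"^example_(?P<startdate>[0-9]{8})_(?P<enddate>[0-9]{8})_(?P<band>[A-Za-z0-9]+)\.tif$"
)


def parse_asset_name(filename):
    """Return the named groups of a matching file name, or None if it does not match."""
    match = pattern.match(filename)
    if match is None:
        return None  # files that do not match are not read into the catalogue
    info = match.groupdict()
    info["startdate"] = datetime.strptime(info["startdate"], "%Y%m%d").date()
    info["enddate"] = datetime.strptime(info["enddate"], "%Y%m%d").date()
    return info


parsed = parse_asset_name("example_20200101_20200110_VV.tif")
skipped = parse_asset_name("readme.txt")
```

The same pattern therefore does both jobs described above: a file that does not match is skipped, and a file that matches yields its dates and band.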

Item ID template

The item.idTemplate string is used to produce unique identifiers for items. The string value may contain special words that will be replaced by data derived from the item file. The following special values are supported:

word	meaning
startdate	The start date of the item
enddate	The end date (if specified) of the item
tile_id	The tile ID derived from the item (for example military grid tile id)
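One possible reading of this substitution is sketched below. It assumes that the special words appear verbatim in the template and are replaced in a single pass; the exact syntax the tool expects is not stated here, and `render_item_id` and the template string are hypothetical.

```python
import re

# The three special words listed in the table above.
SPECIAL_WORDS = re.compile(r"startdate|enddate|tile_id")


def render_item_id(template, values):
    """Replace each special word in the template with its value (empty if missing)."""
    return SPECIAL_WORDS.sub(lambda match: values.get(match.group(0), ""), template)


item_id = render_item_id(
    "example_startdate_enddate_tile_id",
    {"startdate": "20200101", "enddate": "20200110", "tile_id": "35VLG"},
)
```

A single regular-expression pass is used so that a substituted value is never scanned again for special words.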

Item metadata extraction

If extra metadata is needed in the STAC item files, stac_builder can read it from any XML-formatted source (e.g., .dim and .tif.aux.xml files). The configuration contains a file naming convention, as with item assets, and a number of extraction rules. The extraction rules in the soupExtraction field use BeautifulSoup syntax, and the results are written to the STAC item at the JSON path set in writeToItemField. The metadata file is also stored as an asset in the STAC item with the role metadata.

Each entry in extractionRules may contain a third field, rule. If this field is set to .get_text(), the text content of the soupExtraction match is stored; otherwise the document structure within the match is stored.

If the value of writeToItemField is metadata_field, extracted metadata is written in the metadata asset (within a field called metadata).

Caveat: the metadata files need to be located in the S3 prefix (or prefixes) listed last in source.s3Prefixes. Otherwise metadata injection is delayed: if the metadata files are processed before the GeoTIFF assets, the metadata is injected the next time the catalogue is processed.

  "metadata": {
    "fileNamingConventionMeta": "^S1_processed_(?P<startdate>[1-2][0-9]{3}[0-1][0-9][0-3][0-9])_(?P<starttime>[0-9]{6})_(?P<endtime>[0-9]{6})_(?P<ratakierto>.{13}).dim$",
    "extractionRules": [{
      "writeToItemField": "properties.orbit",
      "soupExtraction": "Dimap_Document > Dataset_Sources > MDElem[name='metadata'] > MDElem[name='Abstracted_Metadata'] > MDATTR[name='PASS']"
    }]
  }
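In the example above, writeToItemField is a dot-separated path (properties.orbit). Assuming that each part of the path names a nested JSON object, writing an extracted value into an item could be sketched as follows. The helper `write_to_item_field` and the value "ASCENDING" are illustrative, and the BeautifulSoup extraction step itself is left out.

```python
def write_to_item_field(item, dotted_path, value):
    """Set item[a][b][c] = value for the path 'a.b.c', creating objects as needed."""
    *parents, leaf = dotted_path.split(".")
    node = item
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return item


item = {"id": "example", "properties": {"datetime": "2020-01-01T00:00:00Z"}}

# "ASCENDING" stands in for whatever the soupExtraction rule matched.
write_to_item_field(item, "properties.orbit", "ASCENDING")
```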

Item mosaic duration

Optional object that supplies expected minimum and maximum durations for items that should belong in this dataset. Useful when the same S3 location contains a mix of mosaic assets that follow the same file naming convention but need to be split into different catalogues based on the mosaic duration of the asset. Durations are given in ISO 8601 duration format.

Example of a rule matching items with a duration ranging from 5 to 10 days:

  "mosaicDuration": {
      "min": "P5D",
      "max": "P10D"
  }
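A rule like this can be checked with a few lines of standard-library Python. The sketch below is deliberately simplified: it only understands whole-day and whole-week durations such as P5D or P2W, it measures the duration as end date minus start date, and it treats both bounds as inclusive. All three choices are assumptions, and the function names are hypothetical.

```python
import re
from datetime import date, timedelta


def parse_duration(text):
    """Parse a simple ISO 8601 duration of the form PnD or PnW."""
    match = re.fullmatch(r"P(\d+)([DW])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(weeks=amount) if unit == "W" else timedelta(days=amount)


def matches_mosaic_duration(start, end, rule):
    """Return True if end - start lies within the rule's min and max (inclusive)."""
    duration = end - start
    return parse_duration(rule["min"]) <= duration <= parse_duration(rule["max"])


rule = {"min": "P5D", "max": "P10D"}
week_long = matches_mosaic_duration(date(2020, 1, 1), date(2020, 1, 8), rule)    # 7 days
month_long = matches_mosaic_duration(date(2020, 1, 1), date(2020, 1, 31), rule)  # 30 days
```

With this rule the seven-day mosaic is accepted and the thirty-day mosaic is left for another dataset.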

Internal structure

The main entry point to the tool is in stac_builder.py. This reads the dataset and S3 configuration files and invokes item_builder and then catalog_builder.

The item_builder iterates through all S3 prefixes and lists all files within those prefixes, producing items for the files. If it finds a file for which no item exists, it uses the template within the configuration to create a new item. If it finds a file whose band is missing from the existing STAC item, it updates the band information. If the configuration has metadata information and the process finds a metadata file, it reads the metadata and injects it into the STAC item.

The catalog_builder reads all item files and produces a two-stage catalogue in which dataset-time catalogues are created based on the configured time frame (dataset-time.timeFrame). Each dataset-time catalogue contains links to the items that intersect with its period. The dataset catalogue links to each dataset-time catalogue and includes information about the time span each dataset-time catalogue covers.
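The grouping of items into dataset-time catalogues can be sketched as follows. This is not the code in catalog_builder.py: the period key format, the use of ISO week numbers and the `(id, start, end)` item tuples are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def period_key(day, time_frame):
    """Return a label for the week, month or year that contains the given day."""
    if time_frame == "year":
        return f"{day.year}"
    if time_frame == "month":
        return f"{day.year}-{day.month:02d}"
    if time_frame == "week":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    raise ValueError(f"unknown time frame: {time_frame!r}")


def group_items(items, time_frame):
    """Map each period to the ids of the items that intersect it."""
    groups = defaultdict(list)
    for item_id, start, end in items:
        seen = set()
        day = start
        while day <= end:  # walk the item's time span one day at a time
            key = period_key(day, time_frame)
            if key not in seen:
                seen.add(key)
                groups[key].append(item_id)
            day += timedelta(days=1)
    return dict(groups)


items = [
    ("a", date(2020, 1, 28), date(2020, 2, 3)),   # spans two months
    ("b", date(2020, 2, 10), date(2020, 2, 12)),
]
by_month = group_items(items, "month")
```

An item that crosses a period boundary, like item "a" here, is linked from every period it intersects.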
