
metanorm's Introduction


Metadata normalizing tool

The purpose of this tool is to extract a defined set of parameters from raw metadata. It is meant primarily for use with geo-spatial datasets, but can be extended to process any kind of data.

Principle

Input: raw metadata attributes in the form of a dictionary

Output: a dictionary in which normalized parameter names are associated with values found in the raw metadata.

The actual work of extracting attribute values from the raw metadata is done by normalizers.

Each normalizer is a class able to deal with a particular type of metadata. In the case of geo-spatial datasets, a normalizer is typically able to deal with the metadata format of a particular data provider.

Usage

Although normalizers can be used directly, the easiest way to normalize metadata is to use a MetadataHandler. A metadata handler is initialized using a base normalizer class.

When trying to normalize metadata (using the handler's get_parameters() method), the handler tries all normalizers which inherit from this base class. The first normalizer which is able to deal with the metadata is used.

To determine if a normalizer is able to deal with a dictionary of raw metadata, the handler calls its check() method.
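
Conceptually, the selection works roughly like this (a simplified sketch, not the actual implementation of MetadataHandler):

class MetadataHandler():
    """Simplified illustration of how a handler picks a normalizer"""

    def __init__(self, base_normalizer_class):
        # instantiate every normalizer which inherits from the base class
        self.normalizers = [cls() for cls in base_normalizer_class.__subclasses__()]

    def get_parameters(self, raw_metadata):
        # the first normalizer whose check() accepts the metadata is used
        for normalizer in self.normalizers:
            if normalizer.check(raw_metadata):
                return normalizer.get_parameters(raw_metadata)
        raise ValueError('No normalizer found for this metadata')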

Example to normalize data for use in django-geo-spaas:

import metanorm.handlers as handlers
import metanorm.normalizers as normalizers

metadata_to_normalize = {
  'foo': 'bar',
  'baz': 'qux'
}

m = handlers.MetadataHandler(normalizers.geospaas.GeoSPaaSMetadataNormalizer)
normalized_metadata = m.get_parameters(metadata_to_normalize)


metanorm's Issues

Add normalizers for OSISAF data

This is connected to an issue in the harvesters: nansencenter/django-geo-spaas-harvesting#16

Normalizers for metadata from ingestors (similar to SentinelSAFEMetadataNormalizer) and from the identifier (similar to SentinelOneIdentifierMetadataNormalizer) should be added.

The identifier normalizer should add the following parameter:
standard_name = sea_ice_area_fraction

It is important to keep in mind extensibility, so that more sources of data and more parameters from OSISAF can be added later.
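
For illustration, the identifier normalizer's parameter method might look roughly like this (the method name, the identifier check and the return structure are assumptions based on the description above, not the actual metanorm API):

def get_dataset_parameters(self, raw_attributes):
    """Return the dataset parameters deduced from an OSISAF identifier"""
    if 'osisaf' in raw_attributes.get('Identifier', '').lower():
        # in the real implementation a pythesint vocabulary lookup
        # would probably be used instead of a literal dictionary
        return [{'standard_name': 'sea_ice_area_fraction'}]
    return []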

Datasets with identical metadata may link with multiple URIs

In the case where two files have the same time, the same coordinates and the same other parameters (e.g. GW1AM2__01D_EQOA and GW1AM2__01D_EQOA), only one Dataset is created, with links to the two URIs.

Instead, two Datasets should be created, each pointing to one of the two files (which have different filenames).

Add information to the summary field

The summary field should include:

  • the processing level (L1, L2, etc...)
  • for models, the name of the model

These should be added with a structure eliminating possible ambiguities when appending to an existing summary.
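
For example, a labeled, delimiter-separated structure would keep the fields unambiguous when more of them are appended later (the labels, separator and values below are only illustrative, not a settled format):

raw_summary = 'Sea ice concentration from passive microwave data'  # example description
summary = ';'.join([
    'Description: ' + raw_summary,
    'Processing level: L2',
    'Product: some_model_name',  # for model data: the name of the model
])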

Add normalizer for Radarsat2 CSV file from CSA

I've got a CSV file from Canada with metadata like this:

Result Number,Satellite,Date,Beam Mode,Polarization,Type,Image Id,Image Info,Metadata,Reason,Sensor Mode,Orbit Direction,Order Key,SIP Size (MB),Service UUID,Footprint,Look Orientation,Band,Title,Options,Absolute Orbit,Orderable
1,RADARSAT-2,2020-07-16 02:14:27 GMT,ScanSAR Wide A (W1 W2 W3 S7),HH HV,SGF,831163,"{""headers"":[""Product Type"" ""LUT Applied"" ""Sampled Pixel Spacing (Panchromatic)"" ""Product Format"" ""Geodetic Terrain Height""] ""relatedProducts"":[{""values"":[""SGF"" ""Ice"" ""100.0"" ""GeoTIFF"" ""0.00186""]}] ""collectionID"":""Radarsat2"" ""imageID"":""7337877""}",dummy value,,ScanSAR Wide,Ascending,RS2_OK121511_PK1076349_DK1021326_SCWA_20200716_021427_HH_HV_SGF,84,SERVICE-RSAT2_001-000000000000000000,-146.008396 73.905427 -143.459486 72.212173 -127.936480 73.451549 -128.875274 75.249738 -146.008396 73.905427 ,Right,C,rsat2_20200716_N7370W13656,,65706,TRUE

I want to make a small ingester for this kind of file (maybe as part of harvesting), but I need metanorm to interpret this metadata correctly.
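
As a starting point, a small reader for this kind of file could look like the sketch below (purely illustrative; the column names are taken from the header of the example above, and the output keys are assumptions):

import csv
from datetime import datetime, timezone

def read_csa_csv(path):
    """Yield one raw attribute dictionary per row of a CSA Radarsat-2 CSV file"""
    with open(path, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            # '2020-07-16 02:14:27 GMT' -> timezone-aware datetime
            row['time_coverage_start'] = datetime.strptime(
                row['Date'].replace(' GMT', ''), '%Y-%m-%d %H:%M:%S'
            ).replace(tzinfo=timezone.utc)
            # 'lon lat lon lat ...' -> WKT polygon
            coordinates = row['Footprint'].split()
            points = ', '.join(
                f'{coordinates[i]} {coordinates[i + 1]}'
                for i in range(0, len(coordinates), 2))
            row['wkt_footprint'] = f'POLYGON(({points}))'
            yield row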

Wrong time coverage in the URL normalizer

The time coverage for the following dataset is wrong: ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/nrt_global_allsat_phy_l4_20201101_20201104.nc

When downloading the dataset and looking at its metadata, the time coverage is the following:

  • time_coverage_start=2020-10-31T12:00:00Z
  • time_coverage_end=2020-11-01T12:00:00Z

But the time coverage returned by the normalizer is 2020-11-01T00:00:00Z to 2020-11-02T00:00:00Z.

The time coverage must be checked for all sources handled by the URL normalizer.
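
A quick way to check each case is to compare the file's own attributes with the normalizer's output, for example with the netCDF4 package (assuming the file has been downloaded locally):

import netCDF4

dataset = netCDF4.Dataset('nrt_global_allsat_phy_l4_20201101_20201104.nc')
# global attributes as stored in the file
print(dataset.time_coverage_start)  # 2020-10-31T12:00:00Z
print(dataset.time_coverage_end)    # 2020-11-01T12:00:00Z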

Add new normalizer based on filename

Develop Sentinel1FilenameMetadataNormalizer that can retrieve

  • entry_id
  • platform
  • instrument
  • time_coverage_start
  • time_coverage_end
  • provider

from the filename.

Example filename:
ftp://ftp.nersc.no/nansat/test_data/sentinel1_l1/S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip

And an (incomplete) example of a unit test:

import unittest

import metanorm.normalizers as normalizers


class SentinelFilenameMetadataNormalizerTests(unittest.TestCase):
    """Tests for the SentinelFilenameMetadataNormalizer"""

    def setUp(self):
        self.normalizer = normalizers.SentinelFilenameMetadataNormalizer([])

    def test_platform(self):
        """Shall return the platform deduced from the filename"""
        attributes = {'filename': 'S1A_...'}
        # 'pythesint_like_representation' is a placeholder for the expected
        # pythesint entry for 'Sentinel-1A'
        self.assertEqual(
            self.normalizer.get_platform(attributes),
            pythesint_like_representation('Sentinel-1A'))
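
For reference, the filename itself could be parsed with a regular expression along these lines (a sketch only; the pattern follows the standard Sentinel-1 naming convention and the function name is hypothetical):

import re
from datetime import datetime, timezone

SENTINEL1_FILENAME_PATTERN = re.compile(
    r'^(?P<platform>S1[AB])_'
    r'(?P<mode>[A-Z0-9]{2})_'
    r'(?P<product_type>[A-Z0-9]{4})_'
    r'(?P<level_class_polarisation>[A-Z0-9]{4})_'
    r'(?P<start>\d{8}T\d{6})_'
    r'(?P<stop>\d{8}T\d{6})_'
    r'(?P<orbit>\d{6})_'
    r'(?P<mission_data_take>[0-9A-F]{6})_'
    r'(?P<unique_id>[0-9A-F]{4})'
)

def parse_sentinel1_filename(filename):
    """Return a dictionary of fields parsed from a Sentinel-1 file name
    (the base file name only, without path or URL prefix)
    """
    match = SENTINEL1_FILENAME_PATTERN.match(filename)
    if not match:
        return None
    fields = match.groupdict()
    for key in ('start', 'stop'):
        fields[key] = datetime.strptime(
            fields[key], '%Y%m%dT%H%M%S').replace(tzinfo=timezone.utc)
    return fields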

Refactor the URL normalizer's time coverage code

The code dealing with time coverage in the URL normalizer is unreadable and repetitive.
It should be rewritten in a clearer way.

There are also mistakes to fix in the time coverage for some sources.

Sometimes get_dataset_parameter returns None!

I think it would be good to augment the code on this line:

new_members = getattr(self, 'get_' + param)(raw_attributes)

so that it becomes:

new_members = getattr(self, 'get_' + param)(raw_attributes) or []

The reason is that if get_dataset_parameter does not return anything in a normalizer, new_members becomes None, which causes an error on the next line:

for new_member in new_members:

and raises an error during the harvesting process.

So, for the safety of the harvesting process, I think it would be good to give it a second chance to become [] by using Python's or, so that the NoneType error is never raised.
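
In context, the proposed change would look roughly like this (the loop is reproduced from the snippets above, with the same variable names):

# 'or []' guarantees that the loop below never receives None
new_members = getattr(self, 'get_' + param)(raw_attributes) or []
for new_member in new_members:
    ...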

I have seen this error when I accidentally wrote an incorrect get_dataset_parameter which sometimes returned None.

What is your feedback @aperrin66 ?

Python data types revisited

@aperrin66
@akorosov

As mentioned in nansencenter/django-geo-spaas-harvesting#40, Python lists are only needed when we want to preserve the order of the elements. When the order does not matter, there is no need to use a list.

Sets should be used when we need to add elements afterwards.
Tuples should be used when the collection does not need to change afterwards.

In terms of execution speed, the preferred order of data types is as follows:

1. tuple
2. set
3. list
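
A small illustration of this guideline (the variable names are only examples):

PLATFORMS = ('Sentinel-1A', 'Sentinel-1B')    # fixed collection: tuple
found_parameters = set()                      # grows later, order irrelevant: set
found_parameters.add('sea_ice_area_fraction')
time_coverage = ['2020-11-01', '2020-11-02']  # order matters: list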

Migrate the Sentinel SAFE normalizer

Migrate the Sentinel SAFE (Copernicus scihub format) normalizer based on sentinel_safe.py and sentinel1_identifier.py.
This will probably result in several normalizers inheriting from a base Sentinel SAFE normalizer.

Simplification of metanorm

Metanorm was originally designed this way: each normalizer takes care of one metadata convention, then passes responsibility for the attributes it could not fill to the next normalizer.

Looking at the state of the code now, it appears that this is only applicable in some rare cases. The metadata conventions are followed so loosely and vary so much from one metadata provider to the next that most normalizers end up being specific to a provider.

This results in weird and/or inefficient code which must reconcile real world cases with the original design of metanorm.

We could probably make the code both simpler and more efficient by having a structure like this (a rough skeleton is sketched after the list):

  • one normalizer per provider
  • each normalizer has a method that can tell from the raw attributes if the normalizer can be used
  • each normalizer provides the necessary methods to fill all the attributes
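
Such a provider normalizer could look roughly like this (class, method and attribute names are illustrative, not the final API):

import dateutil.parser

class SomeProviderMetadataNormalizer():
    """Normalizer for metadata from a hypothetical provider"""

    def check(self, raw_metadata):
        """Tell whether this normalizer can handle the given raw metadata"""
        return raw_metadata.get('url', '').startswith('https://data.provider.example/')

    def get_time_coverage_start(self, raw_metadata):
        return dateutil.parser.parse(raw_metadata['start_date'])

    # ...one get_*() method per attribute to normalize...

    def get_parameters(self, raw_metadata):
        """Build the dictionary of normalized attributes"""
        return {'time_coverage_start': self.get_time_coverage_start(raw_metadata)}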

UPDATE:
The base structure is in place, and I have migrated the Creodias normalizer to provide a simple example.

Here are the remaining normalizers to migrate/create (hopefully I did not forget any):

For each of these, please create a branch from issue81_simplification_refactoring, make the modifications, and open a pull request with issue81_simplification_refactoring as the target branch.

The new normalizers will be put in the metanorm/normalizers/geospaas/ folder.

The new Creodias normalizer can be taken as an example.

Once all the normalizers have been migrated, we can remove the old base classes and move on to adapting geospaas_harvesting.

Migrate AVISO normalizer

Migrate AVISO normalizer based on aviso.py.
It should also be generalized, as it only deals with one file for now.

Metanorm depends on Django

Normalizers create GEOSGeometry instances and return them to the harvesting code, which creates a dependency on Django. To reduce dependencies (and make metanorm easier to integrate into other software, e.g. Nansat), the GEOSGeometry should be created in the harvesters, and metanorm should return only strings.
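
A minimal sketch of the proposed split (variable names are assumptions): metanorm returns a WKT string, and the harvester, which already depends on Django, builds the geometry from it.

from django.contrib.gis.geos import GEOSGeometry

# In metanorm: the normalizer returns a plain WKT string instead of a geometry object
wkt_footprint = 'POLYGON((-180 -90, -180 90, 180 90, 180 -90, -180 -90))'

# In geospaas_harvesting: build the GEOSGeometry from the WKT string
geometry = GEOSGeometry(wkt_footprint, srid=4326)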

Add support for new CMEMS products in the URL normalizer

  • MEDSEA_ANALYSIS_FORECAST_PHY_006_013
  • IBI_ANALYSIS_FORECAST_PHYS_005_001
  • INSITU_GLO_UV_NRT_OBSERVATIONS_013_048

Edit: it turns out we will probably need to download the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product and harvest it locally, so it will be the subject of another issue.

When using pti, which list should be used?

As far as I know, the pti repository uses two lists, and these two lists sometimes provide different outputs for the same query. For example, one list provides one result while the other provides a different one (see the two screenshots attached to the issue).

We need to decide which one should be used in the future. This inconsistency causes problems, especially when working with the cumulative parameters in the corresponding repetitive mechanism that we recently developed in the metanorm repo.

Up to now, we have used pti's search_cf_standard_name_list method in the metanorm normalizers, or sometimes its get_cf_standard_name method.
But if we use the get_wkt_variable method, it will cause the above-mentioned inconsistency and may assign duplicate parameters (with the same standard name) to the same dataset.

Please advise on the usage of pti, dear @akorosov.
