nansencenter / metanorm
Metadata normalizing tool
License: GNU General Public License v3.0
Migrate CMEMS in situ TAC normalizer based on cmems_in_situ_tac.py.
https://scihub.copernicus.eu/news/News00868
The sentinel SAFE normalizer needs to be updated.
Migrate CPOM normalizer based on CPOM.py.
return [pti.get_wkv_variable('surface_backwards_scattering_coefficient_of_radar_wave')]
The links to the THREDDS repositories for each region are listed here: https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/navoceano-hycom-glb.
Example: https://www.ncei.noaa.gov/thredds-coastal/catalog/hycom_region1/catalog.html
Needed for nansencenter/django-geo-spaas-harvesting#58
Metadata specification: https://cdn.earthdata.nasa.gov/umm/granule/v1.6.3
As mentioned in nansencenter/django-geo-spaas-harvesting#40, Python lists are only needed when we want to preserve the order of elements. When order does not matter, there is no need to use a list.
Sets should be used when we need to add more elements afterwards.
Tuples should be used when the collection does not need to change afterwards.
In terms of execution speed, the order of preference between these data types is as follows:
1. tuple
2. set
3. list
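As a quick illustration of the guideline above (all names here are hypothetical):

```python
# Fixed collection that never changes: a tuple.
SUPPORTED_CONVENTIONS = ('ACDD', 'SAFE', 'CMR')

# Collection that may grow later, where order is irrelevant: a set.
seen_parameters = {'latitude', 'longitude'}
seen_parameters.add('time')

# Order matters (e.g. the priority of normalizers): a list.
normalizer_priority = ['SentinelSAFEMetadataNormalizer', 'URLMetadataNormalizer']

# As a bonus, membership tests are O(1) on sets but O(n) on lists and tuples.
assert 'time' in seen_parameters
```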
If the summary from the metadata contains only spaces, it should be replaced by the default summary.
Make a normalizer for the metadata returned by the Creodias EO finder API.
Example: https://finder.creodias.eu/resto/collections/Sentinel3/5edf28f1-3042-5ddf-9286-2efb430e94bd.json?&lang=en
The values of dataset parameters, as returned by pythesint, are written multiple times in the unit tests.
They should all be defined once in the DATASET_PARAMETERS dictionary.
Migrate AVISO normalizer based on aviso.py.
It should also be generalized, as it only deals with one file for now.
Create a GPortal GCOM normalizer from url.py.
I've got a CSV file from Canada with metadata like this:
Result Number,Satellite,Date,Beam Mode,Polarization,Type,Image Id,Image Info,Metadata,Reason,Sensor Mode,Orbit Direction,Order Key,SIP Size (MB),Service UUID,Footprint,Look Orientation,Band,Title,Options,Absolute Orbit,Orderable
1,RADARSAT-2,2020-07-16 02:14:27 GMT,ScanSAR Wide A (W1 W2 W3 S7),HH HV,SGF,831163,"{""headers"":[""Product Type"" ""LUT Applied"" ""Sampled Pixel Spacing (Panchromatic)"" ""Product Format"" ""Geodetic Terrain Height""] ""relatedProducts"":[{""values"":[""SGF"" ""Ice"" ""100.0"" ""GeoTIFF"" ""0.00186""]}] ""collectionID"":""Radarsat2"" ""imageID"":""7337877""}",dummy value,,ScanSAR Wide,Ascending,RS2_OK121511_PK1076349_DK1021326_SCWA_20200716_021427_HH_HV_SGF,84,SERVICE-RSAT2_001-000000000000000000,-146.008396 73.905427 -143.459486 72.212173 -127.936480 73.451549 -128.875274 75.249738 -146.008396 73.905427 ,Right,C,rsat2_20200716_N7370W13656,,65706,TRUE
I want to make a small ingester for this kind of file (maybe as part of harvesting), but I need metanorm to interpret this metadata correctly.
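A minimal sketch of how such a file could be parsed, using only a few of the columns from the example above (the helper name is hypothetical):

```python
import csv
import io

# Minimal sketch: parse a row like the one above, keeping only a few of the
# columns, and turn the space-separated "lon lat" footprint into a WKT polygon.
SAMPLE = (
    'Satellite,Date,Footprint,Title\n'
    'RADARSAT-2,2020-07-16 02:14:27 GMT,'
    '"-146.008396 73.905427 -143.459486 72.212173 -146.008396 73.905427",'
    'rsat2_20200716_N7370W13656\n'
)

def footprint_to_wkt(footprint):
    """Convert a 'lon lat lon lat ...' string into a WKT POLYGON"""
    values = footprint.split()
    points = [f'{values[i]} {values[i + 1]}' for i in range(0, len(values), 2)]
    return f"POLYGON(({', '.join(points)}))"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
wkt = footprint_to_wkt(rows[0]['Footprint'])
```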
Since #25, the django
package is no longer required in the test image.
I think it is good to augment the code in this line:
metanorm/metanorm/normalizers/base.py
Line 75 in d09a290
new_members = getattr(self, 'get_' + param)(raw_attributes) or []
The reason is that if get_dataset_parameter does not return anything in a normalizer, then new_members becomes None, which causes an error on the next line:
metanorm/metanorm/normalizers/base.py
Line 76 in d09a290
So, for the safety of the harvesting process, I think it is good to give it a second chance to be [] by using Python's or, so that the NoneType error is never raised.
I have seen this error when I accidentally wrote an incorrect get_dataset_parameter which sometimes returns None.
What is your feedback @aperrin66 ?
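A minimal sketch of the failure mode and the proposed "or []" fix, with simplified, hypothetical names:

```python
# Simplified sketch of the failure mode (hypothetical normalizer):
# a getter that mistakenly returns None instead of a list.
class BuggyNormalizer:
    def get_dataset_parameters(self, raw_attributes):
        return None

def collect(normalizer, param, raw_attributes):
    # Without "or []", new_members would be None and the extend() call
    # below would raise "TypeError: 'NoneType' object is not iterable".
    new_members = getattr(normalizer, 'get_' + param)(raw_attributes) or []
    members = []
    members.extend(new_members)  # safe even when the getter returned None
    return members

result = collect(BuggyNormalizer(), 'dataset_parameters', {})
```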
The time coverage for the following dataset is wrong: ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/nrt_global_allsat_phy_l4_20201101_20201104.nc
When downloading the dataset and looking at its metadata, the time coverage is the following:
But the time coverage returned by the normalizer is 2020-11-01T00:00:00Z to 2020-11-02T00:00:00Z.
The time coverage in all cases where the URL normalizer is used must be checked.
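A sketch of how the two dates embedded in such a filename can be extracted; note that the second date is not necessarily the end of the time coverage (it could, for instance, be a production date), which is exactly what needs to be checked against the file's metadata:

```python
import re
from datetime import datetime

# Sketch: extract the two dates embedded in the filename of the URL above.
url = ('ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/'
       'dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/'
       'nrt_global_allsat_phy_l4_20201101_20201104.nc')

match = re.search(r'_(\d{8})_(\d{8})\.nc$', url)
first_date, second_date = (
    datetime.strptime(group, '%Y%m%d') for group in match.groups())
# Whether second_date is the coverage end or something else entirely
# must be checked against the file's own metadata.
```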
Metanorm was originally designed this way: each normalizer takes care of one metadata convention, then passes responsibility for the attributes it could not fill to the next normalizer.
Looking at the state of the code now, it appears that this is only applicable in some rare cases. The metadata conventions are followed so loosely and vary so much from one metadata provider to the next that most normalizers end up being specific to a provider.
This results in weird and/or inefficient code which must reconcile real world cases with the original design of metanorm.
We could probably make the code both simpler and more efficient by having a structure like this:
UPDATE:
The base structure is in place, and I migrated the Creodias normalizer to have a simple example.
Here are the remaining normalizers to migrate/create (hopefully I did not forget any):
For each of these, please create a branch from issue81_simplification_refactoring, do the modifications, and open a pull request with issue81_simplification_refactoring as target branch.
The new normalizers will be put in the metanorm/normalizers/geospaas/
folder.
The new Creodias normalizer can be taken as example.
Once all the normalizers have been migrated, we can remove the old base classes and move on to adapt geospaas_harvesting.
Replace it with REMSS
One or more normalizers containing hard-coded metadata must be added in order to harvest some specific FTP servers.
Migrate Earthdata CMR normalizer based on earthdata_cmr.py.
The data providers sometimes provide keywords that don't allow pythesint to find a resource.
For example, for platforms, PODAAC provides "METOP_B" but pythesint understands "METOP-B".
We need a way to deal with this kind of situation.
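One possible way to deal with it is an alias table that is tried when the direct lookup fails. A sketch, with a hypothetical alias table and a stand-in for the pythesint lookup:

```python
# Hypothetical alias table mapping provider spellings to pythesint spellings.
KEYWORD_ALIASES = {'METOP_B': 'METOP-B'}

def normalize_keyword(keyword, lookup):
    """Try the keyword as-is, then fall back to a known alias"""
    try:
        return lookup(keyword)
    except KeyError:
        return lookup(KEYWORD_ALIASES.get(keyword, keyword))

# Stand-in for a pythesint lookup, which fails when the keyword is unknown.
def fake_lookup(keyword):
    known = {'METOP-B': {'Short_Name': 'METOP-B'}}
    return known[keyword]

platform = normalize_keyword('METOP_B', fake_lookup)
```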
Migrate the Sentinel SAFE (Copernicus scihub format) normalizer based on sentinel_safe.py and sentinel1_identifier.py.
This will probably result in several normalizer inheriting from a base Sentinel SAFE normalizer.
Migrate OSISAF normalizer based on osisaf.py.
Useful links:
Migrate Radarsat 2 normalizer based on radarsat2_csv.py.
In the case where two files have the same time, the same coordinates and the same other parameters (e.g. GW1AM2__01D_EQOA and GW1AM2__01D_EQOA), only one Dataset is created, with links to two URIs.
Instead, two Datasets should be created, each pointing to a different file with a different filename.
Add the necessary data in the URL normalizer.
https://ftp.opc.ncep.noaa.gov/grids/operational/GLOBALHYCOM/Navy/
Create a CEDA ESA CCI climatology normalizer based on url.py.
The code dealing with time coverage in the URL normalizer is unreadable and repetitive.
It should be rewritten in a clearer way.
There are also mistakes to fix in the time coverage for some sources.
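A sketch of a possible table-driven rewrite, where each source is described by a (URL prefix, regex, duration) entry instead of a chain of if/elif blocks; the entries below are made-up examples, not the real sources:

```python
import re
from datetime import datetime, timedelta

# Each source is described by (URL prefix, date regex, coverage duration);
# these entries are made-up examples, not the real sources.
TIME_COVERAGE_RULES = (
    ('ftp://example.org/daily/', r'(\d{8})\.nc$', timedelta(days=1)),
    ('ftp://example.org/hourly/', r'(\d{8}T\d{2})\.nc$', timedelta(hours=1)),
)

def get_time_coverage(url):
    """Look the URL up in the rules table and compute its time coverage"""
    for prefix, pattern, duration in TIME_COVERAGE_RULES:
        if url.startswith(prefix):
            date_string = re.search(pattern, url).group(1)
            fmt = '%Y%m%dT%H' if 'T' in date_string else '%Y%m%d'
            start = datetime.strptime(date_string, fmt)
            return (start, start + duration)
    raise ValueError(f'No time coverage rule matches {url}')

coverage = get_time_coverage('ftp://example.org/daily/product_20201101.nc')
```

Adding support for a new source then means adding one line to the table instead of a new branch.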
As far as I know, the pti repo uses two lists, and these two lists sometimes provide different outputs. For example, one list provides:
It must be decided which one we should use in the future. This causes problems, especially when working with cumulative parameters in the corresponding repetitive mechanism that we recently developed in the metanorm repo.
Up to now, we have used the search_cf_standard_name_list method of pti for normalizers in metanorm, or sometimes its get_cf_standard_name method.
BUT, if we use the get_wkv_variable method, it will cause the above-mentioned inconsistency and may lead to repeated assignment of parameters (that have the same standard name) to the same dataset.
Please give your opinion about the usage of pti, dear @akorosov.
The output_cumulative_parameter_names argument to MetadataHandler.__init__() should be optional.
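A sketch of the proposed signature (simplified; the real constructor may take other arguments):

```python
# Simplified sketch of the proposed signature; the point is the default
# value, which makes the argument optional for callers.
class MetadataHandler:
    def __init__(self, output_parameter_names,
                 output_cumulative_parameter_names=()):
        self.output_parameter_names = output_parameter_names
        self.output_cumulative_parameter_names = output_cumulative_parameter_names

# Callers which do not use cumulative parameters can now omit the argument.
handler = MetadataHandler(('time_coverage_start', 'time_coverage_end'))
```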
This is connected to issue in harvesters: nansencenter/django-geo-spaas-harvesting#16
Normalizers for metadata from ingestors (similar to SentinelSAFEMetadataNormalizer) and from Identifier (similar to SentinelOneIdentifierMetadataNormalizer) should be added.
The identifier normalizer should add Parameters:
standard_name = sea_ice_area_fraction
It is important to keep in mind extensibility, so that more data sources and more parameters from OSISAF can be added.
Adapt the CMEMSInSituTACMetadataNormalizer to support the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product.
The URL normalizer contains a lot of hard coded data.
This data should be moved to a new file to improve the readability of the normalizer.
Related to nansencenter/django-geo-spaas-harvesting#44
The regex used to extract the entry_id from thredds.met.no URLs should work for any URL starting with "https://thredds.met.no/thredds/".
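A sketch of a possible regex, to be validated against real URLs: it keeps everything after the common prefix, dropping the THREDDS service segment and the file extension. The example URL below is illustrative, not taken from a real catalog:

```python
import re

# Possible regex, to be validated against real URLs: keep everything after
# the common prefix, dropping the THREDDS service segment and the extension.
def get_entry_id(url):
    match = re.match(
        r'^https://thredds\.met\.no/thredds/(?:dodsC/|fileServer/)?'
        r'(?P<entry_id>.*?)(?:\.nc)?(?:\.html)?$',
        url)
    return match.group('entry_id') if match else None

entry_id = get_entry_id(
    'https://thredds.met.no/thredds/dodsC/osisaf/met.no/ice/'
    'ice_type_nh_polstere-100_multi_202001011200.nc')
```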
The Sentinel 2 datasets from Copernicus scihub are not correctly associated with the corresponding GCMD platforms and instruments.
Edit: it turns out we will probably need to download the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product and harvest it locally, so it will be the subject of another issue.
Hardcoded values are needed for three new data sources in nansencenter/django-geo-spaas-harvesting#33 (comment)
Normalizers create instances of GEOSGeometry and return them to the harvesting code, but this creates a dependency on Django. To reduce dependencies (and make metanorm easier to integrate into other software, e.g. Nansat), GEOSGeometry objects should be created in the harvesters, and metanorm should return only strings.
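A sketch of the proposed split (names hypothetical): metanorm builds a plain WKT string, and the harvester, which already depends on Django, turns it into a GEOSGeometry.

```python
# Hypothetical normalizer method: build a plain WKT string so that metanorm
# itself never imports Django.
def get_location_geometry(raw_attributes):
    """Return the spatial coverage as a WKT POLYGON string"""
    lon_min, lat_min, lon_max, lat_max = raw_attributes['bounding_box']
    return (f'POLYGON(({lon_min} {lat_min}, {lon_max} {lat_min}, '
            f'{lon_max} {lat_max}, {lon_min} {lat_max}, {lon_min} {lat_min}))')

wkt = get_location_geometry({'bounding_box': (-10, 50, 10, 60)})

# On the harvesting side, where Django is available:
# from django.contrib.gis.geos import GEOSGeometry
# geometry = GEOSGeometry(wkt)
```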
Create PODAAC normalizer based on acdd.py and geospatial_well_known.py.
Add a normalizer for CMEMS's INSITU_GLO_NRT_OBSERVATIONS_013_030 product
Create a REMSS GMI normalizer based on url.py
Develop a Sentinel1FilenameMetadataNormalizer that can retrieve metadata from the filename.
Example filename:
ftp://ftp.nersc.no/nansat/test_data/sentinel1_l1/S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip
And an (incomplete) example of a unit test:

class SentinelFilenameMetadataNormalizerTests(unittest.TestCase):
    """Tests for the SentinelFilenameMetadataNormalizer"""

    def setUp(self):
        self.normalizer = normalizers.SentinelFilenameMetadataNormalizer([])

    def test_platform(self):
        """Shall return the same platform as SentinelSAFEMetadataNormalizer"""
        attributes = {'filename': 'S1A_...'}
        self.assertEqual(
            self.normalizer.get_platform(attributes),
            pythesint_like_representation('Sentinel-1A'))
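For the normalizer itself, the filename could be parsed along these lines (the field layout follows the published Sentinel-1 naming convention, but the pattern should be checked against the official specification):

```python
import re
from datetime import datetime

# Field layout per the Sentinel-1 naming convention:
# mission, mode, type+resolution, level+class+polarisation, start, stop, ...
SENTINEL1_PATTERN = re.compile(
    r'^S1(?P<unit>[AB])'
    r'_(?P<mode>[A-Z]{2})'
    r'_(?P<product_type>[A-Z]{3})(?P<resolution>[A-Z_])'
    r'_(?P<level>\d)(?P<product_class>[A-Z])(?P<polarisation>[A-Z]{2})'
    r'_(?P<start>\d{8}T\d{6})_(?P<stop>\d{8}T\d{6})_.*$')

def parse_sentinel1_filename(filename):
    """Extract metadata fields from a Sentinel-1 product filename"""
    fields = SENTINEL1_PATTERN.match(filename).groupdict()
    for key in ('start', 'stop'):
        fields[key] = datetime.strptime(fields[key], '%Y%m%dT%H%M%S')
    return fields

fields = parse_sentinel1_filename(
    'S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip')
```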
The summary field should include:
These should be added with a structure eliminating possible ambiguities when appending to an existing summary.
The keywords provided by pythesint went from "capitalized" words ("Sentinel-1") to fully uppercase words ("SENTINEL-1").
We need to make sure that the output from metanorm is always the same, regardless of what pythesint provides.
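One possible sketch, with a hypothetical helper: normalize the capitalization before returning the keyword, so both the old and the new pythesint spellings map to the same output.

```python
# Hypothetical helper: map both pythesint spellings to one fixed form.
def stable_platform_name(name):
    """Return the platform name with a fixed capitalization, whether
    pythesint provides 'Sentinel-1' or 'SENTINEL-1'."""
    return name.capitalize() if name.isupper() else name

old_style = stable_platform_name('Sentinel-1')
new_style = stable_platform_name('SENTINEL-1')
```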
Replace if self.match_metadata(raw_attributes): with more generic checks for each metadata attribute.