nansencenter / metanorm
Metadata normalizing tool
License: GNU General Public License v3.0
Migrate CMEMS in situ TAC normalizer based on cmems_in_situ_tac.py.
https://scihub.copernicus.eu/news/News00868
The sentinel SAFE normalizer needs to be updated.
Migrate CPOM normalizer based on CPOM.py.
return [pti.get_wkv_variable('surface_backwards_scattering_coefficient_of_radar_wave')]
The links to the THREDDS repositories for each region are listed here: https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/navoceano-hycom-glb.
Example: https://www.ncei.noaa.gov/thredds-coastal/catalog/hycom_region1/catalog.html
Needed for nansencenter/django-geo-spaas-harvesting#58
Metadata specification: https://cdn.earthdata.nasa.gov/umm/granule/v1.6.3
As mentioned in nansencenter/django-geo-spaas-harvesting#40, Python lists are only needed when we want to preserve the order of elements. When order does not matter, there is no need to use a list.
Sets should be used when we need to add more elements afterwards.
Tuples should be used when the collection does not need to change afterwards.
In terms of execution speed, the order of preference between these data types is as follows:
1. tuple
2. set
3. list
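As a quick illustration of the guideline above (all names here are hypothetical):

```python
# Fixed collection that never changes: a tuple.
SUPPORTED_CONVENTIONS = ('ACDD', 'SAFE', 'CMR')

# Collection that may grow later, where order is irrelevant: a set.
seen_parameters = {'latitude', 'longitude'}
seen_parameters.add('time')

# Order matters (e.g. the priority of normalizers): a list.
normalizer_priority = ['SentinelSAFEMetadataNormalizer', 'URLMetadataNormalizer']

# As a bonus, membership tests are O(1) on sets but O(n) on lists and tuples.
assert 'time' in seen_parameters
```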
If the summary from the metadata contains only spaces, it should be replaced by the default summary.
Make a normalizer for the metadata returned by the Creodias EO finder API.
Example: https://finder.creodias.eu/resto/collections/Sentinel3/5edf28f1-3042-5ddf-9286-2efb430e94bd.json?&lang=en
The values of dataset parameters, as returned by pythesint, are written multiple times in the unit tests.
They should all be defined once in the DATASET_PARAMETERS dictionary.
Migrate AVISO normalizer based on aviso.py.
It should also be generalized, as it only deals with one file for now.
Create a GPortal GCOM normalizer from url.py.
I've got a CSV file from Canada with metadata like this:
Result Number,Satellite,Date,Beam Mode,Polarization,Type,Image Id,Image Info,Metadata,Reason,Sensor Mode,Orbit Direction,Order Key,SIP Size (MB),Service UUID,Footprint,Look Orientation,Band,Title,Options,Absolute Orbit,Orderable
1,RADARSAT-2,2020-07-16 02:14:27 GMT,ScanSAR Wide A (W1 W2 W3 S7),HH HV,SGF,831163,"{""headers"":[""Product Type"" ""LUT Applied"" ""Sampled Pixel Spacing (Panchromatic)"" ""Product Format"" ""Geodetic Terrain Height""] ""relatedProducts"":[{""values"":[""SGF"" ""Ice"" ""100.0"" ""GeoTIFF"" ""0.00186""]}] ""collectionID"":""Radarsat2"" ""imageID"":""7337877""}",dummy value,,ScanSAR Wide,Ascending,RS2_OK121511_PK1076349_DK1021326_SCWA_20200716_021427_HH_HV_SGF,84,SERVICE-RSAT2_001-000000000000000000,-146.008396 73.905427 -143.459486 72.212173 -127.936480 73.451549 -128.875274 75.249738 -146.008396 73.905427 ,Right,C,rsat2_20200716_N7370W13656,,65706,TRUE
I want to make a small ingester for this kind of file (maybe as part of harvesting), but I need metanorm to interpret this metadata correctly.
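A minimal sketch of how such a file could be parsed, using only a few of the columns from the example above (the helper name is hypothetical):

```python
import csv
import io

# Minimal sketch: parse a row like the one above, keeping only a few of the
# columns, and turn the space-separated "lon lat" footprint into a WKT polygon.
SAMPLE = (
    'Satellite,Date,Footprint,Title\n'
    'RADARSAT-2,2020-07-16 02:14:27 GMT,'
    '"-146.008396 73.905427 -143.459486 72.212173 -146.008396 73.905427",'
    'rsat2_20200716_N7370W13656\n'
)

def footprint_to_wkt(footprint):
    """Convert a 'lon lat lon lat ...' string into a WKT POLYGON"""
    values = footprint.split()
    points = [f'{values[i]} {values[i + 1]}' for i in range(0, len(values), 2)]
    return f"POLYGON(({', '.join(points)}))"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
wkt = footprint_to_wkt(rows[0]['Footprint'])
```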
Since #25, the django
package is no longer required in the test image.
I think it is good to augment the code in this line:
metanorm/metanorm/normalizers/base.py
Line 75 in d09a290
new_members = getattr(self, 'get_' + param)(raw_attributes) or []
The reason is that if get_dataset_parameter does not return anything in a normalizer, then new_members becomes None, which causes an error on the next line:
metanorm/metanorm/normalizers/base.py
Line 76 in d09a290
So, for the safety of the harvesting process, I think it is good to give it a second chance to be [] by using Python's or, so that the NoneType error is never raised.
I have seen this error when I accidentally wrote an incorrect get_dataset_parameter which sometimes returns None.
What is your feedback @aperrin66 ?
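A minimal sketch of the failure mode and the proposed "or []" fix, with simplified, hypothetical names:

```python
# Simplified sketch of the failure mode (hypothetical normalizer):
# a getter that mistakenly returns None instead of a list.
class BuggyNormalizer:
    def get_dataset_parameters(self, raw_attributes):
        return None

def collect(normalizer, param, raw_attributes):
    # Without "or []", new_members would be None and the extend() call
    # below would raise "TypeError: 'NoneType' object is not iterable".
    new_members = getattr(normalizer, 'get_' + param)(raw_attributes) or []
    members = []
    members.extend(new_members)  # safe even when the getter returned None
    return members

result = collect(BuggyNormalizer(), 'dataset_parameters', {})
```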
The time coverage for the following dataset is wrong: ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/nrt_global_allsat_phy_l4_20201101_20201104.nc
When downloading the dataset and looking at its metadata, the time coverage is the following:
But the time coverage returned by the normalizer is 2020-11-01T00:00:00Z to 2020-11-02T00:00:00Z.
The time coverage in all cases where the URL normalizer is used must be checked.
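A sketch of how the two dates embedded in such a filename can be extracted; note that the second date is not necessarily the end of the time coverage (it could, for instance, be a production date), which is exactly what needs to be checked against the file's metadata:

```python
import re
from datetime import datetime

# Sketch: extract the two dates embedded in the filename of the URL above.
url = ('ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/'
       'dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/'
       'nrt_global_allsat_phy_l4_20201101_20201104.nc')

match = re.search(r'_(\d{8})_(\d{8})\.nc$', url)
first_date, second_date = (
    datetime.strptime(group, '%Y%m%d') for group in match.groups())
# Whether second_date is the coverage end or something else entirely
# must be checked against the file's own metadata.
```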
Metanorm was originally designed this way: each normalizer takes care of one metadata convention, then passes responsibility for the attributes it could not fill to the next normalizer.
Looking at the state of the code now, it appears that this is only applicable in some rare cases. The metadata conventions are followed so loosely and vary so much from one metadata provider to the next that most normalizers end up being specific to a provider.
This results in weird and/or inefficient code which must reconcile real world cases with the original design of metanorm.
We could probably make the code both simpler and more efficient by having a structure like this:
UPDATE:
The base structure is in place, and I migrated the Creodias normalizer to have a simple example.
Here are the remaining normalizers to migrate/create (hopefully I did not forget any):
For each of these, please create a branch from issue81_simplification_refactoring, do the modifications, and open a pull request with issue81_simplification_refactoring as target branch.
The new normalizers will be put in the metanorm/normalizers/geospaas/
folder.
The new Creodias normalizer can be taken as example.
Once all the normalizers have been migrated, we can remove the old base classes and move on to adapt geospaas_harvesting.
Replace it with REMSS
One or more normalizers containing hard-coded metadata must be added in order to harvest some specific FTP servers.
Migrate Earthdata CMR normalizer based on earthdata_cmr.py.
The data providers sometimes provide keywords that don't allow pythesint to find a resource.
For example, for platforms, PODAAC provides "METOP_B" but pythesint understands "METOP-B".
We need a way to deal with this kind of situation.
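One possible way to deal with it is an alias table that is tried when the direct lookup fails. A sketch, with a hypothetical alias table and a stand-in for the pythesint lookup:

```python
# Hypothetical alias table mapping provider spellings to pythesint spellings.
KEYWORD_ALIASES = {'METOP_B': 'METOP-B'}

def normalize_keyword(keyword, lookup):
    """Try the keyword as-is, then fall back to a known alias"""
    try:
        return lookup(keyword)
    except KeyError:
        return lookup(KEYWORD_ALIASES.get(keyword, keyword))

# Stand-in for a pythesint lookup, which fails when the keyword is unknown.
def fake_lookup(keyword):
    known = {'METOP-B': {'Short_Name': 'METOP-B'}}
    return known[keyword]

platform = normalize_keyword('METOP_B', fake_lookup)
```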
Migrate the Sentinel SAFE (Copernicus scihub format) normalizer based on sentinel_safe.py and sentinel1_identifier.py.
This will probably result in several normalizer inheriting from a base Sentinel SAFE normalizer.
Migrate OSISAF normalizer based on osisaf.py.
Useful links:
Migrate Radarsat 2 normalizer based on radarsat2_csv.py.
In the case where two files have the same time, the same coordinates and the same other parameters (e.g. GW1AM2__01D_EQOA and GW1AM2__01D_EQOA), only one Dataset is created, with links to two URIs.
Instead, two Datasets should be created, each pointing to a different file with a different filename.
Add the necessary data in the URL normalizer.
https://ftp.opc.ncep.noaa.gov/grids/operational/GLOBALHYCOM/Navy/
Create a CEDA ESA CCI climatology normalizer based on url.py.
The code dealing with time coverage in the URL normalizer is unreadable and repetitive.
It should be rewritten in a clearer way.
There are also mistakes to fix in the time coverage for some sources.
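A sketch of a possible table-driven rewrite, where each source is described by a (URL prefix, regex, duration) entry instead of a chain of if/elif blocks; the entries below are made-up examples, not the real sources:

```python
import re
from datetime import datetime, timedelta

# Each source is described by (URL prefix, date regex, coverage duration);
# these entries are made-up examples, not the real sources.
TIME_COVERAGE_RULES = (
    ('ftp://example.org/daily/', r'(\d{8})\.nc$', timedelta(days=1)),
    ('ftp://example.org/hourly/', r'(\d{8}T\d{2})\.nc$', timedelta(hours=1)),
)

def get_time_coverage(url):
    """Look the URL up in the rules table and compute its time coverage"""
    for prefix, pattern, duration in TIME_COVERAGE_RULES:
        if url.startswith(prefix):
            date_string = re.search(pattern, url).group(1)
            fmt = '%Y%m%dT%H' if 'T' in date_string else '%Y%m%d'
            start = datetime.strptime(date_string, fmt)
            return (start, start + duration)
    raise ValueError(f'No time coverage rule matches {url}')

coverage = get_time_coverage('ftp://example.org/daily/product_20201101.nc')
```

Adding support for a new source then means adding one line to the table instead of a new branch.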
As far as I know, the pti repo uses two lists, and these two lists sometimes provide different outputs. For example, one list provides:
It must be decided which one we should use in the future. This causes problems, especially when working with cumulative parameters in the corresponding repetitive mechanism that we recently developed in the metanorm repo.
Up to now, we have used the search_cf_standard_name_list method of pti for normalizers in metanorm, or sometimes its get_cf_standard_name method.
BUT, if we use the get_wkv_variable method, it will cause the above-mentioned inconsistency and may lead to repeated assignment of parameters (that have the same standard name) to the same dataset.
Please give your opinion about the usage of pti, dear @akorosov.
The output_cumulative_parameter_names argument to MetadataHandler.__init__() should be optional.
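A sketch of the proposed signature (simplified; the real constructor may take other arguments):

```python
# Simplified sketch of the proposed signature; the point is the default
# value, which makes the argument optional for callers.
class MetadataHandler:
    def __init__(self, output_parameter_names,
                 output_cumulative_parameter_names=()):
        self.output_parameter_names = output_parameter_names
        self.output_cumulative_parameter_names = output_cumulative_parameter_names

# Callers which do not use cumulative parameters can now omit the argument.
handler = MetadataHandler(('time_coverage_start', 'time_coverage_end'))
```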
This is connected to issue in harvesters: nansencenter/django-geo-spaas-harvesting#16
Normalizers for metadata from ingestors (similar to SentinelSAFEMetadataNormalizer) and from Identifier (similar to SentinelOneIdentifierMetadataNormalizer) should be added.
The identifier normalizer should add Parameters:
standard_name = sea_ice_area_fraction
It is important to keep in mind extensibility, so that more data sources and more parameters from OSISAF can be added.
Adapt the CMEMSInSituTACMetadataNormalizer to support the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product.
The URL normalizer contains a lot of hard coded data.
This data should be moved to a new file to improve the readability of the normalizer.
Related to nansencenter/django-geo-spaas-harvesting#44
The regex used to extract the entry_id from thredds.met.no URLs should work for any URL starting with "https://thredds.met.no/thredds/".
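A sketch of a possible regex, to be validated against real URLs: it keeps everything after the common prefix, dropping the THREDDS service segment and the file extension. The example URL below is illustrative, not taken from a real catalog:

```python
import re

# Possible regex, to be validated against real URLs: keep everything after
# the common prefix, dropping the THREDDS service segment and the extension.
def get_entry_id(url):
    match = re.match(
        r'^https://thredds\.met\.no/thredds/(?:dodsC/|fileServer/)?'
        r'(?P<entry_id>.*?)(?:\.nc)?(?:\.html)?$',
        url)
    return match.group('entry_id') if match else None

entry_id = get_entry_id(
    'https://thredds.met.no/thredds/dodsC/osisaf/met.no/ice/'
    'ice_type_nh_polstere-100_multi_202001011200.nc')
```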
The Sentinel 2 datasets from Copernicus scihub are not correctly associated with the corresponding GCMD platforms and instruments.
Edit: it turns out we will probably need to download the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product and harvest it locally, so it will be the subject of another issue.
Hardcoded values are needed for three new data sources in nansencenter/django-geo-spaas-harvesting#33 (comment)
Normalizers create instances of GEOSGeometry and return them to the harvesting code, but this creates a dependency on Django. To reduce dependencies (and make metanorm easier to integrate into other software, e.g. Nansat), GEOSGeometry objects should be created in the harvesters, and metanorm should return only strings.
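A sketch of the proposed split (names hypothetical): metanorm builds a plain WKT string, and the harvester, which already depends on Django, turns it into a GEOSGeometry.

```python
# Hypothetical normalizer method: build a plain WKT string so that metanorm
# itself never imports Django.
def get_location_geometry(raw_attributes):
    """Return the spatial coverage as a WKT POLYGON string"""
    lon_min, lat_min, lon_max, lat_max = raw_attributes['bounding_box']
    return (f'POLYGON(({lon_min} {lat_min}, {lon_max} {lat_min}, '
            f'{lon_max} {lat_max}, {lon_min} {lat_max}, {lon_min} {lat_min}))')

wkt = get_location_geometry({'bounding_box': (-10, 50, 10, 60)})

# On the harvesting side, where Django is available:
# from django.contrib.gis.geos import GEOSGeometry
# geometry = GEOSGeometry(wkt)
```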
Create PODAAC normalizer based on acdd.py and geospatial_well_known.py.
Add a normalizer for CMEMS's INSITU_GLO_NRT_OBSERVATIONS_013_030 product
Create a REMSS GMI normalizer based on url.py
Develop a Sentinel1FilenameMetadataNormalizer that can retrieve metadata from the filename.
Example filename:
ftp://ftp.nersc.no/nansat/test_data/sentinel1_l1/S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip
And an (incomplete) example of a unit test:

class SentinelFilenameMetadataNormalizerTests(unittest.TestCase):
    """Tests for the SentinelFilenameMetadataNormalizer"""

    def setUp(self):
        self.normalizer = normalizers.SentinelFilenameMetadataNormalizer([])

    def test_platform(self):
        """Shall return the same platform as SentinelSAFEMetadataNormalizer"""
        attributes = {'filename': 'S1A_...'}
        self.assertEqual(
            self.normalizer.get_platform(attributes),
            pythesint_like_representation('Sentinel-1A'))
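For the normalizer itself, the filename could be parsed along these lines (the field layout follows the published Sentinel-1 naming convention, but the pattern should be checked against the official specification):

```python
import re
from datetime import datetime

# Field layout per the Sentinel-1 naming convention:
# mission, mode, type+resolution, level+class+polarisation, start, stop, ...
SENTINEL1_PATTERN = re.compile(
    r'^S1(?P<unit>[AB])'
    r'_(?P<mode>[A-Z]{2})'
    r'_(?P<product_type>[A-Z]{3})(?P<resolution>[A-Z_])'
    r'_(?P<level>\d)(?P<product_class>[A-Z])(?P<polarisation>[A-Z]{2})'
    r'_(?P<start>\d{8}T\d{6})_(?P<stop>\d{8}T\d{6})_.*$')

def parse_sentinel1_filename(filename):
    """Extract metadata fields from a Sentinel-1 product filename"""
    fields = SENTINEL1_PATTERN.match(filename).groupdict()
    for key in ('start', 'stop'):
        fields[key] = datetime.strptime(fields[key], '%Y%m%dT%H%M%S')
    return fields

fields = parse_sentinel1_filename(
    'S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip')
```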
The summary field should include:
These should be added with a structure eliminating possible ambiguities when appending to an existing summary.
The keywords provided by pythesint went from "capitalized" words ("Sentinel-1") to fully uppercase words ("SENTINEL-1").
We need to make sure that the output from metanorm is always the same, regardless of what pythesint provides.
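One possible sketch, with a hypothetical helper: normalize the capitalization before returning the keyword, so both the old and the new pythesint spellings map to the same output.

```python
# Hypothetical helper: map both pythesint spellings to one fixed form.
def stable_platform_name(name):
    """Return the platform name with a fixed capitalization, whether
    pythesint provides 'Sentinel-1' or 'SENTINEL-1'."""
    return name.capitalize() if name.isupper() else name

old_style = stable_platform_name('Sentinel-1')
new_style = stable_platform_name('SENTINEL-1')
```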
Replace if self.match_metadata(raw_attributes): with more generic checks for each metadata attribute.