
metanorm's Introduction


Metadata normalizing tool

The purpose of this tool is to extract a defined set of parameters from raw metadata. It is meant primarily for use with geo-spatial datasets, but can be extended to process any kind of data.

Principle

Input: raw metadata attributes in the form of a dictionary

Output: a dictionary in which normalized parameter names are associated with values found in the raw metadata.

The actual work of extracting attribute values from the raw metadata is done by normalizers.

Each normalizer is a class able to deal with a particular type of metadata. In the case of geo-spatial datasets, a normalizer is typically able to deal with the metadata format of a particular data provider.

Usage

Although normalizers can be used directly, the easiest way to normalize metadata is to use a MetadataHandler. A metadata handler is initialized using a base normalizer class.

When trying to normalize metadata (using the handler's get_parameters() method), the handler tries all normalizers which inherit from this base class. The first normalizer which is able to deal with the metadata is used.

To determine if a normalizer is able to deal with a dictionary of raw metadata, the handler calls its check() method.
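
Conceptually, the selection works roughly like this (a simplified sketch, not the actual implementation of MetadataHandler):

class MetadataHandler():
    """Simplified illustration of how a handler picks a normalizer"""

    def __init__(self, base_normalizer_class):
        # instantiate every normalizer which inherits from the base class
        self.normalizers = [cls() for cls in base_normalizer_class.__subclasses__()]

    def get_parameters(self, raw_metadata):
        # the first normalizer whose check() accepts the metadata is used
        for normalizer in self.normalizers:
            if normalizer.check(raw_metadata):
                return normalizer.get_parameters(raw_metadata)
        raise ValueError('No normalizer found for this metadata')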

Example to normalize data for use in django-geo-spaas:

import metanorm.handlers as handlers
import metanorm.normalizers as normalizers

metadata_to_normalize = {
  'foo': 'bar',
  'baz': 'qux'
}

m = handlers.MetadataHandler(normalizers.geospaas.GeoSPaaSMetadataNormalizer)
normalized_metadata = m.get_parameters(metadata_to_normalize)


metanorm's Issues

Add normalizers for OSISAF data

This is connected to an issue in the harvesters: nansencenter/django-geo-spaas-harvesting#16

Normalizers for metadata from ingestors (similar to SentinelSAFEMetadataNormalizer) and from the identifier (similar to SentinelOneIdentifierMetadataNormalizer) should be added.

The identifier normalizer should add the following parameter:
standard_name = sea_ice_area_fraction

It is important to keep in mind extensibility, so that more sources of data and more parameters from OSISAF can be added later.
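
For illustration, the identifier normalizer's parameter method might look roughly like this (the method name, the identifier check and the return structure are assumptions based on the description above, not the actual metanorm API):

def get_dataset_parameters(self, raw_attributes):
    """Return the dataset parameters deduced from an OSISAF identifier"""
    if 'osisaf' in raw_attributes.get('Identifier', '').lower():
        # in the real implementation a pythesint vocabulary lookup
        # would probably be used instead of a literal dictionary
        return [{'standard_name': 'sea_ice_area_fraction'}]
    return []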

Datasets with identical metadata may link with multiple URIs

In the case where two files have the same time, the same coordinates and the same other parameters (e.g. GW1AM2__01D_EQOA and GW1AM2__01D_EQOA), only one Dataset is created, with links to the two URIs.

Instead, two Datasets should be created, each pointing to one of the two files (which have different filenames).

Add information to the summary field

The summary field should include:

  • the processing level (L1, L2, etc...)
  • for models, the name of the model

These should be added with a structure eliminating possible ambiguities when appending to an existing summary.
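
For example, a labeled, delimiter-separated structure would keep the fields unambiguous when more of them are appended later (the labels, separator and values below are only illustrative, not a settled format):

raw_summary = 'Sea ice concentration from passive microwave data'  # example description
summary = ';'.join([
    'Description: ' + raw_summary,
    'Processing level: L2',
    'Product: some_model_name',  # for model data: the name of the model
])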

Add normalizer for Radarsat2 CSV file from CSA

I've got a CSV file from Canada with metadata like this:

Result Number,Satellite,Date,Beam Mode,Polarization,Type,Image Id,Image Info,Metadata,Reason,Sensor Mode,Orbit Direction,Order Key,SIP Size (MB),Service UUID,Footprint,Look Orientation,Band,Title,Options,Absolute Orbit,Orderable
1,RADARSAT-2,2020-07-16 02:14:27 GMT,ScanSAR Wide A (W1 W2 W3 S7),HH HV,SGF,831163,"{""headers"":[""Product Type"" ""LUT Applied"" ""Sampled Pixel Spacing (Panchromatic)"" ""Product Format"" ""Geodetic Terrain Height""] ""relatedProducts"":[{""values"":[""SGF"" ""Ice"" ""100.0"" ""GeoTIFF"" ""0.00186""]}] ""collectionID"":""Radarsat2"" ""imageID"":""7337877""}",dummy value,,ScanSAR Wide,Ascending,RS2_OK121511_PK1076349_DK1021326_SCWA_20200716_021427_HH_HV_SGF,84,SERVICE-RSAT2_001-000000000000000000,-146.008396 73.905427 -143.459486 72.212173 -127.936480 73.451549 -128.875274 75.249738 -146.008396 73.905427 ,Right,C,rsat2_20200716_N7370W13656,,65706,TRUE

I want to make a small ingester for this kind of file (maybe as part of harvesting), but I need metanorm to interpret this metadata correctly.
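
As a starting point, a small reader for this kind of file could look like the sketch below (purely illustrative; the column names are taken from the header of the example above, and the output keys are assumptions):

import csv
from datetime import datetime, timezone

def read_csa_csv(path):
    """Yield one raw attribute dictionary per row of a CSA Radarsat-2 CSV file"""
    with open(path, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            # '2020-07-16 02:14:27 GMT' -> timezone-aware datetime
            row['time_coverage_start'] = datetime.strptime(
                row['Date'].replace(' GMT', ''), '%Y-%m-%d %H:%M:%S'
            ).replace(tzinfo=timezone.utc)
            # 'lon lat lon lat ...' -> WKT polygon
            coordinates = row['Footprint'].split()
            points = ', '.join(
                f'{coordinates[i]} {coordinates[i + 1]}'
                for i in range(0, len(coordinates), 2))
            row['wkt_footprint'] = f'POLYGON(({points}))'
            yield row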

Wrong time coverage in the URL normalizer

The time coverage for the following dataset is wrong: ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/dataset-duacs-nrt-global-merged-allsat-phy-l4/2020/11/nrt_global_allsat_phy_l4_20201101_20201104.nc

When downloading the dataset and looking at its metadata, the time coverage is the following:

  • time_coverage_start=2020-10-31T12:00:00Z
  • time_coverage_end=2020-11-01T12:00:00Z

But the time coverage returned by the normalizer is 2020-11-01T00:00:00Z to 2020-11-02T00:00:00Z.

The time coverage must be checked for all sources handled by the URL normalizer.
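
A quick way to check each case is to compare the file's own attributes with the normalizer's output, for example with the netCDF4 package (assuming the file has been downloaded locally):

import netCDF4

dataset = netCDF4.Dataset('nrt_global_allsat_phy_l4_20201101_20201104.nc')
# global attributes as stored in the file
print(dataset.time_coverage_start)  # 2020-10-31T12:00:00Z
print(dataset.time_coverage_end)    # 2020-11-01T12:00:00Z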

Add new normalizer based on filename

Develop Sentinel1FilenameMetadataNormalizer that can retrieve

  • entry_id
  • platform
  • instrument
  • time_coverage_start
  • time_coverage_end
  • provider

from the filename.

Example filename:
ftp://ftp.nersc.no/nansat/test_data/sentinel1_l1/S1A_EW_GRDM_1SDH_20150702T172954_20150702T173054_006635_008DA5_55D1.zip

And an (incomplete) example of a unit test:

import unittest

import metanorm.normalizers as normalizers


class SentinelFilenameMetadataNormalizerTests(unittest.TestCase):
    """Tests for the SentinelFilenameMetadataNormalizer"""

    def setUp(self):
        self.normalizer = normalizers.SentinelFilenameMetadataNormalizer([])

    def test_platform(self):
        """Shall return the platform deduced from the filename"""
        attributes = {'filename': 'S1A_...'}
        # 'pythesint_like_representation' is a placeholder for the expected
        # pythesint entry for 'Sentinel-1A'
        self.assertEqual(
            self.normalizer.get_platform(attributes),
            pythesint_like_representation('Sentinel-1A'))
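
For reference, the filename itself could be parsed with a regular expression along these lines (a sketch only; the pattern follows the standard Sentinel-1 naming convention and the function name is hypothetical):

import re
from datetime import datetime, timezone

SENTINEL1_FILENAME_PATTERN = re.compile(
    r'^(?P<platform>S1[AB])_'
    r'(?P<mode>[A-Z0-9]{2})_'
    r'(?P<product_type>[A-Z0-9]{4})_'
    r'(?P<level_class_polarisation>[A-Z0-9]{4})_'
    r'(?P<start>\d{8}T\d{6})_'
    r'(?P<stop>\d{8}T\d{6})_'
    r'(?P<orbit>\d{6})_'
    r'(?P<mission_data_take>[0-9A-F]{6})_'
    r'(?P<unique_id>[0-9A-F]{4})'
)

def parse_sentinel1_filename(filename):
    """Return a dictionary of fields parsed from a Sentinel-1 file name
    (the base file name only, without path or URL prefix)
    """
    match = SENTINEL1_FILENAME_PATTERN.match(filename)
    if not match:
        return None
    fields = match.groupdict()
    for key in ('start', 'stop'):
        fields[key] = datetime.strptime(
            fields[key], '%Y%m%dT%H%M%S').replace(tzinfo=timezone.utc)
    return fields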

Refactor the URL normalizer's time coverage code

The code dealing with time coverage in the URL normalizer is unreadable and repetitive.
It should be rewritten in a clearer way.

There are also mistakes to fix in the time coverage for some sources.

Sometimes get_dataset_parameter returns None!

I think it would be good to augment the code on this line:

new_members = getattr(self, 'get_' + param)(raw_attributes)

so that it becomes:

new_members = getattr(self, 'get_' + param)(raw_attributes) or []

The reason is that if get_dataset_parameter does not return anything in a normalizer, new_members becomes None, which causes an error on the next line:

for new_member in new_members:

and raises an error during the harvesting process.

So, for the safety of the harvesting process, I think it would be good to give it a second chance to become [] by using Python's or, so that the NoneType error is never raised.
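
In context, the proposed change would look roughly like this (the loop is reproduced from the snippets above, with the same variable names):

# 'or []' guarantees that the loop below never receives None
new_members = getattr(self, 'get_' + param)(raw_attributes) or []
for new_member in new_members:
    ...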

I have seen this error when I accidentally wrote an incorrect get_dataset_parameter which sometimes returned None.

What is your feedback @aperrin66 ?

Python data types revisited

@aperrin66
@akorosov

As mentioned in nansencenter/django-geo-spaas-harvesting#40, Python lists are only needed when we want to preserve the order of the elements. When the order does not matter, there is no need to use a list.

Sets should be used when we need to add elements afterwards.
Tuples should be used when the collection does not need to change afterwards.

In terms of execution speed, the preferred order of data types is as follows:

1. tuple
2. set
3. list
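
A small illustration of this guideline (the variable names are only examples):

PLATFORMS = ('Sentinel-1A', 'Sentinel-1B')    # fixed collection: tuple
found_parameters = set()                      # grows later, order irrelevant: set
found_parameters.add('sea_ice_area_fraction')
time_coverage = ['2020-11-01', '2020-11-02']  # order matters: list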

Migrate the Sentinel SAFE normalizer

Migrate the Sentinel SAFE (Copernicus scihub format) normalizer based on sentinel_safe.py and sentinel1_identifier.py.
This will probably result in several normalizers inheriting from a base Sentinel SAFE normalizer.

Simplification of metanorm

Metanorm was originally designed this way: each normalizer takes care of one metadata convention, then passes responsibility for the attributes it could not fill to the next normalizer.

Looking at the state of the code now, it appears that this is only applicable in some rare cases. The metadata conventions are followed so loosely and vary so much from one metadata provider to the next that most normalizers end up being specific to a provider.

This results in weird and/or inefficient code which must reconcile real world cases with the original design of metanorm.

We could probably make the code both simpler and more efficient by having a structure like this (a rough skeleton is sketched after the list):

  • one normalizer per provider
  • each normalizer has a method that can tell from the raw attributes if the normalizer can be used
  • each normalizer provides the necessary methods to fill all the attributes
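
Such a provider normalizer could look roughly like this (class, method and attribute names are illustrative, not the final API):

import dateutil.parser

class SomeProviderMetadataNormalizer():
    """Normalizer for metadata from a hypothetical provider"""

    def check(self, raw_metadata):
        """Tell whether this normalizer can handle the given raw metadata"""
        return raw_metadata.get('url', '').startswith('https://data.provider.example/')

    def get_time_coverage_start(self, raw_metadata):
        return dateutil.parser.parse(raw_metadata['start_date'])

    # ...one get_*() method per attribute to normalize...

    def get_parameters(self, raw_metadata):
        """Build the dictionary of normalized attributes"""
        return {'time_coverage_start': self.get_time_coverage_start(raw_metadata)}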

UPDATE:
The base structure is in place, and I have migrated the Creodias normalizer to provide a simple example.

Here are the remaining normalizers to migrate/create (hopefully I did not forget any):

For each of these, please create a branch from issue81_simplification_refactoring, make the modifications, and open a pull request with issue81_simplification_refactoring as the target branch.

The new normalizers will be put in the metanorm/normalizers/geospaas/ folder.

The new Creodias normalizer can be taken as an example.

Once all the normalizers have been migrated, we can remove the old base classes and move on to adapting geospaas_harvesting.

Migrate AVISO normalizer

Migrate AVISO normalizer based on aviso.py.
It should also be generalized, as it only deals with one file for now.

Metanorm depends on Django

Normalizers create GEOSGeometry instances and return them to the harvesting code, which creates a dependency on Django. To reduce dependencies (and make metanorm easier to integrate into other software, e.g. Nansat), the GEOSGeometry should be created in the harvesters, and metanorm should return only strings.
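
A minimal sketch of the proposed split (variable names are assumptions): metanorm returns a WKT string, and the harvester, which already depends on Django, builds the geometry from it.

from django.contrib.gis.geos import GEOSGeometry

# In metanorm: the normalizer returns a plain WKT string instead of a geometry object
wkt_footprint = 'POLYGON((-180 -90, -180 90, 180 90, 180 -90, -180 -90))'

# In geospaas_harvesting: build the GEOSGeometry from the WKT string
geometry = GEOSGeometry(wkt_footprint, srid=4326)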

Add support for new CMEMS products in the URL normalizer

  • MEDSEA_ANALYSIS_FORECAST_PHY_006_013
  • IBI_ANALYSIS_FORECAST_PHYS_005_001
  • INSITU_GLO_UV_NRT_OBSERVATIONS_013_048

Edit: it turns out we will probably need to download the INSITU_GLO_UV_NRT_OBSERVATIONS_013_048 product and harvest it locally, so it will be the subject of another issue.

When using pti, which list should be used?

As far as I know, the pti repository uses two lists, and these two lists sometimes provide different outputs for the same query. For example, one list provides one result while the other provides a different one (see the two screenshots attached to the issue).

We need to decide which one should be used in the future. This inconsistency causes problems, especially when working with the cumulative parameters in the corresponding repetitive mechanism that we recently developed in the metanorm repo.

Up to now, we have used pti's search_cf_standard_name_list method in the metanorm normalizers, or sometimes its get_cf_standard_name method.
But if we use the get_wkt_variable method, it will cause the above-mentioned inconsistency and may assign duplicate parameters (with the same standard name) to the same dataset.

Please advise on the usage of pti, dear @akorosov.
