vforwater / metacatalog
Modular metadata management platform for environmental data.
Home Page: https://vforwater.github.io/metacatalog
License: GNU General Public License v3.0
We ultimately need an Entry.citation
property. This can be implemented either as a text field with a default value, or as a lookup table with more structured information filled by default.
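A plain-Python sketch of the first variant: a default citation string built from structured fields, overridable by a free-text value. All attribute names besides citation are illustrative, not the actual model fields:

```python
class Entry:
    """Sketch only: illustrative stand-in for the metacatalog Entry model."""
    def __init__(self, title, authors, publication_year, citation=None):
        self.title = title
        self.authors = authors                    # list of 'Last, F.' strings
        self.publication_year = publication_year
        self._citation = citation                 # free-text override, if set

    @property
    def citation(self):
        # an explicit free-text citation wins; otherwise build a default
        # value from the structured fields
        if self._citation is not None:
            return self._citation
        return '%s (%d): %s.' % (
            ', '.join(self.authors), self.publication_year, self.title
        )
```

The lookup-table variant would replace the string template with a reference to a citation-style table.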
Whenever creating a connection to the database, the content of the alembic_version
table should be checked against the revisions present in the current metacatalog repo. Two kinds of warnings can be raised:
This should be done quickly, to make the upload scripts a bit more expressive.
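A minimal sketch of such a check, assuming the two revision ids have already been read (one from the alembic_version table, one from the local migration scripts); the function name check_db_version is hypothetical:

```python
import warnings

def check_db_version(db_head: str, local_head: str) -> bool:
    """Compare the database's alembic_version revision against the
    local head revision. Hypothetical helper; in practice the ids would
    come from the alembic_version table and alembic's ScriptDirectory."""
    if db_head == local_head:
        return True
    # the two mismatch directions could raise different warning classes
    warnings.warn(
        'Database revision %s does not match local head %s. '
        'Run the migrations before continuing.' % (db_head, local_head)
    )
    return False
```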
A data source needs two types: a data type and an origin type, in order to finally translate it into ISO 19115.
Link them to MD_DatatypeCode and MD_CoverageContentTypeCode (the latter if raster).
Is there a reason to write geneic
in geometry_data.py
or is this just a typo?
class GenericGeometryData(Base):
__tablename__ = 'geneic_geometry_data'
In ISO 19115 it is possible to define extensions that describe metadata not part of ISO. We could bind that to models.Detail
and move everything into details that cannot be mapped to ISO directly.
As of now, the docs are a real mess. The CI is working but the output is not really helpful. Will have a look into this...
Some Entry records only apply to specific Entry types, and some optional ISO 19115 code lists can only be filled if the data source provides sufficient information. Thus, we need a global object that can be adapted by the user to search Detail
keys for names that are defined in ISO 19115 code lists.
Already for Sap Flow data and Eddy data, the proposed controlled Keyword concept no longer works. I will refactor it into:
a) controlled keywords fed from different dictionaries, which just tag an Entry
to apply keyword filters.
b) an arbitrary amount of key-value pairs that can be linked to an Entry
. I will test out some stemming libraries in Python to standardize the keys to a specific amount. NoSQL would be great here, but that's not the way to go.
One example that no longer worked was the metadata of sap flow sensors: how deep they are built into a tree. If I introduced a custom keyword that describes the horizontal depth in a tree in a standardized manner, it might work for metacatalog, but when exported to ISO 19115
we would lose this vital information unless we register it as an official keyword of our self-hosted dictionary. That is not the way to go for now. The other possibility would be to put vital metadata into the comments, which we are trying to overcome with the new db scheme.
@MarcusStrobl do you have any quick comments on this? Any alternative, which can be implemented until next week?
When passing data using the --json
flag, the CLI assumes that a list of objects is passed.
In metacatalog.api.io.from_json
the content of records
should be checked for instance type and converted into a list accordingly for convenience.
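The convenience check could look like this minimal sketch (coerce_records is a hypothetical helper, not part of the actual API):

```python
def coerce_records(records):
    """Hypothetical helper mirroring the check proposed for
    metacatalog.api.io.from_json: accept either a single mapping or a
    list of mappings, and always return a list."""
    if isinstance(records, dict):
        return [records]          # wrap a single object for convenience
    return list(records)          # already iterable: normalize to list
```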
We need a sapflow data table. This first implementation can be quick and dirty, only for the data samples we have.
API calls like:
from metacatalog import api
session = api.connect_database()
api.find_entry(session, variable='sap flow')
api.find_entry(session, license=2)
api.find_entry(session, details=dict(foo='bar'))
should just be possible. No need to change anything on the model side. A CLI change is nice to have, but not necessary.
Should add an environment.yml
to favor conda installs over pip installs for dependencies, wherever possible.
Alembic is already needed on the Bridget presentation
milestone.
ISO 19115 defines the MD_ReferenceSystem
code list for spatial and temporal reference systems. This needs to be further described by models.DataSource
. Maybe this can be implemented with default values, depending on the data type, together with #61
The EntryGroup
of type EntryGroupType.name=='composite'
maps to an ISO 19115 object of type MD_AggregateInformation
. We need to check if the info can be mapped.
Seems like the matric potential unit got lost in the last DB revision.
Most geodata harvest portals and catalogs recommend using a UUID for the ISO 19115 MD_Metadata.fileIdentifier
field.
My proposal: add an additional column Entry.uuid
and keep Entry.id
as the unique identifier within metacatalog, as MD_Metadata.fileIdentifier
is a free-text field, which would slow metacatalog down significantly. I prefer to run queries against an Integer
primary key.
Do you have an opinion on this @MarcusStrobl ?
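A plain-Python sketch of the proposed split; only the two attribute names follow the proposal, everything else is illustrative:

```python
from uuid import UUID, uuid4

class Entry:
    """Sketch only: integer primary key for fast internal queries,
    UUID for export as the ISO 19115 MD_Metadata.fileIdentifier."""
    def __init__(self, id: int):
        self.id = id                  # metacatalog-internal primary key
        self.uuid = str(uuid4())      # stable, globally unique identifier
```

Internal joins and lookups keep using the integer key; only the metadata export would ever touch the UUID column.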
Whenever Entry
objects are loaded from the database, there should be a filter_duplicates
option to include a filter like
WHERE latest_version_id is not null
to only load the latest version of an Entry
.
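A plain-Python sketch of how that option could behave; the dict keys mirror the column names mentioned above, while the real implementation would add the WHERE clause to the SQLAlchemy query instead:

```python
def load_entries(rows, filter_duplicates=True):
    """Sketch only: rows are plain dicts standing in for Entry records.
    Following the filter proposed above, rows whose latest_version_id is
    NULL/None are dropped when filter_duplicates is enabled."""
    if not filter_duplicates:
        return list(rows)
    return [r for r in rows if r.get('latest_version_id') is not None]
```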
Details is working fine, but there are two mid-term improvements.
a) The details need a description. It will be a bit complicated to put this into practice, as I use **kwargs
to collect details. This has to be changed, though.
b) As soon as the first few thousand details are loaded, we have to see if we tokenize them before stemming, and maybe turn stem into an array datatype.
There are two enhancements:
a) do not only pass **kwargs
, but update datasource.args
with kwargs
to find optional settings.
b) a dict
, with API-wide register_*
functions to set custom importer and reader.

Entry
records need to be findable by location, to actually make use of PostGIS. I see the scenario:
api.find_entry
takes an additional parameter called location
. The parameter value can be:
a) a tuple or list of len==2, or a shapely.Point, without any additional information: an exact match, which might not be super helpful
b) a tuple or list of len==2, or a shapely.Point, and a radius (float) to search for Entry records in a radius around that point
c) a tuple or list of dim==2, or a shapely.Polygon: Entry records within that geometry will be found. A radius might be given.
d) a str, which will be interpreted as WKT and turned into a shapely.Geometry

radius
could be an additional parameter, to keep location
simpler. list
and tuple
should be transformed to a shapely.Geometry, to verify geometric integrity.
It could also be possible to pass an integer only, which would refer to a geometry-table in PostgreSQL, containing pre-defined search geometries (like Catchments or whatsoever).
The whole logic has to go into a new spatial
submodule (maybe under util
) so that a new Entry.neighbors
could be introduced as well.
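A sketch of how the accepted values could be normalized before building the PostGIS query; in practice shapely would do the geometry handling, so this dispatches only on the plain-Python cases (parse_location is a hypothetical helper):

```python
def parse_location(location):
    """Sketch only: normalize the accepted location values into a
    (kind, value) tuple; shapely objects and actual geometry validation
    are left out of this plain-Python version."""
    if isinstance(location, str):
        # WKT string: would be parsed with shapely.wkt.loads later
        return ('wkt', location)
    if isinstance(location, (list, tuple)):
        if len(location) == 2 and all(
                isinstance(v, (int, float)) for v in location):
            return ('point', tuple(location))       # single coordinate pair
        # otherwise: sequence of coordinate pairs -> search polygon
        return ('polygon', tuple(tuple(p) for p in location))
    raise TypeError('unsupported location value: %r' % (location,))
```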
Which columns should be nullable
in eddy_data.py?
Originally posted by @mmaelicke in #2
I am unhappy with the 'end'
attribute of an Entry
.
It's not clear and can't be mapped into ISO19115.
I worked a bit more with ISO 19115 and found the CI_DateTypeCode
. I am not sure how to map the current end date of e.g. a time series into this ISO CodeDefinition. It seems like the ISO does not have a date for the end of a time series or dataset.
One possibility would be to implement the full code list as a lookup table and create an actual versioning history that records event types. ISO 19115 and ISO 19115-1 define the MD_ProgressCode
for this purpose, which has to go into another lookup table.
Then, an Entry
in 'onGoing'
progress state can have a date of type 'lastUpdate' in the newly created history. Feels like a bit of an overkill, but I don't know how else we can offer ISO-compliant metadata without losing important information.
A combination of both, linked to an Entry.id
, would give a pretty neat version history but also require some substantial refactoring. Therefore, if accepted, I would put this into the next larger release.
The logic would then be that an Entry.version
identifies the version of the metadata itself, whereas Entry.latest_version_id
on other entries refers to the most recent one. Versioning information about the data would be called history and described by a data history table using ISO CodeDefinitions. On export, the requested Entry
can use its progress state date to extract only the part of the history that is relevant.
<codelistItem>
<CodeListDictionary gml:id="CI_DateTypeCode">
<gml:description>identification of when a given event occurred</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">CI_DateTypeCode</gml:identifier>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_creation">
<gml:description>date identifies when the resource was brought into existence</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">creation</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_publication">
<gml:description>date identifies when the resource was issued</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">publication</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_revision">
<gml:description>date identifies when the resource was examined or re-examined and improved or amended</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">revision</gml:identifier>
</CodeDefinition>
</codeEntry>
<!-- 19115-1 additions -->
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_expiry">
<gml:description>date identifies when resource expires</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">expiry</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_lastUpdate">
<gml:description>date identifies when resource was last updated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">lastUpdate</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_lastRevision">
<gml:description>date identifies when resource was last reviewed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">lastRevision</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_nextUpdate">
<gml:description>date identifies when resource will be next updated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">nextUpdate</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_unavailable">
<gml:description>date identifies when resource became not available or obtainable</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">unavailable</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_inForce">
<gml:description>date identifies when resource became in force</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">inForce</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_adopted">
<gml:description>date identifies when resource was adopted</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">adopted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_deprecated">
<gml:description>date identifies when resource was deprecated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">deprecated</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_superseded">
<gml:description>date identifies when resource was superseded or replaced by another resource</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">superseded</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_validityBegins">
<gml:description>time at which the data are considered to become valid. NOTE: There could be quite a delay between creation and validity begins</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">validityBegins</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
...
</codeEntry>
<codeEntry>
...
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_distribution">
<gml:description>date identifies when an instance of the resource was distributed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">distribution</gml:identifier>
</CodeDefinition>
</codeEntry>
</CodeListDictionary>
</codelistItem>
<codelistItem>
<CodeListDictionary gml:id="MD_ProgressCode">
<gml:description>status of the dataset or progress of a review</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">MD_ProgressCode</gml:identifier>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_completed">
<gml:description>production of the data has been completed</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">completed</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_historicalArchive">
<gml:description>data has been stored in an offline storage facility</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">historicalArchive</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_obsolete">
<gml:description>data is no longer relevant</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">obsolete</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_onGoing">
<gml:description>data is continually being updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">onGoing</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_planned">
<gml:description>fixed date has been established upon or by which the data will be created or updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">planned</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_required">
<gml:description>data needs to be generated or updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">required</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_underDevelopment">
<gml:description>data is currently in the process of being created</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">underDevelopment</gml:identifier>
</CodeDefinition>
</codeEntry>
<!-- 19115-1 -->
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_final">
<gml:description>progress concluded and no changes will be accepted</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">final</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_pending">
<gml:description>committed to, but not yet addressed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">pending</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_retired">
<gml:description>item is no longer recommended for use. It has not been superseded by another item</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">retired</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_superseded">
<gml:description>replaced by new</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">superseded</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_tentative">
<gml:description>provisional changes likely before resource becomes final or complete</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">tentative</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_valid">
<gml:description>acceptable under specific conditions</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">valid</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_accepted">
<gml:description>agreed to by sponsor</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">accepted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_notAccepted">
<gml:description>rejected by sponsor</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">notAccepted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_withdrawn">
<gml:description>removed from consideration</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">withdrawn</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_proposed">
<gml:description>suggested that development needs to be undertaken</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">proposed</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_deprecated">
<gml:description>resource superseded and will become obsolete, use only for historical purposes</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">deprecated</gml:identifier>
</CodeDefinition>
</codeEntry>
</CodeListDictionary>
</codelistItem>
Each observation in metacatalog should contain a statement about its certainty. Unlike #11 , where an arbitrary amount of uncertainty timeseries can be added to a dataset, each one described by its own set of metadata, this uncertainty should be a core feature of metacatalog and usually be used to plot error bars.
It can be implemented as an additional column in each default data table.
But:
Suggestions? @sihassl @MarcusStrobl ?
It should be possible to create a project like:
api.add_project(session, entries=[1,2,3,4,5,6], name='Awesome project')
And a composite dataset like:
entry = api.find_entry(session, id=5)[0]
entry.make_composite(name='Stuff that belongs together', entries=[5,4,6])
Due to the limited set of values from the MD_TopicCategoryCode enum
that apply to metacatalog, it might be possible to map the refactored keywords at topic level to the topic category after #70.
Alternatively, this can always be filled with 'geoscientificInformation'
.
I want to have at least a very thin testing pipeline in metacatalog to use as a check against future pull requests.
Something like: prepare data in csv -> install db without error -> populate without error -> run some data upload and management tests -> test db saves against prepared csv.
That's not perfect, but better than no tests at all.
The migration tests already test the exception if the database is behind the metacatalog version. The other way around is not tested.
This should be added to the migration tests by:
a) setting the HEAD revision id to one that alembic does not recognize
b) running check_no_version_mismatch again

To store uncertainty of measurement (or modelling) along with observations, we need to implement an uncertainty datatype. This is not about the default measurement error, but for the case that a user wants to store a whole timeseries of uncertainty assessments and group it to existing data as additional information.
For a proper name, suggestions are highly appreciated.
Also, do we want it just like that and let the metadata figure out what exactly it describes (cross-validation results, device uncertainty, modelling error etc.), or do we want to implement different types of uncertainty?
I would go for only one uncertainty type as of now. @sihassl ?
In the old scheme, we had Sensor
s to store information about the device that collected the data. In metacatalog>=0.2
the definition of Variable
will be much broader than a physical parameter, and thus the old Sensor
should be implemented more abstractly, like DataOrigin
or Provenance
, where a physical sensor device is just one type of Origin
.
Not so sure how to handle the different fields needed to describe types like DataProduct
, ModelOutput
and Observation
in one model, and not sure if that makes sense.
We need some ideas on how the details can be utilized to find Entry
records by ID. Either add it as a parameter to api.find_entry
or put it into a new API endpoint.
We need to implement the ISO 19115 and INSPIRE DQ_DataQuality
table anyway and link it to every Entry
. If anything we want to cover is not covered there, we can use metadataExtensionInfo
objects that are prefixed by dq_
.
@AlexDo1 . This is the original error message when running a clean install:
Well, there is something wrong with the csv file containing the default lookup data. That's an easy fix. But on that turn, I will improve the error management of the CLI as well.
But this one is on me.
@AlexDo1 We need the table for managing Eddy Covariance data. This kind of data will go into a newly created table. Please add all necessary columns to the model definition in eddy_data.py. You can refer to the other models to see how sqlalchemy works.
The Keyword
table will be refactored in 0.2
anyway. On this issue, we need to make sure that they align with ISO 19115 and ISO 19115-2 MD_Keywords
after refactoring.
The find
command just prints the found model instances to StdOut, which is not always very helpful. The CLI should include some output flags like --json
and --csv
to output the result as JSON or CSV. For this to happen, the models have to include a to_dict
function.
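A sketch of the output path, assuming the models already expose to_dict()-style mappings (plain dicts here); format_results is a hypothetical helper:

```python
import csv
import io
import json

def format_results(records, fmt='json'):
    """Sketch only: serialize a list of dicts (as returned by a
    hypothetical to_dict()) for the --json and --csv CLI flags."""
    if fmt == 'json':
        return json.dumps(records)
    if fmt == 'csv':
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError('unknown format: %s' % fmt)
```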
We need a test pipeline for EntryGroups and api.find_entry
Most API Python functions print info and errors all over the place. Also, caught and handled errors are printed for development. Maybe the CLI needs:
a) a --dev
flag for full output
b) a --quiet
flag for no output
c) a --verbose
flag for extended output.
Store MD_CharacterSetCode as a lookup table for DataSource. It should default to 'utf-8'
and warn if the value is overwritten.
The problem should be obvious.
With Entry.contributors
already implemented, there should be two helper properties that return the first author on Entry.author
and a list of all authors on Entry.authors
. These are all contributors of role 'author'
or 'coAuthor'
following ISO19115.
In a later issue the property.setter
functions can be implemented.
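A plain-Python sketch of the two read-only helpers; contributors are modelled as (name, role) tuples here, while the real model uses association objects:

```python
class Entry:
    """Sketch only: illustrative stand-in for the Entry model with an
    already-implemented contributors relation."""
    def __init__(self, contributors):
        # list of (name, role) tuples, in contribution order
        self.contributors = contributors

    @property
    def authors(self):
        # all contributors with an ISO 19115 author role, in order
        return [name for name, role in self.contributors
                if role in ('author', 'coAuthor')]

    @property
    def author(self):
        # convention: the first author-role contributor is the first author
        return self.authors[0]
```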
Most of the api functions are documented, but the docstrings are not yet added to the documentation.
Should be an easy one.
Use sphinx.ext.autodoc (e.g. the .. automodule::
directive) for this
@MarcusStrobl , we had another discussion about Eddy data yesterday and concluded that we need to introduce a new column to Entry
, called is_partial
, a boolean column that is False
by default. If an Entry
is partial, it has to be part of an EntryGroup
of type==composite
. The metadata described by this kind of Entry
only makes sense in conjunction with the other Entry
records in the same composite, while the others are self-contained and also make sense without the partial Entry
.
The Eddy data in principle consists of wind measurements (Entry no. 1), e.g. CO2 fluxes (Entry no. 2), and covariances between all the components (<- the partial Entry). It makes sense to use only the wind measurements or the CO2 fluxes without caring about the Eddy tower; you can use either without the other. But if you are interested in Eddy data, you need all three. Thus, if a user searches for wind, he should find Entry no. 1 (and only that), with a note that the data is part of a composite. If he searches for Eddy (the EntryGroup) and 'clicks' on Entry no. 1, the system has to load Entry no. 1, no. 2 and the partial Entry.
The partial Entry itself does not make any sense on its own and should not be found.
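The constraint could be sketched like this, with plain dicts standing in for the Entry and EntryGroup models (validate_partial is a hypothetical helper):

```python
def validate_partial(entry, groups):
    """Sketch only: a partial Entry must be a member of at least one
    EntryGroup of type 'composite'; non-partial entries always pass."""
    if not entry.get('is_partial', False):
        return True
    return any(
        g['type'] == 'composite' and entry['id'] in g['members']
        for g in groups
    )
```

A search endpoint would additionally exclude rows with is_partial set, matching the behaviour described above.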
checklist was transferred to the PR #58
In the long run, metacatalog should be able to run full-text searches against the database. For a first step, indexing the title
and name
fields, along with the abstract
and comment
, should be enough.
The search engine should be an abstract class, that takes a search key and some configuration object (javascript like - as a dictionary). This should make it possible to implement Postgresql full-text search as the default option, but also reach out to a synched elasticsearch instance as a 'pro' feature.
ISO 19115 defines the objects MD_Usage
and MD_Constraints
, which both map into models.License
. We need to check if and how these objects can be represented.
Enhance the GitHub Action to add some coverage reports.
Codecov has an action: https://github.com/codecov/codecov-action
Whenever an api.find_*
function accepts string input, a like=True
flag should be available to switch the sqlalchemy
filter from a literal match to PostgreSQL LIKE
syntax. The given search string can then automatically replace *
with %
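The wildcard translation itself is a one-liner (escaping of literal % characters in the user's input is left out of this sketch):

```python
def to_like_pattern(search: str) -> str:
    """Sketch only: translate user-facing '*' wildcards into
    PostgreSQL LIKE '%' wildcards."""
    return search.replace('*', '%')
```

The resulting pattern would then be passed to something like `column.like(to_like_pattern(search))` instead of an equality filter.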