vforwater / metacatalog
Modular metadata management platform for environmental data.
Home Page: https://vforwater.github.io/metacatalog
License: GNU General Public License v3.0
We ultimately need an Entry.citation
property. This can be implemented either as a text field with a default value, or as a lookup table with more structured information filled by default.
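A plain-Python sketch of the first variant: a default citation string built from structured fields, overridable by a free-text value. All attribute names besides citation are illustrative, not the actual model fields:

```python
class Entry:
    """Sketch only: illustrative stand-in for the metacatalog Entry model."""
    def __init__(self, title, authors, publication_year, citation=None):
        self.title = title
        self.authors = authors                    # list of 'Last, F.' strings
        self.publication_year = publication_year
        self._citation = citation                 # free-text override, if set

    @property
    def citation(self):
        # an explicit free-text citation wins; otherwise build a default
        # value from the structured fields
        if self._citation is not None:
            return self._citation
        return '%s (%d): %s.' % (
            ', '.join(self.authors), self.publication_year, self.title
        )
```

The lookup-table variant would replace the string template with a reference to a citation-style table.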
Whenever creating a connection to the database, the content of the alembic_version
table should be checked against the revisions present in the current metacatalog repo. Two kinds of warnings can be raised:
This should be done quickly, to make the upload scripts a bit more expressive.
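A minimal sketch of such a check, assuming the two revision ids have already been read (one from the alembic_version table, one from the local migration scripts); the function name check_db_version is hypothetical:

```python
import warnings

def check_db_version(db_head: str, local_head: str) -> bool:
    """Compare the database's alembic_version revision against the
    local head revision. Hypothetical helper; in practice the ids would
    come from the alembic_version table and alembic's ScriptDirectory."""
    if db_head == local_head:
        return True
    # the two mismatch directions could raise different warning classes
    warnings.warn(
        'Database revision %s does not match local head %s. '
        'Run the migrations before continuing.' % (db_head, local_head)
    )
    return False
```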
A data source needs two types: a data type and an origin type, in order to finally translate it into ISO 19115.
Link them to MD_DatatypeCode and MD_CoverageContentTypeCode (the latter if raster).
Is there a reason to write geneic
in geometry_data.py
or is this just a typo?
class GenericGeometryData(Base):
__tablename__ = 'geneic_geometry_data'
In ISO 19115 it is possible to define extensions that describe metadata not part of ISO. We could bind that to models.Detail
and move everything into details that cannot be mapped to ISO directly.
As of now, the docs are a real mess. The CI is working but the output is not really helpful. Will have a look into this...
Some Entry records only apply to specific Entry types, and some optional ISO 19115 code lists can only be filled if the data source provides sufficient information. Thus, we need a global object that can be adapted by the user to search Detail
keys for names that are defined in ISO 19115 code lists.
Already for Sap Flow data and Eddy data, the proposed controlled Keyword concept no longer works. I will refactor it into:
a) controlled keywords fed from different dictionaries, which just tag an Entry
to apply keyword filters.
b) an arbitrary amount of key-value pairs that can be linked to an Entry
. I will test out some stemming libraries in Python to standardize the keys to a specific amount. NoSQL would be great here, but that's not the way to go.
One example that no longer worked was the metadata of sap flow sensors: how deep they are built into a tree. If I introduced a custom keyword that describes the horizontal depth in a tree in a standardized manner, it might work for metacatalog, but when exported to ISO 19115
we would lose this vital information unless we register it as an official keyword of our self-hosted dictionary. That is not the way to go for now. The other possibility would be to put vital metadata into the comments, which we are trying to overcome with the new db scheme.
@MarcusStrobl do you have any quick comments on this? Any alternative, which can be implemented until next week?
When passing data using the --json
flag, the CLI assumes that a list of objects is passed.
In metacatalog.api.io.from_json
the content of records
should be checked for instance type and converted into a list accordingly for convenience.
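The convenience check could look like this minimal sketch (coerce_records is a hypothetical helper, not part of the actual API):

```python
def coerce_records(records):
    """Hypothetical helper mirroring the check proposed for
    metacatalog.api.io.from_json: accept either a single mapping or a
    list of mappings, and always return a list."""
    if isinstance(records, dict):
        return [records]          # wrap a single object for convenience
    return list(records)          # already iterable: normalize to list
```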
We need a sapflow data table. This first implementation can be quick and dirty, only for the data samples we have.
API calls like:
from metacatalog import api
session = api.connect_database()
api.find_entry(session, variable='sap flow')
api.find_entry(session, license=2)
api.find_entry(session, details=dict(foo='bar'))
should just be possible. No need to change anything on the model side. A CLI change is nice to have, but not necessary.
Should add an environment.yml
to favor conda installs over pip installs for dependencies, wherever possible.
Alembic is already needed on the Bridget presentation
milestone.
ISO 19115 defines the MD_ReferenceSystem
code list for spatial and temporal reference systems. This needs to be further described by models.DataSource
. Maybe this can be implemented with default values, depending on the data type, together with #61
The EntryGroup
of type EntryGroupType.name=='composite'
maps to an ISO 19115 object of type MD_AggregateInformation
. We need to check if the info can be mapped.
Seems like the matric potential unit got lost in the last DB revision.
Most geodata harvest portals and catalogs recommend using a UUID for the ISO 19115 MD_Metadata.fileIdentifier
field.
My proposal: add an additional column Entry.uuid
and keep Entry.id
as the unique identifier within metacatalog, as MD_Metadata.fileIdentifier
is a free-text field, which would slow metacatalog down significantly. I prefer to run queries against an Integer
primary key.
Do you have an opinion on this @MarcusStrobl ?
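A plain-Python sketch of the proposed split; only the two attribute names follow the proposal, everything else is illustrative:

```python
from uuid import UUID, uuid4

class Entry:
    """Sketch only: integer primary key for fast internal queries,
    UUID for export as the ISO 19115 MD_Metadata.fileIdentifier."""
    def __init__(self, id: int):
        self.id = id                  # metacatalog-internal primary key
        self.uuid = str(uuid4())      # stable, globally unique identifier
```

Internal joins and lookups keep using the integer key; only the metadata export would ever touch the UUID column.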
Whenever Entry
objects are loaded from the database, there should be a filter_duplicates
option to include a filter like
WHERE latest_version_id is not null
to only load the latest version of an Entry
.
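A plain-Python sketch of how that option could behave; the dict keys mirror the column names mentioned above, while the real implementation would add the WHERE clause to the SQLAlchemy query instead:

```python
def load_entries(rows, filter_duplicates=True):
    """Sketch only: rows are plain dicts standing in for Entry records.
    Following the filter proposed above, rows whose latest_version_id is
    NULL/None are dropped when filter_duplicates is enabled."""
    if not filter_duplicates:
        return list(rows)
    return [r for r in rows if r.get('latest_version_id') is not None]
```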
Details is working fine, but there are two mid-term improvements.
a) The details need a description. It will be a bit complicated to put this into practice, as I use **kwargs
to collect details. This has to be changed, though.
b) As soon as the first few thousand details are loaded, we have to see if we tokenize them before stemming, and maybe turn stem into an array datatype.
There are two enhancements:
a) do not only pass **kwargs
, but update datasource.args
with kwargs
to find optional settings.
b) a dict
, with API-wide register_*
functions to set custom importer and reader.

Entry
records need to be findable by location, to actually make use of PostGIS. I see the scenario:
api.find_entry
takes an additional parameter called location
. The parameter value can be:
a) a tuple or list of len==2, or a shapely.Point, without any additional information: an exact match, which might not be super helpful
b) a tuple or list of len==2, or a shapely.Point, and a radius (float) to search for Entry records in a radius around that point
c) a tuple or list of dim==2, or a shapely.Polygon: Entry records within that geometry will be found. A radius might be given.
d) a str, which will be interpreted as WKT and turned into a shapely.Geometry

radius
could be an additional parameter, to keep location
simpler. list
and tuple
should be transformed to a shapely.Geometry, to verify geometric integrity.
It could also be possible to pass an integer only, which would refer to a geometry-table in PostgreSQL, containing pre-defined search geometries (like Catchments or whatsoever).
The whole logic has to go into a new spatial
submodule (maybe under util
) so that a new Entry.neighbors
could be introduced as well.
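A sketch of how the accepted values could be normalized before building the PostGIS query; in practice shapely would do the geometry handling, so this dispatches only on the plain-Python cases (parse_location is a hypothetical helper):

```python
def parse_location(location):
    """Sketch only: normalize the accepted location values into a
    (kind, value) tuple; shapely objects and actual geometry validation
    are left out of this plain-Python version."""
    if isinstance(location, str):
        # WKT string: would be parsed with shapely.wkt.loads later
        return ('wkt', location)
    if isinstance(location, (list, tuple)):
        if len(location) == 2 and all(
                isinstance(v, (int, float)) for v in location):
            return ('point', tuple(location))       # single coordinate pair
        # otherwise: sequence of coordinate pairs -> search polygon
        return ('polygon', tuple(tuple(p) for p in location))
    raise TypeError('unsupported location value: %r' % (location,))
```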
Which columns should be nullable
in eddy_data.py?
Originally posted by @mmaelicke in #2
I am unhappy with the 'end'
attribute of an Entry
.
It's not clear and can't be mapped into ISO19115.
I worked a bit more with ISO 19115 and found the CI_DateTypeCode
. I am not sure how to map the current end date of e.g. a time series into this ISO CodeDefinition. It seems like the ISO does not have a date for the end of a time series or dataset.
One possibility would be to implement the full code list as a lookup table and create an actual versioning history that records event types. ISO 19115 and ISO 19115-1 define the MD_ProgressCode
for this purpose, which has to go into another lookup table.
Then, an Entry
in 'onGoing'
progress state can have a date of type 'lastUpdate' in the newly created history. Feels like a bit of an overkill, but I don't know how else we can offer ISO-compliant metadata without losing important information.
A combination of both, linked to an Entry.id
, would give a pretty neat version history but also require some substantial refactoring. Therefore, if accepted, I would put this into the next larger release.
The logic would then be that an Entry.version
identifies the version of the metadata itself, whereas Entry.latest_version_id
on other entries refers to the most recent one. Versioning information about the data would be called history and described by a data history table using ISO CodeDefinitions. On export, the requested Entry
can use its progress state date to extract only the part of the history that is relevant.
<codelistItem>
<CodeListDictionary gml:id="CI_DateTypeCode">
<gml:description>identification of when a given event occurred</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">CI_DateTypeCode</gml:identifier>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_creation">
<gml:description>date identifies when the resource was brought into existence</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">creation</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_publication">
<gml:description>date identifies when the resource was issued</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">publication</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_revision">
<gml:description>date identifies when the resource was examined or re-examined and improved or amended</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">revision</gml:identifier>
</CodeDefinition>
</codeEntry>
<!-- 19115-1 additions -->
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_expiry">
<gml:description>date identifies when resource expires</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">expiry</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_lastUpdate">
<gml:description>date identifies when resource was last updated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">lastUpdate</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_lastRevision">
<gml:description>date identifies when resource was last reviewed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">lastRevision</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_nextUpdate">
<gml:description>date identifies when resource will be next updated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">nextUpdate</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_unavailable">
<gml:description>date identifies when resource became not available or obtainable</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">unavailable</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_inForce">
<gml:description>date identifies when resource became in force</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">inForce</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_adopted">
<gml:description>date identifies when resource was adopted</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">adopted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_deprecated">
<gml:description>date identifies when resource was deprecated</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">deprecated</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_superseded">
<gml:description>date identifies when resource was superseded or replaced by another resource</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">superseded</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_validityBegins">
<gml:description>time at which the data are considered to become valid. NOTE: There could be quite a delay between creation and validity begins</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">validityBegins</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
...
</codeEntry>
<codeEntry>
...
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="CI_DateTypeCode_distribution">
<gml:description>date identifies when an instance of the resource was distributed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">distribution</gml:identifier>
</CodeDefinition>
</codeEntry>
</CodeListDictionary>
</codelistItem>
<codelistItem>
<CodeListDictionary gml:id="MD_ProgressCode">
<gml:description>status of the dataset or progress of a review</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">MD_ProgressCode</gml:identifier>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_completed">
<gml:description>production of the data has been completed</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">completed</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_historicalArchive">
<gml:description>data has been stored in an offline storage facility</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">historicalArchive</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_obsolete">
<gml:description>data is no longer relevant</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">obsolete</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_onGoing">
<gml:description>data is continually being updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">onGoing</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_planned">
<gml:description>fixed date has been established upon or by which the data will be created or updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">planned</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_required">
<gml:description>data needs to be generated or updated</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">required</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_underDevelopment">
<gml:description>data is currently in the process of being created</gml:description>
<gml:identifier codeSpace="ISOTC211/19115">underDevelopment</gml:identifier>
</CodeDefinition>
</codeEntry>
<!-- 19115-1 -->
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_final">
<gml:description>progress concluded and no changes will be accepted</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">final</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_pending">
<gml:description>committed to, but not yet addressed</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">pending</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_retired">
<gml:description>item is no longer recommended for use. It has not been superseded by another item</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">retired</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_superseded">
<gml:description>replaced by new</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">superseded</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_tentative">
<gml:description>provisional changes likely before resource becomes final or complete</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">tentative</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_valid">
<gml:description>acceptable under specific conditions</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">valid</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_accepted">
<gml:description>agreed to by sponsor</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">accepted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_notAccepted">
<gml:description>rejected by sponsor</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">notAccepted</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_withdrawn">
<gml:description>removed from consideration</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">withdrawn</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_proposed">
<gml:description>suggested that development needs to be undertaken</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">proposed</gml:identifier>
</CodeDefinition>
</codeEntry>
<codeEntry>
<CodeDefinition gml:id="MD_ProgressCode_deprecated">
<gml:description>resource superseded and will become obsolete, use only for historical purposes</gml:description>
<gml:identifier codeSpace="http://standards.iso.org/iso/19115">deprecated</gml:identifier>
</CodeDefinition>
</codeEntry>
</CodeListDictionary>
</codelistItem>
Each observation in metacatalog should contain a statement about its certainty. Unlike #11 , where an arbitrary amount of uncertainty timeseries can be added to a dataset, each one described by its own set of metadata, this uncertainty should be a core feature of metacatalog and usually be used to plot error bars.
It can be implemented as an additional column in each default data table.
But:
Suggestions? @sihassl @MarcusStrobl ?
It should be possible to create a project like:
api.add_project(session, entries=[1,2,3,4,5,6], name='Awesome project')
And a composite dataset like:
entry = api.find_entry(session, id=5)[0]
entry.make_composite(name='Stuff that belongs together', entries=[5,4,6])
Due to the limited set of values from the MD_TopicCategoryCode enum
that apply to metacatalog, it might be possible to map the refactored keywords at topic level to the topic category after #70.
Alternatively, this can always be filled with 'geoscientificInformation'
.
I want to have at least a very thin testing pipeline in metacatalog to use as a check against future pull requests.
Something like: prepare data in csv -> install db without error -> populate without error -> run some data upload and management tests -> test db saves against prepared csv.
That's not perfect, but better than no tests at all.
The migration tests already test the exception if the database is behind the metacatalog version. The other way around is not tested.
This should be added to the migration tests by:
a) setting the HEAD revision id to one that alembic does not recognize
b) running check_no_version_mismatch again

To store uncertainty of measurement (or modelling) along with observations, we need to implement an uncertainty datatype. This is not about the default measurement error, but for the case that a user wants to store a whole timeseries of uncertainty assessments and group it to existing data as additional information.
For a proper name, suggestions are highly appreciated.
Also, do we want it just like that and let the metadata figure out what exactly it describes (cross-validation results, device uncertainty, modelling error etc.), or do we want to implement different types of uncertainty?
I would go for only one uncertainty type as of now. @sihassl ?
In the old scheme, we had Sensor
s to store information about the device that collected the data. In metacatalog>=0.2
the definition of Variable
will be much broader than a physical parameter, and thus the old Sensor
should be implemented more abstractly, like DataOrigin
or Provenance
, where a physical sensor device is just one type of Origin
.
Not so sure how to handle the different fields needed to describe types like DataProduct
, ModelOutput
and Observation
in one model, and not sure if that makes sense.
We need some ideas on how the details can be utilized to find Entry
records by ID. Either add it as a parameter to api.find_entry
or put it into a new API endpoint.
We need to implement the ISO 19115 and INSPIRE DQ_DataQuality
table anyway and link it to every Entry
. If anything we want to cover is not covered there, we can use metadataExtensionInfo
objects that are prefixed by dq_
.
@AlexDo1 . This is the original error message when running a clean install:
Well, there is something wrong with the csv file containing the default lookup data. That's an easy fix. But on that turn, I will improve the error management of the CLI as well.
But this one is on me.
@AlexDo1 We need the table for managing Eddy Covariance data. This kind of data will go into a newly created table. Please add all necessary columns to the model definition in eddy_data.py. You can refer to the other models to see how sqlalchemy works.
The Keyword
table will be refactored in 0.2
anyway. On this issue, we need to make sure that they align with ISO 19115 and ISO 19115-2 MD_Keywords
after refactoring.
The find
command just prints the found model instances to StdOut, which is not always very helpful. The CLI should include some output flags like --json
and --csv
to output the result as JSON or CSV. For this to happen, the models have to include a to_dict
function.
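A sketch of the output path, assuming the models already expose to_dict()-style mappings (plain dicts here); format_results is a hypothetical helper:

```python
import csv
import io
import json

def format_results(records, fmt='json'):
    """Sketch only: serialize a list of dicts (as returned by a
    hypothetical to_dict()) for the --json and --csv CLI flags."""
    if fmt == 'json':
        return json.dumps(records)
    if fmt == 'csv':
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError('unknown format: %s' % fmt)
```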
We need a test pipeline for EntryGroups and api.find_entry
Most API Python functions print info and errors all over the place. Also, caught and handled errors are printed for development. Maybe the CLI needs:
a) a --dev
flag for full output
b) a --quiet
flag for no output
c) a --verbose
flag for extended output.
Store MD_CharacterSetCode as a lookup table for DataSource. It should default to 'utf-8'
and warn if the value is overwritten.
The problem should be obvious.
With Entry.contributors
already implemented, there should be two helper properties that return the first author on Entry.author
and a list of all authors on Entry.authors
. These are all contributors of role 'author'
or 'coAuthor'
following ISO19115.
In a later issue the property.setter
functions can be implemented.
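A plain-Python sketch of the two read-only helpers; contributors are modelled as (name, role) tuples here, while the real model uses association objects:

```python
class Entry:
    """Sketch only: illustrative stand-in for the Entry model with an
    already-implemented contributors relation."""
    def __init__(self, contributors):
        # list of (name, role) tuples, in contribution order
        self.contributors = contributors

    @property
    def authors(self):
        # all contributors with an ISO 19115 author role, in order
        return [name for name, role in self.contributors
                if role in ('author', 'coAuthor')]

    @property
    def author(self):
        # convention: the first author-role contributor is the first author
        return self.authors[0]
```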
Most of the api functions are documented, but the docstrings are not yet added to the documentation.
Should be an easy one.
Use sphinx.ext.autodoc (e.g. the .. automodule::
directive) for this
@MarcusStrobl , we had another discussion about Eddy data yesterday and concluded that we need to introduce a new column to Entry
, called is_partial
, a boolean column that is False
by default. If an Entry
is partial, it has to be part of an EntryGroup
of type==composite
. The metadata described by this kind of Entry
only makes sense in conjunction with the other Entry
records in the same composite, while the others are self-contained and also make sense without the partial Entry
.
The Eddy data in principle consists of wind measurements (Entry no. 1), e.g. CO2 fluxes (Entry no. 2), and covariances between all the components (<- the partial Entry). It makes sense to use only the wind measurements or the CO2 fluxes without caring about the Eddy tower; you can use either without the other. But if you are interested in Eddy data, you need all three. Thus, if a user searches for wind, he should find Entry no. 1 (and only that), with a note that the data is part of a composite. If he searches for Eddy (the EntryGroup) and 'clicks' on Entry no. 1, the system has to load Entry no. 1, no. 2 and the partial Entry.
The partial Entry itself does not make any sense on its own and should not be found.
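The constraint could be sketched like this, with plain dicts standing in for the Entry and EntryGroup models (validate_partial is a hypothetical helper):

```python
def validate_partial(entry, groups):
    """Sketch only: a partial Entry must be a member of at least one
    EntryGroup of type 'composite'; non-partial entries always pass."""
    if not entry.get('is_partial', False):
        return True
    return any(
        g['type'] == 'composite' and entry['id'] in g['members']
        for g in groups
    )
```

A search endpoint would additionally exclude rows with is_partial set, matching the behaviour described above.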
checklist was transferred to the PR #58
In the long run, metacatalog should be able to run full-text searches against the database. For a first step, indexing the title
and name
fields, along with the abstract
and comment
, should be enough.
The search engine should be an abstract class, that takes a search key and some configuration object (javascript like - as a dictionary). This should make it possible to implement Postgresql full-text search as the default option, but also reach out to a synched elasticsearch instance as a 'pro' feature.
ISO 19115 defines the objects MD_Usage
and MD_Constraints
, which both map into models.License
. We need to check if and how these objects can be represented.
Enhance the GitHub Action to add some coverage reports.
Codecov has an action: https://github.com/codecov/codecov-action
Whenever an api.find_*
function accepts string input, a like=True
flag should be available to switch the sqlalchemy
filter from a literal match to PostgreSQL LIKE
syntax. The given search string can then automatically replace *
with %
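The wildcard translation itself is a one-liner (escaping of literal % characters in the user's input is left out of this sketch):

```python
def to_like_pattern(search: str) -> str:
    """Sketch only: translate user-facing '*' wildcards into
    PostgreSQL LIKE '%' wildcards."""
    return search.replace('*', '%')
```

The resulting pattern would then be passed to something like `column.like(to_like_pattern(search))` instead of an equality filter.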