codait / exchange-metadata-converter

Basic conversion utility for YAML-based metadata descriptors
License: Apache License 2.0
This data would be fed both to the DAX API and to our DAX data previews. I propose this structure:
content:
  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    description: Raw data file
    records: 114546
    size: 30M
    type: CSV
    mime_type: text/csv
    column_types:
      STATION: str
      STATION_NAME: str
      ELEVATION: float
      LATITUDE: float
      ...
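A minimal consistency check for entries in this structure could look like the following sketch. The required-key set is an assumption drawn from the example above, not a finalized schema:

```python
# Hypothetical required keys for each `content` entry, taken from the
# example descriptor above; the real schema is still under discussion.
REQUIRED_KEYS = {"file_name", "description", "records", "size", "type", "mime_type"}

def missing_keys(entry: dict) -> list:
    """Return the required keys absent from a content entry, sorted."""
    return sorted(REQUIRED_KEYS - entry.keys())
```

For example, an entry containing only `file_name` would report `description`, `mime_type`, `records`, `size`, and `type` as missing.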
Currently, IDs use dashes instead of underscores.
Using underscores would potentially ease Python users' lives by allowing them to type IDs as attribute names (e.g. `dax.datasets.noaa_weather_data_jfk_airport`). If that is not possible, we would have to do a dash-to-underscore conversion on one end.
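The dash-to-underscore conversion mentioned above could be as simple as the following sketch. The helper names are illustrative, and the reverse mapping assumes original IDs never contain underscores themselves:

```python
def to_attr_name(dataset_id: str) -> str:
    """Map a dashed dataset ID to a Python-attribute-safe name."""
    return dataset_id.replace("-", "_")

def to_dataset_id(attr_name: str) -> str:
    """Reverse mapping; only safe if original IDs contain no underscores."""
    return attr_name.replace("_", "-")
```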
My review is written from the perspective of usage in OpenAIHub and of what end-users want in general.
Reference:
Comments:
I used only the JFK YAML for this review.
yaml.load(sys.arg...
=> yaml.load(Path(sys.arg...
Problems can be located by running git grep "yaml.load(sy" *
The metadata.name in the generated DLF YAML does not comply with the Kubernetes spec for DNS-1123 subdomain names.
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Dataset.com.ie.ibm.hpsys \"Finance Proposition Bank\" is invalid: metadata.name: Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
  "reason": "Invalid",
  "details": {
    "name": "Finance Proposition Bank",
    "group": "com.ie.ibm.hpsys",
    "kind": "Dataset",
    "causes": [
      {
        "reason": "FieldValueInvalid",
        "message": "Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
        "field": "metadata.name"
      }
    ]
  },
  "code": 422
}
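One possible fix is to sanitize the dataset title before emitting it as metadata.name. The following is a hedged sketch (the function name is illustrative, and the converter may ultimately choose a different normalization policy):

```python
import re

def to_dns1123_subdomain(name: str) -> str:
    """Convert a free-form title like 'Finance Proposition Bank'
    into a DNS-1123-compliant name for metadata.name."""
    s = name.lower()
    s = re.sub(r"[^a-z0-9.-]+", "-", s)  # replace disallowed characters
    s = re.sub(r"-{2,}", "-", s)         # collapse runs of dashes
    s = s.strip("-.")                    # must start/end alphanumeric
    return s[:253]                       # subdomain length limit
```

For the failing example above, this would yield `finance-proposition-bank`, which passes the validation regex quoted in the error message.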
Currently, all {{...}} placeholders in https://github.com/CODAIT/exchange-metadata-converter/tree/main/templates are considered required, so each placeholder input file must define them. Annotations would solve this issue. Investigate what it takes to support something like the following, where @annotation_key serves as a hint to the processing engine and does not invalidate the YAML file because it is specified as a comment:
property: '{{value}}' # just a comment
another:
  property2: '{{value2}}' # @optional and a comment
third_property: # comment only
fourth_property:
  fifth_property: '{{value6}}' # @annotation_only
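A first pass at extracting such annotations could scan comment text line by line, as in this sketch. The regex and function name are assumptions; a production implementation would more likely use a comment-preserving YAML parser such as ruamel.yaml:

```python
import re

# Matches a {{placeholder}} followed later on the line by a comment
# containing an @annotation key.
ANNOTATION = re.compile(r"\{\{(?P<placeholder>\w+)\}\}.*?#.*?@(?P<key>\w+)")

def scan_annotations(yaml_text: str) -> dict:
    """Map placeholder name -> annotation key for annotated lines."""
    result = {}
    for line in yaml_text.splitlines():
        m = ANNOTATION.search(line)
        if m:
            result[m.group("placeholder")] = m.group("key")
    return result
```

Applied to the example above, `value2` would map to `optional` and `value6` to `annotation_only`, while `value` (plain comment) would carry no annotation.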
The current proposal is to release a new version of ORSD that no longer has nested archives. Then we can use a structure like:
content:
  - file_name: data/SPE9-TRIANGLE.Aspect1/test
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect1/train
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_test
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_train
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect3.compressed.h5
    ...
  - file_name: data/SPE9-MAX.Aspect1
    ...
  - file_name: data/SPE9-MAX.Aspect2
    ...
  - file_name: data/SPE9-MAX.Aspect3.compressed.h5
    ...
Keep in mind that the archive-level description field for the dataset will need to describe the content composition, e.g. "...contains two versions of the dataset: SPE9-TRIANGLE, which... and SPE9-MAX, which..."
Currently the example uses:
# TBD how to handle compound types (a data set comprising multiple files that use different formats)
format:
  type: CSV
  mime_type: text/csv
But as we know, some DAX datasets contain more than one subdataset format. Some potential solutions:
Information on how to locate all the files belonging to a certain subdataset is important for the DAX API and for how it handles loading subdatasets. Note that this is different from a subdataset's format, which is simply the file format of the subdataset.
Examples of subdataset types include train/test splits (indicated, for example, by train_ appended at the start of the filename).
We need to determine what subdataset types there are and how to include this information. The current proposal is this:
Simple file:
- file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
  ...
  format: CSV # rename from type to format
  ...
  type: path_name
  value: noaa-weather-data-jfk-airport/jfk_weather.csv
Regex:
- file_name: publaynet/train
  ...
  type: regex
  value: "train/*"
List of files:
- file_name: tfsc/train_list.txt
  ...
  type: list_of_files
  value: tfsc/train_list.txt
There's probably a better way of structuring this that avoids the file_name being the same as the value field in some cases, but it's a start.
Is https://pypi.org/project/ruamel.yaml/ better with respect to