codait / exchange-metadata-converter

Basic conversion utility for YAML-based metadata descriptors
License: Apache License 2.0
This data would be fed both to the DAX API and to our DAX data previews. I propose this structure:
content:
  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    description: Raw data file
    records: 114546
    size: 30M
    type: CSV
    mime_type: text/csv
    column_types:
      STATION: str
      STATION_NAME: str
      ELEVATION: float
      LATITUDE: float
      ...
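A minimal consistency check for entries in this structure could look like the following sketch. The required-key set is an assumption drawn from the example above, not a finalized schema:

```python
# Hypothetical required keys for each `content` entry, taken from the
# example descriptor above; the real schema is still under discussion.
REQUIRED_KEYS = {"file_name", "description", "records", "size", "type", "mime_type"}

def missing_keys(entry: dict) -> list:
    """Return the required keys absent from a content entry, sorted."""
    return sorted(REQUIRED_KEYS - entry.keys())
```

For example, an entry containing only `file_name` would report `description`, `mime_type`, `records`, `size`, and `type` as missing.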
Currently, IDs use dashes instead of underscores.
Using underscores would potentially ease Python users' lives by allowing them to type IDs as attribute names (e.g. `dax.datasets.noaa_weather_data_jfk_airport`). If that is not possible, we would have to do a dash-to-underscore conversion on one end.
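The dash-to-underscore conversion mentioned above could be as simple as the following sketch. The helper names are illustrative, and the reverse mapping assumes original IDs never contain underscores themselves:

```python
def to_attr_name(dataset_id: str) -> str:
    """Map a dashed dataset ID to a Python-attribute-safe name."""
    return dataset_id.replace("-", "_")

def to_dataset_id(attr_name: str) -> str:
    """Reverse mapping; only safe if original IDs contain no underscores."""
    return attr_name.replace("_", "-")
```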
My review is written from the perspective of usage in OpenAIHub and of what end-users want in general.
Reference:
Comments:
I used only the JFK YAML for this review.
yaml.load(sys.arg...
=> yaml.load(Path(sys.arg...
Problems can be located by running git grep "yaml.load(sy" *
The metadata.name in the generated DLF YAML does not comply with the Kubernetes spec for DNS-1123 subdomain names.
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Dataset.com.ie.ibm.hpsys \"Finance Proposition Bank\" is invalid: metadata.name: Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
  "reason": "Invalid",
  "details": {
    "name": "Finance Proposition Bank",
    "group": "com.ie.ibm.hpsys",
    "kind": "Dataset",
    "causes": [
      {
        "reason": "FieldValueInvalid",
        "message": "Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
        "field": "metadata.name"
      }
    ]
  },
  "code": 422
}
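One possible fix is to sanitize the dataset title before emitting it as metadata.name. The following is a hedged sketch (the function name is illustrative, and the converter may ultimately choose a different normalization policy):

```python
import re

def to_dns1123_subdomain(name: str) -> str:
    """Convert a free-form title like 'Finance Proposition Bank'
    into a DNS-1123-compliant name for metadata.name."""
    s = name.lower()
    s = re.sub(r"[^a-z0-9.-]+", "-", s)  # replace disallowed characters
    s = re.sub(r"-{2,}", "-", s)         # collapse runs of dashes
    s = s.strip("-.")                    # must start/end alphanumeric
    return s[:253]                       # subdomain length limit
```

For the failing example above, this would yield `finance-proposition-bank`, which passes the validation regex quoted in the error message.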
Currently, all {{...}} placeholders in https://github.com/CODAIT/exchange-metadata-converter/tree/main/templates are considered required, so each placeholder input file must define them. Annotations would solve this issue. Investigate what it takes to support something like the following, where @annotation_key serves as a hint to the processing engine and does not invalidate the YAML file because it is specified as a comment:
property: '{{value}}' # just a comment
another:
  property2: '{{value2}}' # @optional and a comment
third_property: # comment only
fourth_property:
  fifth_property: '{{value6}}' # @annotation_only
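A first pass at extracting such annotations could scan comment text line by line, as in this sketch. The regex and function name are assumptions; a production implementation would more likely use a comment-preserving YAML parser such as ruamel.yaml:

```python
import re

# Matches a {{placeholder}} followed later on the line by a comment
# containing an @annotation key.
ANNOTATION = re.compile(r"\{\{(?P<placeholder>\w+)\}\}.*?#.*?@(?P<key>\w+)")

def scan_annotations(yaml_text: str) -> dict:
    """Map placeholder name -> annotation key for annotated lines."""
    result = {}
    for line in yaml_text.splitlines():
        m = ANNOTATION.search(line)
        if m:
            result[m.group("placeholder")] = m.group("key")
    return result
```

Applied to the example above, `value2` would map to `optional` and `value6` to `annotation_only`, while `value` (plain comment) would carry no annotation.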
The current proposal is to release a new version of ORSD that no longer has nested archives. Then we can use a structure like:
content:
  - file_name: data/SPE9-TRIANGLE.Aspect1/test
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect1/train
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_test
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_train
    ...
  - file_name: data/SPE9-TRIANGLE.Aspect3.compressed.h5
    ...
  - file_name: data/SPE9-MAX.Aspect1
    ...
  - file_name: data/SPE9-MAX.Aspect2
    ...
  - file_name: data/SPE9-MAX.Aspect3.compressed.h5
    ...
Keep in mind that the archive-level description field for the dataset will need to describe the content composition, e.g. "...contains two versions of the dataset: SPE9-TRIANGLE, which... and SPE9-MAX, which..."
Currently the example uses:
# TBD how to handle compound types (a data set comprising multiple files that use different formats)
format:
  type: CSV
  mime_type: text/csv
But as we know, some DAX datasets contain more than one subdataset format. Some potential solutions:
Information on how to locate all the files belonging to a certain subdataset is important for the DAX API and for how it handles loading subdatasets. Note that this is different from a subdataset's format, which is simply the file format of the subdataset.
Examples of subdataset types include train/test splits (indicated, for example, by train_ appended at the start of the filename).
We need to determine what subdataset types there are and how to include this information. The current proposal is this:
Simple file:
- file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
  ...
  format: CSV # rename from type to format
  ...
  type: path_name
  value: noaa-weather-data-jfk-airport/jfk_weather.csv
Regex:
- file_name: publaynet/train
  ...
  type: regex
  value: "train/*"
List of files:
- file_name: tfsc/train_list.txt
  ...
  type: list_of_files
  value: tfsc/train_list.txt
There's probably a better way of structuring this that avoids the file_name being the same as the value field in some cases, but it's a start.
Is https://pypi.org/project/ruamel.yaml/ better with respect to