mlcommons / croissant
Croissant is a high-level format for machine learning datasets that brings together four rich layers.
Home Page: https://mlcommons.org/croissant
License: Apache License 2.0
`ReadCsv` becomes `Read` and supports:
Currently we define sources and references using the same class:

```python
class Source:
    reference: tuple[str] = ()
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None
```

`Source.reference` is a tuple indicating the origin: the source UID is `reference[0]` and the column name is `reference[1]`.

Problems: `Source.reference` is ambiguous and hard to use. Instead, we would like something like:

```python
class Origin:
    origin_uid: str  # UID in the structure graph
    column: str | None = None  # column if the origin references a CSV or a JSON.
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None
```
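A minimal sketch of the difference in use, assuming plain dataclasses (the file and column values are invented for illustration):

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class Source:
    # Old style: the meaning of each tuple element is implicit.
    reference: tuple[str, ...] = ()

@dataclass
class Origin:
    # New style: each piece of information is a named attribute.
    origin_uid: str
    column: str | None = None

old = Source(reference=("ratings.csv", "user_id"))
new = Origin(origin_uid="ratings.csv", column="user_id")

# The tuple forces positional access...
assert old.reference[1] == "user_id"
# ...while the dataclass is self-describing.
assert new.column == "user_id"
```

The named attributes also make "no column" explicit (`column=None`) instead of relying on tuple length.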
After review by Pierre Ruyssen:
It would be nice to add a checksum (sha256). This is critical for libraries such as TFDS who want to verify the integrity of files before processing them, both from a security and correctness perspective.
We should also set contentSize, as this would make it easier to show how much download to expect, download progress, etc.
https://schema.org/sha256 is pending implementation feedback but I suppose we can use it. TFDS also uses MD5, and while re-computing the sha256 is possible, I don't know how much friction it would add to the various actors...
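A sketch of what computing these two values could look like for a local file (the helper name is invented; the keys mirror schema.org's `sha256` and `contentSize`):

```python
import hashlib
import os
import tempfile

def file_object_integrity(path: str) -> dict:
    """Computes schema.org-style sha256 and contentSize for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large dataset files don't load into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return {
        "sha256": digest.hexdigest(),
        "contentSize": f"{os.path.getsize(path)} B",
    }

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
    f.write(b"a,b\n1,2\n")
    path = f.name
info = file_object_integrity(path)
os.remove(path)
```

A consumer like TFDS could then compare the recorded `sha256` against the downloaded bytes before processing.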
TODO: the `.arff` or the `.pq` file?

Originally posted by @benjelloun in #34 (comment)
Omar, I'm assigning this to you to check whether external context files are supported by search engines.
Thanks!
This method traverses the structure graph to build the ComputationGraph.
Possible improvements: the `if` cases could be handled in independent functions.

At the moment, when loading the same dataset twice, we may not yield the elements in the same order.
To test this in python/ml_croissant/ml_croissant/_src/datasets_test.py, replace:

```python
for record in records:
    assert _there_exists_an_equal_dict(record, expected_records)
```

with:

```python
for i, record in enumerate(records):
    assert _dicts_are_equal(record, expected_records[i])
```
Example: generate big datasets like the C4 dataset.
In python/ml_croissant/ml_croissant/_src/datasets_test.py, we compare the loaded data to a golden pkl file.
This pkl file is binary, so it's not human-readable. Try to find a human-readable format, e.g., JSONL. It's not trivial because pandas values (like `Timestamp` or `nan`) are not directly JSON-serializable, but it should be easy enough to be a good first issue.
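Making such values JSONL-friendly can be sketched with stdlib stand-ins (`datetime.datetime` for `Timestamp`, `float("nan")` for `nan`; the encoder and helper names are invented):

```python
from __future__ import annotations

import datetime
import json
import math

class GoldenEncoder(json.JSONEncoder):
    """Serializes values that json.dumps rejects by default."""

    def default(self, o):
        if isinstance(o, datetime.datetime):
            return o.isoformat()
        return super().default(o)

def to_jsonl_line(record: dict) -> str:
    # NaN is accepted by Python's json module but is not standard JSON,
    # so normalize it to null explicitly.
    clean = {
        k: (None if isinstance(v, float) and math.isnan(v) else v)
        for k, v in record.items()
    }
    return json.dumps(clean, cls=GoldenEncoder, sort_keys=True)

line = to_jsonl_line(
    {"score": float("nan"), "t": datetime.datetime(2023, 1, 1), "title": None}
)
```

`sort_keys=True` keeps the golden file stable across runs, which matters for diffing in CI.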
Versioning includes a few problems:
Related discussion: #58
Related documentation to write: https://github.com/mlcommons/croissant/blob/main/docs/howto/versions.md
We need a Croissant to Croissant converter to introduce new syntaxes: the converter reads the old Croissant configs and produces configs using the latest syntax.
Such a tool would for example be useful in the context of:
Currently: `Issues` uses `assert` to check conditions and raise `AssertionError`. However, `assert` statements are not executed when Python is run with the optimize flag (`-O`). We should find a unified way to deal with issues.
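One possible unified shape, sketched with a hypothetical class loosely following the existing `issues.add_error` naming:

```python
from __future__ import annotations

class Issues:
    """Collects errors instead of asserting, so checks survive `python -O`."""

    def __init__(self) -> None:
        self.errors: list[str] = []

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def check(self, condition: bool, message: str) -> None:
        # Unlike `assert`, this always runs, even under the -O flag.
        if not condition:
            self.add_error(message)

issues = Issues()
issues.check(1 + 1 == 2, "arithmetic is broken")
issues.check("name" in {}, "missing required property: name")
```

Collecting errors rather than raising on the first one also lets the validator report everything wrong with a config in a single pass.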
In unit tests, compare pairs (input JSON, output in the command line) in golden files.
This also allows us to keep the documentation up to date with good/bad examples.
As described in the DCF spec, we want fields on the FileObject that are not part of the parent class (CreativeWork).
Moreover, the `name` should not contain special characters, enabling us to use it to identify this FileObject.
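A sketch of such a name check (the exact allowed character set is an assumption):

```python
import re

# Assumed rule: letters, digits, hyphens and underscores only.
_NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_name(name: str) -> bool:
    """True iff `name` can safely be used as an identifier for a FileObject."""
    return _NAME_RE.match(name) is not None

assert is_valid_name("train-images_01")
assert not is_valid_name("train images!")  # spaces and punctuation rejected
```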
`node.add_error`/`node.add_warning` give more context than `issues.add_error`/`issues.add_warning`.
Replace when possible in the whole codebase to have better error/warning messages.
In this discussion, we reached a consensus on how to represent enumerations at the RecordSet and Field level, by introducing an "isEnumeration" boolean property.
We should add that mechanism to the spec, examples, and validator.
*.ttl files are used by schema.org to:
Instead of doing this, at this point we manually extract properties by iterating over the RDF graph. This is a temporary solution to bootstrap the lib, but it should be replaced by ttl files.
There is already a howto about splits (https://github.com/mlcommons/croissant/blob/main/docs/howto/specify-splits.md) and an example (https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json).
However, we also want support for other types of partitions, namely dated partitions and languages (e.g., Wikipedia).
Currently there is no support for partitions in the validator / loader. We should make sure it is possible to retrieve a single (or a few) partition(s) and only download the required files. We should also make sure it is possible to retrieve many partitions (not just one language for example).
There is no existing howto page for partitions, but I think we need one.
Until now, we have used `rdflib.Graph` to parse the JSON into an RDF graph, which is then traversed.
We've used rdflib out of convenience. We could remove this additional complexity by 1) expanding all keys/values of the JSON using RDFLib, and 2) working directly on the JSON to explore the graph.
For instance, downloading several files can be parallelized.
In mlcommons/croissant/python/ml_croissant/ml_croissant/_src/datasets.py, use topological sort by generations:

```python
for generation in nx.topological_generations(operations):
    for operation in generation:
```
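The same generation-by-generation scheduling can be sketched with only the stdlib (`graphlib`), which also opens the door to running each generation's operations in parallel; the operation names below are invented, and the real graph uses networkx:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Toy operation graph (hypothetical names): each key depends on the
# operations in its value set.
dependencies = {
    "read_csv": {"download"},
    "extract_name": {"read_csv"},
    "extract_age": {"read_csv"},
}

executed = []
lock = threading.Lock()

def run(operation: str) -> None:
    with lock:
        executed.append(operation)

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor() as executor:
    while sorter.is_active():
        # One "generation": every operation whose dependencies are done.
        generation = sorter.get_ready()
        list(executor.map(run, generation))  # run the generation in parallel
        for operation in generation:
            sorter.done(operation)
```

Within a generation the completion order is nondeterministic, but generations themselves always run in dependency order, so e.g. all downloads finish before any read starts.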
Dataset authors should be able to specify splits (eg: train/test/validation) by which data is grouped, whether that is through grouping of the resources, or providing the information through a CSV column, some naming pattern, etc.
In datasets/simple-join/output.jsonl, the last line is:

```json
{"publications_by_user": {"title": null, "author_email": "[email protected]", "author_fullname": "Mary Jones"}}
```

We shouldn't have this line, as the title is `None`.
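The expected behavior can be sketched as a plain inner join that drops unmatched rows (record layout simplified from the example, with invented values):

```python
def drop_unmatched(records):
    """Drops joined records whose join produced no match (title is None)."""
    return [
        r for r in records
        if r["publications_by_user"]["title"] is not None
    ]

records = [
    {"publications_by_user": {"title": "A paper", "author_fullname": "John Smith"}},
    {"publications_by_user": {"title": None, "author_fullname": "Mary Jones"}},
]
joined = drop_unmatched(records)  # keeps only the matched record
```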
We want to follow NEP 29.
Currently, tests fail for Python==3.9. See comment in the CI YAML file.
In this discussion, a better representation of identifiers and references was proposed:
This proposal needs to be reflected in the spec and the examples, and supported in the validator.
Some advanced types (e.g., URLs) can be better type checked (e.g., URL has the right form).
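For URLs specifically, a stdlib-based check could look like this (the function name and the accepted schemes are assumptions):

```python
from urllib.parse import urlparse

def looks_like_url(value: str) -> bool:
    """Checks that a value has an http(s) scheme and a network location."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

assert looks_like_url("https://mlcommons.org/croissant")
assert not looks_like_url("not a url")
```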
https://github.com/mlcommons/croissant/blob/main/docs/howto/labels.md
Example: https://github.com/mlcommons/croissant/blob/main/datasets/titanic/metadata.json
Maybe we want to add a dataset with labels in the sense of classification.
Hi,
I couldn't find any information about it, but has the project considered the use of RO-Crate, which pretty much targets the same goals as Croissant and is also based on Schema.org?
Thanks!
Rename `ml_croissant` -> `mlcroissant`.
There is a bug when `dataType` is a list and the first element is not a base dataType (like `sc:Text`).
Example:

```json
"dataType": [
    "sc:Text",
    "wd:Q3985153"
],
```

`wd:Q3985153` is taken into account, but not `sc:Text`.
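A sketch of a fix that keeps every element of the list (the set of base dataTypes here is an invented stand-in, and the helper name is hypothetical):

```python
# Invented stand-in for the real set of base dataTypes.
BASE_DATA_TYPES = {"sc:Text", "sc:Integer", "sc:Float"}

def split_data_types(data_type):
    """Accepts a string or a list and keeps every element, base or not."""
    types = [data_type] if isinstance(data_type, str) else list(data_type)
    base = [t for t in types if t in BASE_DATA_TYPES]
    semantic = [t for t in types if t not in BASE_DATA_TYPES]
    return base, semantic

base, semantic = split_data_types(["sc:Text", "wd:Q3985153"])
assert base == ["sc:Text"]  # the base type is no longer dropped
assert semantic == ["wd:Q3985153"]
```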
`croissant-spec.md` doesn't document what a Dataset is according to Croissant.
Notably, it should document `distribution` and `recordSet`, and mandatory/optional attributes.
cc @benjelloun
A lot of methods in python/ml_croissant/ml_croissant/_src/structure_graph/graph.py share common data structures (issues, graph, folder, etc.), so they should be grouped under a common `StructureGraph` class.
In #100, I had to remove movielens from the CI.
Indeed, we need to implement the fact that `data_type` is optional when a field refers to another field with a type.
Add unit tests.
Remove the early return for MovieLens in python/ml_croissant/ml_croissant/_src/datasets.py.
Use the same structure as JAX for the Python files:
```
ml_croissant
|_ __init__.py  # Here main imports for the external API: atm only Dataset
|_ _src
   |_ dataset.py  # Contains `Dataset`
   |_ record.py  # Contains `Record`
   |_ data_type.py
   |_ issues.py
   |_ operation_graph
      |_ graph.py  # Contains `ComputationGraph`
      |_ operations
         |_ download.py
         |_ read.py
         |_ ...  # Other operations
   |_ structure_graph
      |_ graph.py  # Contains `build_structure_graph`
      |_ nodes
         |_ __init__.py  # Contains the main class `Node`
         |_ metadata.py
         |_ field.py
         |_ file_object.py
         |_ file_set.py
```

All test files sit right next to the file they test (e.g., `issues.py` is with `issues_test.py`).
Currently it's parsed as a string. In python/ml_croissant/ml_croissant/_src/structure_graph/graph.py, replace:

```python
if isinstance(_object, term.Literal):
    node_params[croissant_key] = str(_object)
```

with:

```python
if isinstance(_object, term.Literal):
    node_params[croissant_key].append(str(_object))
```
Dataset with bounding boxes: https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json
We should be able to get all node_params right without having to filter them.
WikiText introduces `applyTransform` as a list. As a consequence, I had to remove it from the CI in #100, as this had to be implemented separately.
The change has to be done in https://github.com/mlcommons/croissant/blob/main/python/ml_croissant/ml_croissant/_src/structure_graph/nodes/source.py.
This could be a good moment to add unit tests for python/ml_croissant/ml_croissant/_src/structure_graph/nodes/source.py.
cc @ccl-core
To help users understand Croissant dataset descriptions, it would be useful to generate diagrams that represent the contents of a dataset.
A croissant diagram could consist of two layers:
Such diagrams can be included in the documentation of a dataset or in the croissant dataset viewer.
To generate them, we can rely on an existing package such as mermaid or nomnoml. These packages rely on a textual representation of the diagram, which can easily be generated from the validator, based on the object representation it creates when parsing a croissant dataset.
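As a sketch of the mermaid route, the textual diagram can be generated from (source, target) edges of the parsed graph (node names invented for illustration):

```python
def to_mermaid(edges):
    """Renders (source, target) pairs as a mermaid flowchart definition."""
    lines = ["graph TD"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)

# Hypothetical FileObject -> RecordSet edges from a parsed dataset.
diagram = to_mermaid([("ratings_csv", "ratings"), ("movies_csv", "movies")])
```

The resulting text can be pasted into any mermaid renderer, or embedded directly in a markdown doc.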
A tool/library to parse a DCF config and list errors, warnings, and suggestions for improvements.
For example: a `RecordSet`'s `source` field set to a non-existing file should raise an error (e.g.: "source X does not exist, did you mean Y?").
Up for debate, but it might be worth having different levels of errors/warnings, so some errors/warnings can be ignored in some contexts (e.g., while loading a DCF config to read a dataset) and raised in others (e.g., while validating a new DCF config).
The checker tool/library should be reusable from different entry points as much as possible (CLI, web UI, library).
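The "did you mean Y?" part can be sketched with `difflib` from the stdlib (the function name and message format are assumptions):

```python
from __future__ import annotations

import difflib

def check_source(source: str, known_sources: list[str]) -> str | None:
    """Returns an error message with a suggestion, or None if the source exists."""
    if source in known_sources:
        return None
    message = f"source {source} does not exist"
    # Suggest the closest known source, if any is similar enough.
    suggestions = difflib.get_close_matches(source, known_sources, n=1)
    if suggestions:
        message += f", did you mean {suggestions[0]}?"
    return message

assert check_source("ratings", ["ratings", "movies"]) is None
```

Because the function returns a message instead of raising, callers can decide per entry point (CLI, web UI, library) whether to treat it as fatal.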