mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.

Home Page: https://mlcommons.org/croissant

License: Apache License 2.0

Languages: Python 49.48%, Makefile 0.16%, Jupyter Notebook 38.28%, HTML 9.85%, JavaScript 1.18%, Shell 0.10%, TypeScript 0.89%, Dockerfile 0.04%

Topics: datasets, json-ld, machine-learning, schema-org

croissant's People

Contributors

aidazolic, benjelloun, bollacker, ccl-core, dependabot[bot], dominik-kuhn, goeffthomas, guschmue, joangi, josvandervelde, luisoala, marcenacp, mkuchnik, monke6942021, morphine00, nathanw-mlc, petermattson, pgijsbers, pierrot0, st3v0bay, thekanter


croissant's Issues

New fields for FileObject

As described in the DCF spec, we want fields on the FileObject that are not part of the parent class (CreativeWork).

  • containedIn
  • contentUrl
  • contentSize
  • sameAs

Moreover, the name should not contain special characters, so that it can be used as an identifier for the FileObject.
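
A hedged sketch of what a FileObject carrying these fields could look like (shown as a Python dict; the property names come from the list above, while all values and the containedIn usage are illustrative):

# Illustrative only: property names from the proposal, made-up values.
file_object = {
    "@type": "sc:FileObject",
    "name": "train_images",  # no special characters, so it can serve as an identifier
    "contentUrl": "https://example.org/data/train_images.zip",
    "contentSize": "1024 MB",
    "sameAs": "https://example.org/mirrors/train_images.zip",
}

# Another resource extracted from the archive could point back via containedIn:
extracted_images = {
    "@type": "sc:FileSet",
    "name": "image-files",
    "containedIn": "train_images",
}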

Relation to RO-Crate

Hi,

I couldn't find any information about it, but has the project considered using RO-Crate, which pretty much targets the same goals as Croissant and is also based on Schema.org?

Thanks!

Croissant diagram

To help users understand Croissant dataset descriptions, it would be useful to generate diagrams that represent the contents of a dataset.

A croissant diagram could consist of two layers:

  1. (Bottom) Resources: a graphical rendition of the FileObject and FileSet contents of a dataset, with links that represent their dependencies (e.g., a FileSet of images extracted from an archive FileObject).
  2. (Top) The RecordSets defined in the dataset, with the Field entries they contain, and links to the sources of their data (FileObject and FileSet in the resources layer).

Such diagrams can be included in the documentation of a dataset or in the croissant dataset viewer.

To generate them, we can rely on an existing package such as mermaid or nomnoml. These packages rely on a textual representation of the diagram, which can easily be generated from the validator, based on the object representation it creates when parsing a croissant dataset.
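
A minimal sketch of the generation step, assuming the validator exposes parsed objects with name, contained_in, fields, and source attributes (all of these attribute names are assumptions):

def to_mermaid(file_objects, file_sets, record_sets):
    """Renders the two-layer diagram as mermaid flowchart text."""
    lines = ["flowchart BT"]
    # Bottom layer: resources and the dependencies between them.
    for file_object in file_objects:
        lines.append(f'  {file_object.name}["FileObject: {file_object.name}"]')
    for file_set in file_sets:
        lines.append(f'  {file_set.name}["FileSet: {file_set.name}"]')
        lines.append(f"  {file_set.contained_in} --> {file_set.name}")
    # Top layer: RecordSets with their Fields, linked to their data sources.
    for record_set in record_sets:
        for field in record_set.fields:
            field_id = f"{record_set.name}_{field.name}"
            lines.append(f'  {field_id}["{record_set.name}/{field.name}"]')
            lines.append(f"  {field.source} --> {field_id}")
    return "\n".join(lines)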

Cleaner unit tests in golden files

In unit tests, compare pairs (input JSON, command-line output) stored in golden files.

This also allows us to keep up-to-date documentation with good/bad examples.
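
One possible shape for such a test, assuming golden files live in a testdata/ folder pairing each input JSON with the expected output (the validator invocation and paths are placeholders):

import pathlib
import subprocess

import pytest

_GOLDEN_DIR = pathlib.Path(__file__).parent / "testdata"

@pytest.mark.parametrize("json_path", sorted(_GOLDEN_DIR.glob("*.json")))
def test_output_matches_golden(json_path):
    # Each input.json sits next to an input.out golden file.
    expected = json_path.with_suffix(".out").read_text()
    result = subprocess.run(
        ["python", "-m", "validator", "--file", str(json_path)],
        capture_output=True,
        text=True,
    )
    assert result.stdout == expected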

OpenMLConverter: Add contentSize and hash

After review by Pierre Ruyssen:
It would be nice to add a checksum (sha256). This is critical for libraries such as TFDS, which want to verify the integrity of files before processing them, from both a security and a correctness perspective.

We should also set contentSize, as this would make it easier to show how much download to expect, display download progress, etc. (see the sketch below the TODO list).

https://schema.org/sha256 is pending implementation feedback but I suppose we can use it. TFDS also uses MD5, and while re-computing the sha256 is possible, I don't know how much friction it would add to the various actors...

TODO:

  • check if we can return contentSize from OpenML (OpenML currently does not give this info)
  • check which checksum is returned by OpenML. Is it the .arff or the .pq file?
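
A minimal sketch of computing both values once the converter has the file locally (the helper name is illustrative):

import hashlib
import os

def file_metadata(path: str) -> dict:
    # Stream the file so large datasets don't have to fit in memory.
    sha256 = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "contentSize": f"{os.path.getsize(path)} B",
        "sha256": sha256.hexdigest(),
    }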

Human-readable pkl files in tests

In python/ml_croissant/ml_croissant/_src/datasets_test.py, we compare the loaded data to a golden pkl file.

This pkl file is binary, so it's not human-readable. Try to find a human-readable format instead, e.g., JSONL. It's not trivial because pandas values (like Timestamp or nan) are not directly serializable to JSON, but it should be easy enough to be a good first issue.
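
A sketch of one possible direction, assuming each record is a flat dict of pandas/NumPy scalars (the encoder and its fallbacks are illustrative):

import json
import math

import pandas as pd

class _RecordEncoder(json.JSONEncoder):
    # Fall back to string representations for values json can't handle natively.
    def default(self, obj):
        if isinstance(obj, pd.Timestamp):
            return obj.isoformat()
        return str(obj)

def to_jsonl(records) -> str:
    lines = []
    for record in records:
        # json.dumps writes float("nan") as NaN, which is not valid JSON; map it to null.
        clean = {
            key: (None if isinstance(value, float) and math.isnan(value) else value)
            for key, value in record.items()
        }
        lines.append(json.dumps(clean, cls=_RecordEncoder, sort_keys=True))
    return "\n".join(lines)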

DCF format checker

A tool/library to parse a DCF config and list errors, warnings, and suggestions for improvements.

For example:

  • having a RecordSet's source field set to a nonexistent file should raise an error (e.g., "source X does not exist, did you mean Y?").
  • not specifying a file size should be a warning, and should come with a suggestion to add the size with reason(s) (e.g., "specifying a file size - even if approximate - allows tools to predict download time and display progress to users", "specifying a file's exact size allows tools to more safely check the file integrity, together with the file checksum").

Up for debate, but it might be worth having different levels of errors/warnings, so that some errors/warnings can be ignored in some contexts (e.g., while loading a DCF config to read a dataset) and raised in others (e.g., while validating a new DCF config).

The checker tool/library should be reusable from as many entry points as possible (CLI, web UI, library).
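
A sketch of how the severity levels could be modeled so that each entry point decides what to enforce (all names are hypothetical):

import dataclasses
import enum

class Severity(enum.Enum):
    SUGGESTION = 1
    WARNING = 2
    ERROR = 3

@dataclasses.dataclass(frozen=True)
class Finding:
    severity: Severity
    message: str

def should_fail(findings: list[Finding], threshold: Severity) -> bool:
    # A loader might pass threshold=Severity.ERROR, while a validation CLI
    # for new configs might pass threshold=Severity.WARNING.
    return any(finding.severity.value >= threshold.value for finding in findings)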

Joins are well handled

In datasets/simple-join/output.jsonl, the last line is:

{"publications_by_user": {"title": null, "author_email": "[email protected]", "author_fullname": "Mary Jones"}}

We shouldn't have this line as the title is None.
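
For context, this is the difference between left-join and inner-join semantics; a pandas sketch with illustrative data (the email is a placeholder):

import pandas as pd

# One user with no matching publication.
users = pd.DataFrame({"email": ["mary@example.com"], "fullname": ["Mary Jones"]})
publications = pd.DataFrame(
    {"title": pd.Series(dtype=str), "author_email": pd.Series(dtype=str)}
)

# A left join keeps the user with a null title; an inner join drops the row,
# which is the behavior this issue asks for.
left = users.merge(publications, left_on="email", right_on="author_email", how="left")
inner = users.merge(publications, left_on="email", right_on="author_email", how="inner")
assert len(left) == 1 and len(inner) == 0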

Structure of the Python files

Use the same structure as JAX for the Python files:

ml_croissant
  |_ __init__.py  # Main imports for the external API; currently only Dataset
  |_ _src
    |_ dataset.py  # Contains `Dataset`
    |_ record.py  # Contains `Record`
    |_ data_type.py
    |_ issues.py
    |_ operation_graph
      |_ graph.py  # Contains `ComputationGraph`
      |_ operations
        |_ download.py
        |_ read.py
        |_ ...  # Other operations
    |_ structure_graph
      |_ graph.py  # Contains `build_structure_graph`
      |_ nodes
        |_ __init__.py  # Contains the main class `Node`
        |_ metadata.py
        |_ field.py
        |_ file_object.py
        |_ file_set.py

All test files sit right next to the file they test (e.g., issues_test.py next to issues.py).

Manage issues for both static and dynamic analysis

Currently:

  • During static analysis: we gather as many issues as possible by populating an Issue object.
  • During dynamic analysis: we use assert statements, which raise AssertionError. assert statements are not executed when Python is run with the optimize flag (-O).

We should find a unified way to deal with issues.
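
One unified approach would be to route dynamic checks through the same issues object used by static analysis, instead of bare asserts (the API below is a simplified sketch):

class Issues:
    def __init__(self):
        self.errors: list[str] = []

    def add_error(self, message: str):
        self.errors.append(message)

    def check(self, condition: bool, message: str):
        # Unlike `assert`, this still runs under `python -O` and accumulates
        # errors instead of stopping at the first failure.
        if not condition:
            self.add_error(message)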

Parallelize downloads when possible

For instance, downloading several files can be parallelized.

In mlcommons/croissant/python/ml_croissant/ml_croissant/_src/datasets.py, use topological sort by generations:

        for generation in nx.topological_generations(operations):
            for operation in generation:
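
Each generation contains operations with no dependencies between them, so a generation can run concurrently as long as we wait for it to finish before starting the next one. A sketch with a thread pool, assuming each node in the graph is callable:

import concurrent.futures

import networkx as nx

def execute(operations: nx.DiGraph):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for generation in nx.topological_generations(operations):
            # Operations within a generation are independent: run them in parallel.
            futures = [executor.submit(operation) for operation in generation]
            for future in concurrent.futures.as_completed(futures):
                future.result()  # re-raise any exception from the worker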

Partition support

There is already a howto about splits (https://github.com/mlcommons/croissant/blob/main/docs/howto/specify-splits.md) and an example (https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json).

However, we also want support for other types of partitions, namely dated partitions and languages (e.g., Wikipedia).

Currently there is no support for partitions in the validator / loader. We should make sure it is possible to retrieve a single (or a few) partition(s) and only download the required files. We should also make sure it is possible to retrieve many partitions (not just one language for example).

There is no existing howto page for partitions, but I think we need one.
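
A hypothetical user-facing API for this (nothing below exists yet; the partitions argument is the proposed addition and all names are placeholders):

from ml_croissant import Dataset

# Retrieve a couple of language partitions, downloading only the required files.
dataset = Dataset("wikipedia/metadata.json")
records = dataset.records("articles", partitions={"language": ["fr", "de"]})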

Technical debt: Class StructureGraph

A lot of methods in python/ml_croissant/ml_croissant/_src/structure_graph/graph.py share common data structures (issues, graph, folder, etc.), so they should be grouped under a common StructureGraph class.
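
A sketch of that grouping (attribute names taken from the shared data structures listed above; the graph type is an assumption):

import dataclasses
import pathlib

import networkx as nx

@dataclasses.dataclass
class StructureGraph:
    # Shared state currently threaded through module-level functions.
    issues: "Issues"
    graph: nx.DiGraph
    folder: pathlib.Path

    # The existing functions become methods reading self.issues, self.graph
    # and self.folder instead of taking them as parameters.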

Distinguish `sources` from `references` in the codebase

Currently we define sources and references using the same class:

class Source:
    reference: tuple[str, ...] = ()
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None

Source.reference is a tuple indicating:

  • Another field, whose UID is the concatenation of all the strings in reference.
  • A CSV/JSON column, where the UID is reference[0] and the column name is reference[1].

Problems:

  • Naming: sources and references are here used with very different meanings.
  • Source.reference is ambiguous and hard to use.

Instead, we would like something like:

class Origin:
    origin_uid: str  # UID in the structure graph
    column: str | None = None  # column if the origin references a CSV or a JSON.
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None

dataType should accept a list

Currently it's parsed as a string. In python/ml_croissant/ml_croissant/_src/structure_graph/graph.py, replace:

            if isinstance(_object, term.Literal):
                node_params[croissant_key] = str(_object)

by

            if isinstance(_object, term.Literal):
                # Assumes node_params[croissant_key] was initialized as a list
                # (e.g., via collections.defaultdict(list)).
                node_params[croissant_key].append(str(_object))

Generated datasets should be reproducible

At the moment, when loading the same dataset twice, we may not yield the elements in the same order.

To test this in python/ml_croissant/ml_croissant/_src/datasets_test.py, replace:

    for record in records:
        assert _there_exists_an_equal_dict(record, expected_records)

by

    for i, record in enumerate(records):
        assert _dicts_are_equal(record, expected_records[i])

Croissant to Croissant converter

We need a Croissant to Croissant converter to introduce new syntaxes: the converter reads the old Croissant configs and produces configs using the latest syntax.

Such a tool would be useful, for example, in the context of:

  • #51, to quickly update all configs as we iterate on reference syntax, or
  • #58 and #91

Refactor ComputationGraph.from_nodes

This method traverses the structure graph to build the ComputationGraph.

Possible improvements:

  • Split it into several smaller functions. Notably, each if case could be handled in an independent function.
  • Traverse the graph more optimally to avoid manually generated if cases.

Add support for splits to Croissant format

Dataset authors should be able to specify splits (e.g., train/test/validation) by which data is grouped, whether that is through grouping of the resources, providing the information through a CSV column, some naming pattern, etc.

Add support for enumerations

In this discussion, we reached a consensus on how to represent enumerations at the RecordSet and Field level, by introducing an "isEnumeration" boolean property.

We should add that mechanism to the spec, examples, and validator.
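
A sketch of what that could look like (JSON shown as a Python dict; only the isEnumeration property comes from the consensus above, the rest is illustrative):

# Illustrative RecordSet holding the allowed values of a "split" enumeration.
split_record_set = {
    "@type": "ml:RecordSet",
    "name": "split_values",
    "isEnumeration": True,
    "field": [
        {"@type": "ml:Field", "name": "name", "dataType": "sc:Text"},
    ],
}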

dataType can be a list in the Python library

There is a bug when dataType is a list and the first element is not a base dataType (like sc:Text).

Example:

          "dataType": [
            "sc:Text",
            "wd:Q3985153"
          ],

wd:Q3985153 is taken into account, but not sc:Text.

Clean up representation of identifiers and references

In this discussion, a better representation of identifiers and references was proposed:

  • Dropping the "#{}" syntax for references, since it's always clear from the context that this is a reference.
  • Splitting the container / field references into separate structured properties under "source".
  • Using a "selector" property to specify the selection method (e.g., CSV column, JSON query, etc.).

This proposal needs to be reflected in the spec and the examples, and supported in the validator.
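
A before/after sketch of a field's source (shown as Python dicts; the exact property names, like dataSource and csvColumn, are assumptions still under discussion):

# Before: a single string reference using the "#{}" syntax.
before = {"source": "#{ratings/user_id}"}

# After: structured properties under "source", with a selector for the
# selection method (e.g., a CSV column).
after = {
    "source": {
        "dataSource": "ratings",
        "selector": {"csvColumn": "user_id"},
    }
}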

Use *.ttl files

*.ttl files are used by schema.org to:

  • Generate the HTML documentation
  • Declare new types that are not defined in schema.org.

For now, we instead manually extract properties by iterating over the RDF graph. This is a temporary solution to bootstrap the library and should be replaced by *.ttl files.
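
For reference, rdflib can already load such definitions directly, so the manual extraction could eventually be replaced by something like this (the file name is hypothetical):

import rdflib

# Load type and property declarations from a Turtle file instead of
# hand-extracting them from the RDF graph.
graph = rdflib.Graph()
graph.parse("croissant.ttl", format="turtle")
for subject, predicate, obj in graph:
    print(subject, predicate, obj)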

Remove the usage of rdflib.Graph and parse the JSON directly.

Until now, we have used rdflib.Graph to parse the JSON into an RDF graph, which we then traverse.

We've used rdflib out of convenience. We could remove this additional complexity by 1) expanding all keys/values of the JSON, and 2) working directly on the JSON to explore the graph.
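
A sketch of step 1, using JSON-LD expansion so that every key becomes a full IRI and the document stays plain JSON (pyld is one possible choice of library here):

import json

from pyld import jsonld

with open("metadata.json") as file:
    document = json.load(file)

# Expansion resolves the @context; the result is still plain JSON (a list of
# node objects) that can be explored directly, without building an RDF graph.
expanded = jsonld.expand(document)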
