mlcommons / croissant
Croissant is a high-level format for machine learning datasets that brings together four rich layers.
Home Page: https://mlcommons.org/croissant
License: Apache License 2.0
`ReadCsv` becomes `Read` and supports:
Currently we define sources and references using the same class:

```python
class Source:
    reference: tuple[str] = ()
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None
```

`Source.reference` is a tuple indicating the origin: the source UID is `reference[0]` and the column name is `reference[1]`.

Problems: `Source.reference` is ambiguous and hard to use. Instead, we would like something like:

```python
class Origin:
    origin_uid: str  # UID in the structure graph
    column: str | None = None  # column if the origin references a CSV or a JSON.
    apply_transform_regex: str | None = None
    apply_transform_separator: str | None = None
```
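A minimal sketch of the difference in use, assuming plain dataclasses (the file and column values are invented for illustration):

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class Source:
    # Old style: the meaning of each tuple element is implicit.
    reference: tuple[str, ...] = ()

@dataclass
class Origin:
    # New style: each piece of information is a named attribute.
    origin_uid: str
    column: str | None = None

old = Source(reference=("ratings.csv", "user_id"))
new = Origin(origin_uid="ratings.csv", column="user_id")

# The tuple forces positional access...
assert old.reference[1] == "user_id"
# ...while the dataclass is self-describing.
assert new.column == "user_id"
```

The named attributes also make "no column" explicit (`column=None`) instead of relying on tuple length.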
After review by Pierre Ruyssen:
It would be nice to add a checksum (sha256). This is critical for libraries such as TFDS who want to verify the integrity of files before processing them, both from a security and correctness perspective.
We should also set contentSize, as this would make it easier to show how much download to expect, download progress, etc.
https://schema.org/sha256 is pending implementation feedback but I suppose we can use it. TFDS also uses MD5, and while re-computing the sha256 is possible, I don't know how much friction it would add to the various actors...
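A sketch of what computing these two values could look like for a local file (the helper name is invented; the keys mirror schema.org's `sha256` and `contentSize`):

```python
import hashlib
import os
import tempfile

def file_object_integrity(path: str) -> dict:
    """Computes schema.org-style sha256 and contentSize for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large dataset files don't load into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return {
        "sha256": digest.hexdigest(),
        "contentSize": f"{os.path.getsize(path)} B",
    }

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
    f.write(b"a,b\n1,2\n")
    path = f.name
info = file_object_integrity(path)
os.remove(path)
```

A consumer like TFDS could then compare the recorded `sha256` against the downloaded bytes before processing.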
TODO: the `.arff` or the `.pq` file?

Originally posted by @benjelloun in #34 (comment)
Omar, I'm assigning this to you to check whether external context files are supported by search engines.
Thanks!
This method traverses the structure graph to build the ComputationGraph.
Possible improvements: the `if` cases could be handled in independent functions.

At the moment, when loading the same dataset twice, we may not yield the elements in the same order.
To test this in python/ml_croissant/ml_croissant/_src/datasets_test.py, replace:

```python
for record in records:
    assert _there_exists_an_equal_dict(record, expected_records)
```

with:

```python
for i, record in enumerate(records):
    assert _dicts_are_equal(record, expected_records[i])
```
Example: generate big datasets like the C4 dataset.
In python/ml_croissant/ml_croissant/_src/datasets_test.py, we compare the loaded data to a golden pkl file.
This pkl file is binary, so it's not human-readable. Try to find a human-readable format, e.g., JSONL. It's not trivial because pandas values (like `Timestamp` or `nan`) are not directly JSON-serializable, but it should be easy enough to be a good first issue.
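Making such values JSONL-friendly can be sketched with stdlib stand-ins (`datetime.datetime` for `Timestamp`, `float("nan")` for `nan`; the encoder and helper names are invented):

```python
from __future__ import annotations

import datetime
import json
import math

class GoldenEncoder(json.JSONEncoder):
    """Serializes values that json.dumps rejects by default."""

    def default(self, o):
        if isinstance(o, datetime.datetime):
            return o.isoformat()
        return super().default(o)

def to_jsonl_line(record: dict) -> str:
    # NaN is accepted by Python's json module but is not standard JSON,
    # so normalize it to null explicitly.
    clean = {
        k: (None if isinstance(v, float) and math.isnan(v) else v)
        for k, v in record.items()
    }
    return json.dumps(clean, cls=GoldenEncoder, sort_keys=True)

line = to_jsonl_line(
    {"score": float("nan"), "t": datetime.datetime(2023, 1, 1), "title": None}
)
```

`sort_keys=True` keeps the golden file stable across runs, which matters for diffing in CI.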
Versioning includes a few problems:
Related discussion: #58
Related documentation to write: https://github.com/mlcommons/croissant/blob/main/docs/howto/versions.md
We need a Croissant to Croissant converter to introduce new syntaxes: the converter reads the old Croissant configs and produces configs using the latest syntax.
Such a tool would for example be useful in the context of:
Currently: `Issues` uses `assert` to check conditions and raise `AssertionError`. However, `assert` statements are not executed when Python is run with the optimize flag (`-O`). We should find a unified way to deal with issues.
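One possible unified shape, sketched with a hypothetical class loosely following the existing `issues.add_error` naming:

```python
from __future__ import annotations

class Issues:
    """Collects errors instead of asserting, so checks survive `python -O`."""

    def __init__(self) -> None:
        self.errors: list[str] = []

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def check(self, condition: bool, message: str) -> None:
        # Unlike `assert`, this always runs, even under the -O flag.
        if not condition:
            self.add_error(message)

issues = Issues()
issues.check(1 + 1 == 2, "arithmetic is broken")
issues.check("name" in {}, "missing required property: name")
```

Collecting errors rather than raising on the first one also lets the validator report everything wrong with a config in a single pass.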
In unit tests, compare pairs (input JSON, output in the command line) in golden files.
This also allows us to keep the documentation up to date with good/bad examples.
As described in the DCF spec, we want fields on the FileObject that are not part of the parent class (CreativeWork).
Moreover, the `name` should not contain special characters, enabling us to use it to identify this FileObject.
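A sketch of such a name check (the exact allowed character set is an assumption):

```python
import re

# Assumed rule: letters, digits, hyphens and underscores only.
_NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_name(name: str) -> bool:
    """True iff `name` can safely be used as an identifier for a FileObject."""
    return _NAME_RE.match(name) is not None

assert is_valid_name("train-images_01")
assert not is_valid_name("train images!")  # spaces and punctuation rejected
```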
`node.add_error`/`node.add_warning` give more context than `issues.add_error`/`issues.add_warning`.
Replace when possible in the whole codebase to have better error/warning messages.
In this discussion, we reached a consensus on how to represent enumerations at the RecordSet and Field level, by introducing an "isEnumeration" boolean property.
We should add that mechanism to the spec, examples, and validator.
*.ttl files are used by schema.org to:
Instead of doing this, at this point we manually extract properties by iterating over the RDF graph. This is a temporary solution to bootstrap the lib, but it should be replaced by ttl files.
There is already a howto about splits (https://github.com/mlcommons/croissant/blob/main/docs/howto/specify-splits.md) and an example (https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json).
However, we also want support for other types of partitions, namely dated partitions and languages (e.g., Wikipedia).
Currently there is no support for partitions in the validator / loader. We should make sure it is possible to retrieve a single (or a few) partition(s) and only download the required files. We should also make sure it is possible to retrieve many partitions (not just one language for example).
There is no existing howto page for partitions, but I think we need one.
Until now, we have used `rdflib.Graph` to parse the JSON into an RDF graph, which is then traversed.
We've used rdflib out of convenience. We could remove this additional complexity by 1) expanding all keys/values of the JSON using RDFLib, and 2) working directly on the JSON to explore the graph.
For instance, downloading several files can be parallelized.
In mlcommons/croissant/python/ml_croissant/ml_croissant/_src/datasets.py, use topological sort by generations:

```python
for generation in nx.topological_generations(operations):
    for operation in generation:
```
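The same generation-by-generation scheduling can be sketched with only the stdlib (`graphlib`), which also opens the door to running each generation's operations in parallel; the operation names below are invented, and the real graph uses networkx:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Toy operation graph (hypothetical names): each key depends on the
# operations in its value set.
dependencies = {
    "read_csv": {"download"},
    "extract_name": {"read_csv"},
    "extract_age": {"read_csv"},
}

executed = []
lock = threading.Lock()

def run(operation: str) -> None:
    with lock:
        executed.append(operation)

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor() as executor:
    while sorter.is_active():
        # One "generation": every operation whose dependencies are done.
        generation = sorter.get_ready()
        list(executor.map(run, generation))  # run the generation in parallel
        for operation in generation:
            sorter.done(operation)
```

Within a generation the completion order is nondeterministic, but generations themselves always run in dependency order, so e.g. all downloads finish before any read starts.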
Dataset authors should be able to specify splits (eg: train/test/validation) by which data is grouped, whether that is through grouping of the resources, or providing the information through a CSV column, some naming pattern, etc.
In datasets/simple-join/output.jsonl, the last line is:

```json
{"publications_by_user": {"title": null, "author_email": "[email protected]", "author_fullname": "Mary Jones"}}
```

We shouldn't have this line, as the title is `None`.
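The expected behavior can be sketched as a plain inner join that drops unmatched rows (record layout simplified from the example, with invented values):

```python
def drop_unmatched(records):
    """Drops joined records whose join produced no match (title is None)."""
    return [
        r for r in records
        if r["publications_by_user"]["title"] is not None
    ]

records = [
    {"publications_by_user": {"title": "A paper", "author_fullname": "John Smith"}},
    {"publications_by_user": {"title": None, "author_fullname": "Mary Jones"}},
]
joined = drop_unmatched(records)  # keeps only the matched record
```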
We want to follow NEP 29.
Currently, tests fail for Python==3.9. See comment in the CI YAML file.
In this discussion, a better representation of identifiers and references was proposed:
This proposal needs to be reflected in the spec and the examples, and supported in the validator.
Some advanced types (e.g., URLs) can be better type checked (e.g., URL has the right form).
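For URLs specifically, a stdlib-based check could look like this (the function name and the accepted schemes are assumptions):

```python
from urllib.parse import urlparse

def looks_like_url(value: str) -> bool:
    """Checks that a value has an http(s) scheme and a network location."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

assert looks_like_url("https://mlcommons.org/croissant")
assert not looks_like_url("not a url")
```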
https://github.com/mlcommons/croissant/blob/main/docs/howto/labels.md
Example: https://github.com/mlcommons/croissant/blob/main/datasets/titanic/metadata.json
Maybe we want to add a dataset with labels in the sense of classification.
Hi,
I couldn't find any information about it, but has the project considered the use of RO-Crate, which pretty much targets the same goals as Croissant and is also based on Schema.org?
Thanks!
Rename `ml_croissant` -> `mlcroissant`.
There is a bug when `dataType` is a list and the first element is not a base dataType (like `sc:Text`).
Example:

```json
"dataType": [
    "sc:Text",
    "wd:Q3985153"
],
```

`wd:Q3985153` is taken into account, but not `sc:Text`.
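A sketch of a fix that keeps every element of the list (the set of base dataTypes here is an invented stand-in, and the helper name is hypothetical):

```python
# Invented stand-in for the real set of base dataTypes.
BASE_DATA_TYPES = {"sc:Text", "sc:Integer", "sc:Float"}

def split_data_types(data_type):
    """Accepts a string or a list and keeps every element, base or not."""
    types = [data_type] if isinstance(data_type, str) else list(data_type)
    base = [t for t in types if t in BASE_DATA_TYPES]
    semantic = [t for t in types if t not in BASE_DATA_TYPES]
    return base, semantic

base, semantic = split_data_types(["sc:Text", "wd:Q3985153"])
assert base == ["sc:Text"]  # the base type is no longer dropped
assert semantic == ["wd:Q3985153"]
```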
`croissant-spec.md` doesn't document what a Dataset is according to Croissant.
Notably, it should document `distribution` and `recordSet`, and mandatory/optional attributes.
cc @benjelloun
A lot of methods in python/ml_croissant/ml_croissant/_src/structure_graph/graph.py share common data structures (issues, graph, folder, etc.), so they should be grouped under a common `StructureGraph` class.
In #100, I had to remove movielens from the CI.
Indeed, we need to implement the fact that `data_type` is optional when a field refers to another field with a type.
Add unit tests.
Remove the early return for MovieLens in python/ml_croissant/ml_croissant/_src/datasets.py.
Use the same structure as JAX for the Python files:
```
ml_croissant
|_ __init__.py  # Here main imports for the external API: atm only Dataset
|_ _src
   |_ dataset.py  # Contains `Dataset`
   |_ record.py  # Contains `Record`
   |_ data_type.py
   |_ issues.py
   |_ operation_graph
      |_ graph.py  # Contains `ComputationGraph`
      |_ operations
         |_ download.py
         |_ read.py
         |_ ...  # Other operations
   |_ structure_graph
      |_ graph.py  # Contains `build_structure_graph`
      |_ nodes
         |_ __init__.py  # Contains the main class `Node`
         |_ metadata.py
         |_ field.py
         |_ file_object.py
         |_ file_set.py
```

All test files sit right next to the file they test (e.g., `issues.py` is with `issues_test.py`).
Currently it's parsed as a string. In python/ml_croissant/ml_croissant/_src/structure_graph/graph.py, replace:

```python
if isinstance(_object, term.Literal):
    node_params[croissant_key] = str(_object)
```

with:

```python
if isinstance(_object, term.Literal):
    node_params[croissant_key].append(str(_object))
```
Dataset with bounding boxes: https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json
We should be able to get all node_params right without having to filter them.
WikiText introduces `applyTransform` as a list. As a consequence, I had to remove it from the CI in #100, as this had to be implemented separately.
The change has to be done in https://github.com/mlcommons/croissant/blob/main/python/ml_croissant/ml_croissant/_src/structure_graph/nodes/source.py.
This could be a good moment to add unit tests for python/ml_croissant/ml_croissant/_src/structure_graph/nodes/source.py.
cc @ccl-core
To help users understand Croissant dataset descriptions, it would be useful to generate diagrams that represent the contents of a dataset.
A croissant diagram could consist of two layers:
Such diagrams can be included in the documentation of a dataset or in the croissant dataset viewer.
To generate them, we can rely on an existing package such as mermaid or nomnoml. These packages rely on a textual representation of the diagram, which can easily be generated from the validator, based on the object representation it creates when parsing a croissant dataset.
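As a sketch of the mermaid route, the textual diagram can be generated from (source, target) edges of the parsed graph (node names invented for illustration):

```python
def to_mermaid(edges):
    """Renders (source, target) pairs as a mermaid flowchart definition."""
    lines = ["graph TD"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)

# Hypothetical FileObject -> RecordSet edges from a parsed dataset.
diagram = to_mermaid([("ratings_csv", "ratings"), ("movies_csv", "movies")])
```

The resulting text can be pasted into any mermaid renderer, or embedded directly in a markdown doc.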
A tool/library to parse a DCF config and list errors, warnings, and suggestions for improvements.
For example: a `RecordSet`'s `source` field set to a non-existing file should raise an error (e.g.: "source X does not exist, did you mean Y?").
Up for debate, but it might be worth having different levels of errors/warnings, so some errors/warnings can be ignored in some contexts (e.g., while loading a DCF config to read a dataset) and raised in others (e.g., while validating a new DCF config).
The checker tool/library should be reusable from different entry points as much as possible (CLI, web UI, library).
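The "did you mean Y?" part can be sketched with `difflib` from the stdlib (the function name and message format are assumptions):

```python
from __future__ import annotations

import difflib

def check_source(source: str, known_sources: list[str]) -> str | None:
    """Returns an error message with a suggestion, or None if the source exists."""
    if source in known_sources:
        return None
    message = f"source {source} does not exist"
    # Suggest the closest known source, if any is similar enough.
    suggestions = difflib.get_close_matches(source, known_sources, n=1)
    if suggestions:
        message += f", did you mean {suggestions[0]}?"
    return message

assert check_source("ratings", ["ratings", "movies"]) is None
```

Because the function returns a message instead of raising, callers can decide per entry point (CLI, web UI, library) whether to treat it as fatal.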