redlink-gmbh / croissant

This project forked from mlcommons/croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.

Home Page: https://mlcommons.org/croissant

License: Apache License 2.0

croissant's Introduction

Croissant 🥐

Summary

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools.

Croissant builds on schema.org and its Dataset vocabulary, a widely used format for representing datasets on the Web and making them searchable.

Trying It Out

Croissant is currently under development by the community. You can try the Croissant implementation, mlcroissant:

Installation (requires Python 3.10+):

pip install mlcroissant

Loading an example dataset:

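import mlcroissant as mlc

# Load the Croissant metadata (JSON-LD) of the example dataset.
ds = mlc.Dataset("https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/1.0/gpt-3/metadata.json")
# Convert the metadata to a dict and print the dataset name and description.
metadata = ds.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")
# Iterate over the records of the "default" record set.
for x in ds.records(record_set="default"):
    print(x)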

Please see the notebook recipes for more examples.

Why a standard format for ML datasets?

Datasets are the source code of machine learning (ML), but working with ML datasets is needlessly hard: each dataset has its own file organization and its own method for translating file contents into data structures, so each one requires a new approach to using the data. We need a standard dataset format to make it easier to find and use ML datasets, and especially to develop tools for creating, understanding, and improving them.

The Croissant Format

Croissant 🥐 is a high-level format for machine learning datasets. Croissant brings together four rich layers (in a tasty manner, we hope 😉):

  • Metadata: description of the dataset, including responsible ML aspects
  • Resources: one or more files or other sources containing the raw data
  • Structure: how the raw data is combined and arranged into data structures for use
  • ML semantics: how the data is most often used in an ML context
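
As a rough illustration of how these layers sit side by side in one file, the sketch below splits a Croissant-style dictionary into metadata, resources, and structure. It assumes that resources live under a "distribution" key and structure under a "recordSet" key, and that every other top-level key is metadata. The split_layers helper and the toy dataset are hypothetical and not part of mlcroissant. The ML semantics layer is not separated out, because the sketch makes no assumption about which keys carry it.

```python
# Assumed key names for the resources and structure layers.
RESOURCE_KEY = "distribution"
STRUCTURE_KEY = "recordSet"


def split_layers(dataset: dict) -> dict:
    """Group a Croissant-style dict into metadata, resources, and structure."""
    metadata = {
        key: value
        for key, value in dataset.items()
        if key not in (RESOURCE_KEY, STRUCTURE_KEY)
    }
    return {
        "metadata": metadata,
        "resources": dataset.get(RESOURCE_KEY, []),
        "structure": dataset.get(STRUCTURE_KEY, []),
    }


# Hypothetical toy dataset description, used only to exercise the helper.
toy_dataset = {
    "name": "toy",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{"name": "toy.csv", "encodingFormat": "text/csv"}],
    "recordSet": [{"name": "rows", "field": [{"name": "label"}]}],
}

layers = split_layers(toy_dataset)
for layer_name, content in layers.items():
    print(layer_name, "->", content)
```

Keeping the layers in a single dictionary mirrors the single-file idea: a tool can read the descriptive part without touching the resources or the structure.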

Simple Format Example

Here is an extremely simple example of the Croissant format. The top-level properties carry the metadata, "distribution" lists the resources, and "recordSet" describes the structure:

{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "@id": "minimal.csv",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ],
  "recordSet": [
    {
      "@type": "ml:RecordSet",
      "name": "examples",
      "description": "Records extracted from the example table, with their schema.",
      "field": [
        {
          "@type": "ml:Field",
          "name": "name",
          "description": "The first column contains the name.",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {"@id": "minimal.csv"},
            "extract": {
              "column": "name"
            }
          }
        },
        {
          "@type": "ml:Field",
          "name": "age",
          "description": "The second column contains the age.",
          "dataType": "sc:Integer",
          "source": {
            "fileObject": {"@id": "minimal.csv"},
            "extract": {
              "column": "age"
            }
          }
        }
      ]
    }
  ]
}
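
To make the link between the record set and the CSV file more concrete, here is a minimal sketch of how a reader might turn such a description into typed records, using only the Python standard library. It is not how mlcroissant works internally. The CSV content, the CASTS table, and the helpers sha256_matches and iter_records are all assumptions made for illustration. The sketch looks for each field's column mapping under a "source" key and also accepts "references", so that it tolerates either spelling.

```python
import csv
import hashlib
import io

# Hypothetical stand-in for data/minimal.csv (not the real file contents).
CSV_TEXT = "name,age\nAda,36\nGrace,45\n"

# Trimmed record set containing only the keys this sketch reads.
RECORD_SET = {
    "name": "examples",
    "field": [
        {
            "name": "name",
            "dataType": "sc:Text",
            "source": {"fileObject": {"@id": "minimal.csv"}, "extract": {"column": "name"}},
        },
        {
            "name": "age",
            "dataType": "sc:Integer",
            "source": {"fileObject": {"@id": "minimal.csv"}, "extract": {"column": "age"}},
        },
    ],
}

# Assumed mapping from dataType strings to Python casts (illustrative only).
CASTS = {"sc:Text": str, "sc:Integer": int}


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of data equals expected_hex."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


def iter_records(record_set: dict, csv_text: str):
    """Yield one dict per CSV row, keyed by field name and cast by dataType."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        record = {}
        for field in record_set["field"]:
            # Accept either key so the sketch tolerates both spellings.
            mapping = field.get("source") or field.get("references")
            column = mapping["extract"]["column"]
            cast = CASTS.get(field["dataType"], str)
            record[field["name"]] = cast(row[column])
        yield record


# A real reader would compare the file against the "sha256" value in the
# metadata; here the digest is computed from the stand-in data itself.
digest = hashlib.sha256(CSV_TEXT.encode("utf-8")).hexdigest()
checksum_ok = sha256_matches(CSV_TEXT.encode("utf-8"), digest)

records = list(iter_records(RECORD_SET, CSV_TEXT))
for record in records:
    print(record)
```

The point of the sketch is that the field descriptions alone (column name plus data type) are enough to drive the extraction, so a generic loader needs no dataset-specific code.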

Resources

Getting involved

  • Join the mailing list
  • Attend Croissant meetings (please join the list to automatically receive the invite)
  • File issues for bugs or feature requests
  • Contribute code (please sign the MLCommons Association CLA first!)

Integrations

Licensing

Croissant project code and examples are licensed under the Apache License 2.0.

Governance

Croissant is being developed by the community as a Task Force of the MLCommons Association Datasets Working Group. The Task Force is open to anyone (as is the parent Datasets Working Group). The Task Force is co-chaired by Omar Benjelloun and Elena Simperl.

Contributors

Albert Villanova (Hugging Face), Andrew Zaldivar (Google), Baishan Guo (Meta), Carole Jean-Wu (Meta), Ce Zhang (ETH Zurich), Costanza Conforti (Google), D. Sculley (Kaggle), Dan Brickley (Schema.Org), Eduardo Arino de la Rubia (Meta), Edward Lockhart (Deepmind), Elena Simperl (King's College London), Goeff Thomas (Kaggle), Joan Giner-Miguelez (UOC), Joaquin Vanschoren (TU/Eindhoven, OpenML), Jos van der Velde (TU/Eindhoven, OpenML), Julien Chaumond (Hugging Face), Kurt Bollacker (MLCommons), Lora Aroyo (Google), Luis Oala (Dotphoton), Meg Risdal (Kaggle), Natasha Noy (Google), Newsha Ardalani (Meta), Omar Benjelloun (Google), Peter Mattson (MLCommons), Pierre Marcenac (Google), Pierre Ruyssen (Google), Pieter Gijsbers (TU/Eindhoven, OpenML), Prabhant Singh (TU/Eindhoven, OpenML), Quentin Lhoest (Hugging Face), Steffen Vogler (Bayer), Taniya Das (TU/Eindhoven, OpenML), Michael Kuchnik (Meta)

Thank you for supporting Croissant! 🙂
