dagshub / open-source-ml-datasets

This repository holds open-source datasets for various machine learning domains, each with a link to download and use it.

Home Page: https://dagshub.com/DagsHub/open-source-ml-datasets

hacktoberfest hacktoberfest2023 ai dagshub data data-engine dataset machine-learning mlops open-source

open-source-ml-datasets's Introduction

DagsHub Client


What is DagsHub?

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects. With DagsHub you can:

  1. Version code, data, and models in one place. Use the free provided DagsHub storage or connect it to your cloud storage.
  2. Track experiments using Git, DVC, or MLflow, to provide a fully reproducible environment.
  3. Visualize pipelines, data, and notebooks in an interactive, diff-able, and dynamic way.
  4. Label your data directly on the platform using Label Studio.
  5. Share your work with your team members.
  6. Stream and upload your data in an intuitive and easy way, while preserving versioning and structure.

DagsHub is built firmly around open, standard formats for your project. In particular:

Therefore, you can work with DagsHub regardless of your chosen programming language or frameworks.

DagsHub Client API & CLI

This client library is meant to help you get started quickly with DagsHub. It consists of Experiment Tracking and Direct Data Access (DDA), a component that lets you stream and upload your data.

For more details on the different functions of the client, check out the docs segments:

  1. Installation & Setup
  2. Data Streaming
  3. Data Upload
  4. Experiment Tracking
    1. Autologging
  5. Data Engine

Some functionality is supported only in Python.

To read about some of the awesome use cases for Direct Data Access, check out the relevant doc page.

Installation

pip install dagshub

Direct Data Access (DDA) functionality requires authentication, which you can easily do by running the following command in your terminal:

dagshub login

Quickstart for Data Streaming

The easiest way to start using DagsHub is via the Python Hooks method. To do this:

  1. Go to your DagsHub project.
  2. Copy the following 2 lines of code into your Python code which accesses your data:
    from dagshub.streaming import install_hooks
    install_hooks()
  3. That’s it! You now have streaming access to all your project files.

🤩 Check out this Colab to see Data Streaming working end to end:

Open In Colab

Next Steps

You can dive into the expanded documentation to learn more about data streaming, data upload, and experiment tracking with DagsHub.


Analytics

To improve your experience, we collect analytics on client usage. If you want to disable analytics collection, set the DAGSHUB_DISABLE_ANALYTICS environment variable to any value.
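
As a minimal sketch of the opt-out described above, the variable can also be set from Python. The helper name `disable_dagshub_analytics` is hypothetical and not part of the client; setting the variable before importing `dagshub` is an assumption made here to be on the safe side.

```python
import os


def disable_dagshub_analytics(value: str = "1") -> None:
    """Set DAGSHUB_DISABLE_ANALYTICS; the text above says any value works."""
    os.environ["DAGSHUB_DISABLE_ANALYTICS"] = value


# Assumption: call this at the top of the script, before `import dagshub`.
disable_dagshub_analytics()
print(os.environ["DAGSHUB_DISABLE_ANALYTICS"])
```

Exporting the variable in the shell before starting Python has the same effect for that session.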

Made with 🐶 by DagsHub.

open-source-ml-datasets's People

Contributors

ln11211, lunarmarathon, nirbarazida, plon-susk7, pr-peri, sookeyy-12, syedzubeen, yonomitt

open-source-ml-datasets's Issues

CNN News Articles (2011-2022)

Dataset Details:
The dataset, which has undergone initial cleaning, comprises CNN News Articles spanning from 2011 to 2022. It consists of two essential components: the category labels and the complete article text.

This dataset was sourced from Kaggle and can be accessed via the following URL: https://www.kaggle.com/datasets/hadasu92/cnn-articles-after-basic-cleaning. It has been divided into two distinct sets:

A training set containing 32,218 examples.
A test set containing 5,686 examples.
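
For orientation, the two counts above work out to roughly an 85/15 split. A quick check of the arithmetic, using only the numbers stated in this entry:

```python
TRAIN_EXAMPLES = 32_218
TEST_EXAMPLES = 5_686

# Total number of articles across both sets.
total = TRAIN_EXAMPLES + TEST_EXAMPLES
# Share of the data held out for testing.
test_fraction = TEST_EXAMPLES / total

print(f"total={total}, test share={test_fraction:.1%}")
# prints: total=37904, test share=15.0%
```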

Dataset URL:
https://huggingface.co/datasets/AyoubChLin/CNN_News_Articles_2011-2022

Flights Dataset

The "flights.csv" dataset contains information about the flights of an airport. Taken from Kaggle.

Amazon: Review Polarity

Dataset Details:
The Amazon reviews dataset comprises reviews sourced from Amazon, covering an 18-year timeframe with approximately 35 million reviews recorded up to March 2013. The reviews include details about products, users, ratings, and the review text itself.

Dataset URL:
https://huggingface.co/datasets/amazon_polarity

PUBHEALTH Dataset

Dataset Details:
PUBHEALTH is an extensive dataset designed for the purpose of facilitating explainable automated fact-checking of public health claims. Within the PUBHEALTH dataset, each entry is assigned a veracity label (true, false, unproven, mixture). Additionally, every entry includes an explanation text, which serves as a rationale for the specific veracity label assigned to the claim.

Dataset URL:
https://huggingface.co/datasets/health_fact

DiffusionDB dataset

DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.

DiffusionDB is publicly available at 🤗 Hugging Face Dataset.

BBC Hindi NLI Dataset

Dataset Details:
The BBC Hindi Dataset is designed for Natural Language Inference tasks in the Hindi language. It comprises textual-entailment pairs, with each row containing four columns: Premise, Hypothesis, Label, and Topic. The premise and hypothesis are presented in Hindi, while the entailment label is provided in English, categorized into two types: "entailed" and "not entailed." This dataset serves as a valuable resource for training models in the domain of Natural Language Inference in the Hindi language.

Dataset URL:
https://huggingface.co/datasets/bbc_hindi_nli

All Scam Spam

Dataset Details:
This dataset comprises a substantial collection of 42,619 text messages and emails that have undergone preprocessing, originating from individuals conversing in 43 different languages. In this dataset, "is_spam=1" signifies spam, while "is_spam=0" indicates non-spam (ham).

A set of 1,040 rows of balanced data, encompassing casual conversations and fraudulent email communications in approximately 10 languages, was meticulously gathered and annotated by me, with some collaboration from ChatGPT.

Dataset URL:
https://huggingface.co/datasets/FredZhang7/all-scam-spam
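
The `is_spam` convention described above can be illustrated with a few made-up rows. This is only a sketch: the sample messages are invented for the example, and the `text` column name is an assumption; only `is_spam` comes from the description.

```python
from collections import Counter

# Hypothetical rows for illustration; not taken from the dataset.
rows = [
    {"text": "You won a prize, click here", "is_spam": 1},
    {"text": "See you at lunch tomorrow", "is_spam": 0},
    {"text": "Urgent: verify your account", "is_spam": 1},
]

# is_spam=1 marks spam, is_spam=0 marks non-spam (ham).
spam = [row["text"] for row in rows if row["is_spam"] == 1]
ham = [row["text"] for row in rows if row["is_spam"] == 0]

# Class balance, useful to check before training a classifier.
counts = Counter("spam" if row["is_spam"] == 1 else "ham" for row in rows)
print(counts["spam"], counts["ham"])
# prints: 2 1
```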

Honest Dataset

Dataset Details:
The HONEST dataset offers templates designed to assess hurtful sentence completions in language models. These templates are available in six languages (English, Italian, French, Portuguese, Romanian, and Spanish) for binary gender and in English for LGBTQAI+ individuals. This dataset includes offensive and/or hateful content.

Dataset URL:
https://huggingface.co/datasets/MilaNLProc/honest

Clinical Trials: Reason To Stop

Dataset Details:
Within this dataset, you'll find a meticulously categorized compilation of over 5000 explanations behind premature terminations of clinical trials. These explanations have been sourced from clinicaltrials.gov, the foremost repository of clinical trial data, and have been carefully organized by contributors from the Open Targets organization, a project focused on furnishing drug development-relevant data.

Dataset URL:
https://huggingface.co/datasets/opentargets/clinical_trial_reason_to_stop

Chess Game Dataset (Lichess)

The Lichess Chess Game Dataset contains 20,000+ Lichess games, including moves, victor, rating, opening details, and more. Taken from Kaggle.

Tire Quality dataset

The dataset contains digital tire images, categorized into two classes: defective and good condition for quality inspection and classification. Taken from Kaggle.

EA Sports FIFA dataset

The datasets provided include the player data for the Career Mode from FIFA 15 to EA Sports FC 24. The data allows multiple comparisons for the same players across the last 10 versions of the video game. Taken from Kaggle.

Multi-Dimensional Gender Bias Classification Dataset

Dataset Details:
The Multi-Dimensional Gender Bias Classification dataset is constructed using a comprehensive framework that dissects gender bias within text across various pragmatic and semantic aspects. This includes bias related to the gender of the individual mentioned, the gender of the individual addressed, and the gender of the speaker. This dataset encompasses seven extensive datasets that have been automatically labeled with gender-related information (note that the HuggingFace distribution omits one dataset from the original project, which is the Wikipedia set). Additionally, it includes a crowdsourced evaluation benchmark for gender rewrites at the utterance level, a compilation of gendered names, and a catalog of gendered English words.

Dataset URL:
https://huggingface.co/datasets/md_gender_bias

CNN Dailymail Dataset

Dataset Details:
The CNN/DailyMail Dataset, composed of approximately 300,000 distinct news articles authored by journalists from CNN and the Daily Mail, primarily facilitates extractive and abstractive summarization. Nevertheless, its initial purpose was to aid machine reading and comprehension, as well as abstractive question answering.

Dataset URL:
https://huggingface.co/datasets/ccdv/cnn_dailymail
