dagshub / open-source-ml-datasets

This repository holds open source datasets for various machine learning domains with a link to download and use them

Home Page: https://dagshub.com/DagsHub/open-source-ml-datasets

hacktoberfest hacktoberfest2023 ai dagshub data data-engine dataset machine-learning mlops open-source

open-source-ml-datasets's Introduction

What is DagsHub?

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects. With DagsHub you can:

Version code, data, and models in one place. Use the free provided DagsHub storage or connect it to your cloud storage
Track Experiments using Git, DVC or MLflow, to provide a fully reproducible environment
Visualize pipelines, data, and notebooks in an interactive, diff-able, and dynamic way
Label your data directly on the platform using Label Studio
Share your work with your team members
Stream and upload your data in an intuitive and easy way, while preserving versioning and structure.

DagsHub is built firmly around open, standard formats for your project. In particular:

Git
DVC
MLflow
Label Studio
Standard data formats like YAML, JSON, CSV

Therefore, you can work with DagsHub regardless of your chosen programming language or frameworks.

DagsHub Client API & CLI

This client library is meant to help you get started quickly with DagsHub. It is made up of Experiment tracking and Direct Data Access (DDA), a component to let you stream and upload your data.

For more details on the different functions of the client, check out the docs segments:

Some functionality is supported only in Python.

To read about some of the awesome use cases for Direct Data Access, check out the relevant doc page.

Installation

pip install dagshub

Direct Data Access (DDA) functionality requires authentication, which you can easily do by running the following command in your terminal:

dagshub login

Quickstart for Data Streaming

The easiest way to start using DagsHub is via the Python Hooks method. To do this:

In your DagsHub project, copy the following 2 lines of code into the Python code that accesses your data:
```
from dagshub.streaming import install_hooks
install_hooks()
```
That’s it! You now have streaming access to all your project files.

🤩 Check out this colab to see an example of this Data Streaming work end to end:

Next Steps

You can dive into the expanded documentation to learn more about data streaming, data upload, and experiment tracking with DagsHub.

Analytics

To improve your experience, we collect analytics on client usage. If you want to disable analytics collection, set the DAGSHUB_DISABLE_ANALYTICS environment variable to any value.
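As a minimal sketch of the opt-out described above, the variable can also be set from Python. The value "1" is an arbitrary choice, since the text says any value works. Setting it before the client is imported is an assumption here, not something the text confirms.

```python
import os

# Any value disables analytics according to the text above; "1" is arbitrary.
# Assumption: the variable should be set before the dagshub client is imported.
os.environ["DAGSHUB_DISABLE_ANALYTICS"] = "1"

# Confirm the variable is visible to the current process.
analytics_disabled = "DAGSHUB_DISABLE_ANALYTICS" in os.environ
print(analytics_disabled)
```

Setting the variable in the shell or in a deployment configuration has the same effect for every process started from that environment.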

Made with 🐶 by DagsHub.

open-source-ml-datasets's People

Forkers

mridul-2003 lunarmarathon ln11211 sookeyy-12 plon-susk7 syedzubeen pr-peri odeyiany2

open-source-ml-datasets's Issues

CNN News Articles (2011-2022)

Dataset Details:
The dataset, which has undergone initial cleaning, comprises CNN News Articles spanning from 2011 to 2022. It consists of two essential components: the category labels and the complete article text.

This dataset was sourced from Kaggle and can be accessed via the following URL: https://www.kaggle.com/datasets/hadasu92/cnn-articles-after-basic-cleaning. It has been divided into two distinct sets:

A training set containing 32,218 examples.
A test set containing 5,686 examples.
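The two split sizes above imply a total and a test share that are quick to check. The sketch below only does that arithmetic on the counts quoted in this entry; the variable names are illustrative.

```python
# Split sizes quoted in the dataset description above.
train_examples = 32_218
test_examples = 5_686

total_examples = train_examples + test_examples
test_fraction = test_examples / total_examples

print(total_examples)          # 37904
print(f"{test_fraction:.1%}")  # 15.0%
```

In other words, the split is roughly 85% training data and 15% test data.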

Dataset URL:
https://huggingface.co/datasets/AyoubChLin/CNN_News_Articles_2011-2022

Flights Dataset

The "flights.csv" dataset contains information about the flights of an airport. Taken from Kaggle.

Stanford Human Preferences HF Dataset

The dataset can be found here: https://huggingface.co/datasets/stanfordnlp/SHP
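Many entries in this list point to a Hugging Face dataset URL of the same shape. As a rough sketch, the repository id can be extracted from such a URL with the standard library. The helper name `repo_id_from_url` is hypothetical and not part of any library.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Return the part of a Hugging Face dataset URL after '/datasets/'."""
    path = urlparse(url).path  # e.g. "/datasets/stanfordnlp/SHP"
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return path[len(prefix):].strip("/")


repo_id = repo_id_from_url("https://huggingface.co/datasets/stanfordnlp/SHP")
print(repo_id)

# With the Hugging Face `datasets` package installed, the id could then be
# passed to its loader (requires network access, so it is not run here):
#   from datasets import load_dataset
#   ds = load_dataset(repo_id)
```

Some datasets need extra arguments, such as a configuration name, so the commented call should be treated as a starting point rather than a guaranteed recipe.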

Breast Cancer Wisconsin (Diagnostic) Data Set

students-performance dataset

Add the students-performance dataset from Kaggle.

dialogsum Dataset

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 holdout examples for topic generation) with corresponding manually labeled summaries and topics.
The dataset can be found here: https://huggingface.co/datasets/knkarthick/dialogsum

Amazon: Review Polarity

Dataset Details:
The Amazon reviews dataset comprises reviews sourced from Amazon, covering an 18-year timeframe with approximately 35 million reviews recorded until March 2013. The reviews include details about products, users, ratings, and the actual text of the reviews.

Dataset URL:
https://huggingface.co/datasets/amazon_polarity

GRIT HF dataset

GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs
The dataset can be found here: https://huggingface.co/datasets/zzliang/GRIT

PUBHEALTH Dataset

Dataset Details:
PUBHEALTH is an extensive dataset designed for the purpose of facilitating explainable automated fact-checking of public health claims. Within the PUBHEALTH dataset, each entry is assigned a veracity label (true, false, unproven, mixture). Additionally, every entry includes an explanation text, which serves as a rationale for the specific veracity label assigned to the claim.

Dataset URL:
https://huggingface.co/datasets/health_fact

Emotions Dataset from HF

Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.
The dataset can be found here: https://huggingface.co/datasets/dair-ai/emotion

diffusiondb dataset

DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.

DiffusionDB is publicly available at 🤗 Hugging Face Dataset.

Ocular Toxoplasmosis Fundus Images Dataset

An eye disease dataset taken from Kaggle.

Updated the readme for Chest ct scan dataset

#74

BBC Hindi NLI Dataset

Dataset Details:
The BBC Hindi Dataset is designed for Natural Language Inference tasks in the Hindi language. It comprises textual-entailment pairs, with each row containing four columns: Premise, Hypothesis, Label, and Topic. The premise and hypothesis are presented in Hindi, while the entailment label is provided in English and takes one of two values: "entailed" and "not entailed." The dataset is a resource for training Natural Language Inference models for Hindi.

Dataset URL:
https://huggingface.co/datasets/bbc_hindi_nli

awesome-chatgpt-prompts HF Dataset

This dataset can be found here

All Scam Spam

Dataset Details:
This dataset comprises a substantial collection of 42,619 text messages and emails that have undergone preprocessing, originating from individuals conversing in 43 different languages. In this dataset, "is_spam=1" signifies spam, while "is_spam=0" indicates non-spam (ham).

A set of 1,040 rows of balanced data, encompassing casual conversations and fraudulent email communications in approximately 10 languages, was gathered and annotated by the dataset author, with some collaboration from ChatGPT.

Dataset URL:
https://huggingface.co/datasets/FredZhang7/all-scam-spam
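To illustrate the `is_spam` convention described for this dataset, here is a small sketch that separates spam from ham. The sample rows and the `text` column name are made up for illustration. Only the `is_spam` field and its 1/0 meaning come from the description above.

```python
# Hypothetical rows; the real dataset's other column names may differ.
rows = [
    {"text": "You won a prize, click here", "is_spam": 1},
    {"text": "See you at lunch tomorrow", "is_spam": 0},
    {"text": "Urgent: verify your account now", "is_spam": 1},
]

# is_spam=1 marks spam, is_spam=0 marks non-spam (ham).
spam = [row for row in rows if row["is_spam"] == 1]
ham = [row for row in rows if row["is_spam"] == 0]

print(len(spam), len(ham))  # 2 1
```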

T20 Cricket Dataset for ML

Honest Dataset

Dataset Details:
The HONEST dataset offers templates designed to assess hurtful sentence completions in language models. These templates are available in six languages (English, Italian, French, Portuguese, Romanian, and Spanish) for binary gender and in English for LGBTQAI+ individuals. This dataset includes offensive and/or hateful content.

Dataset URL:
https://huggingface.co/datasets/MilaNLProc/honest

Dataset Card for CodeSearchNet corpus

CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.
The dataset can be found here: https://huggingface.co/datasets/code_search_net

Clinical Trials: Reason To Stop

Dataset Details:
Within this dataset, you'll find a meticulously categorized compilation of over 5000 explanations behind premature terminations of clinical trials. These explanations have been sourced from clinicaltrials.gov, the foremost repository of clinical trial data, and have been carefully organized by contributors from the Open Targets organization, a project focused on furnishing drug development-relevant data.

Dataset URL:
https://huggingface.co/datasets/opentargets/clinical_trial_reason_to_stop

Synthetic Financial Fraud Detection Dataset

Synthetic tabular dataset generated by the PaySim mobile money simulator taken from Kaggle.

Twitter: Financial News Sentiment

Dataset Details:
The Twitter Financial News dataset is an English-language collection of finance-related tweets with annotations, serving the purpose of sentiment analysis for tweets associated with finance topics.

Dataset URL:
https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment

Add Corals Classification Dataset(Image)

I would like to add an image-based dataset on Corals Classification.

Chess Game Dataset (Lichess)

The Lichess Chess Game Dataset contains 20,000+ Lichess games, including moves, victor, rating, opening details, and more. Taken from Kaggle.

NYPD Crime Complaint Data Historic (2006-2019) [Data Science]

This issue was created on DagsHub by:
plon-Susk7

I worked on this dataset for my Data Science course semester project. It has categorical, continuous and discrete features.

hate_speech_offensive Dataset

An annotated dataset for hate speech and offensive language detection on tweets.
The dataset can be found here: https://huggingface.co/datasets/hate_speech_offensive

Bollywood Age Dataset [Image dataset]

An image dataset of Bollywood actors for age classification and Indian face/CV detection and generation.

Tire Quality dataset

The dataset contains digital tire images, categorized into two classes: defective and good condition for quality inspection and classification. Taken from Kaggle.

imdb movie reviews Dataset

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.
The dataset can be found here: https://huggingface.co/datasets/imdb

tatsu-lab/alpaca HF text-generation dataset

This dataset can be found here: https://huggingface.co/datasets/tatsu-lab/alpaca

facial keypoint detection Image dataset

Dataset for detecting the location of landmarks on a face, taken from Kaggle.

World Happiness Report 2021 Dataset

RAVDESS Emotional speech audio dataset

EA Sports FIFA dataset

The datasets provided include the player data for the Career Mode from FIFA 15 to EA Sports FC 24. The data allows multiple comparisons for the same players across the last 10 versions of the video game. It is taken from Kaggle.

Brain Tumor Image Dataset

Brain Tumor Classification (MRI) dataset taken from Kaggle.

FinFact Dataset

Fin-Fact is a comprehensive dataset designed specifically for financial fact-checking and explanation generation.
The dataset can be found here: https://huggingface.co/datasets/amanrangapur/Fin-Fact

Multi-Dimensional Gender Bias Classification Dataset

Dataset Details:
The Multi-Dimensional Gender Bias Classification dataset is constructed using a comprehensive framework that dissects gender bias within text across various pragmatic and semantic aspects. This includes bias related to the gender of the individual mentioned, the gender of the individual addressed, and the gender of the speaker. This dataset encompasses seven extensive datasets that have been automatically labeled with gender-related information (note that the HuggingFace distribution omits one dataset from the original project, which is the Wikipedia set). Additionally, it includes a crowdsourced evaluation benchmark for gender rewrites at the utterance level, a compilation of gendered names, and a catalog of gendered English words.

Dataset URL:
https://huggingface.co/datasets/md_gender_bias

CNN Dailymail Dataset

Dataset Details:
The CNN/DailyMail Dataset, composed of approximately 300,000 distinct news articles authored by journalists from CNN and the Daily Mail, primarily facilitates extractive and abstractive summarization. Nevertheless, its initial purpose was to aid machine reading and comprehension, as well as abstractive question answering.

Dataset URL:
https://huggingface.co/datasets/ccdv/cnn_dailymail

HF physics Dataset

The Physics dataset is composed of 20K problem-solution pairs obtained using GPT-4.
The dataset can be found here: https://huggingface.co/datasets/camel-ai/physics