google-research-datasets Goto Github PK

Name: Google Research Datasets

Type: Organization

Bio: Datasets released by Google Research

Location: Mountain View, CA

Google Research Datasets's Projects

This dataset contains about 110k images annotated with the depth and occlusion relationships between arbitrary objects. It enables research on the 2.5D Visual Relationship Detection (2.5VRD) introduced in https://arxiv.org/abs/2104.12727.

aart-ai-safety-dataset

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

agreesum

Article-summary entailment annotations for agreement-oriented multidoc summarization

ais

AIS is an evaluation framework for assessing whether the output of natural language models only contains information about the external world that is verifiable in source documents, or "Attributable to Identified Sources".

altentities

AltEntities (Alternative-Entities) is a dataset of English-language alternative questions with their answers. In these questions, a user is presented with a choice of two entities. The user's answer aims to select one entity from the pair by referring to it indirectly, i.e., without using its name or position.

answer-equivalence-dataset

This dataset contains human judgements about answer equivalence. The data is based on SQuAD (Stanford Question Answering Dataset), and contains 9k human judgements of answer candidates generated by Albert on the SQuAD train set, and an additional 14k human judgements for answer candidates produced by BiDAF, Luke, and XLNet on the SQuAD dev set.

aquamuse

AQuaMuSe is a novel scalable approach to automatically mine dual query based multi-document summarization datasets for extractive and abstractive summaries using a question answering dataset (Google Natural Questions) and large document corpora (Common Crawl).

attributed-qa

We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in information-seeking scenarios. This release consists of human-rated system outputs for a new question-answering task, Attributed Question Answering (AQA).

bam

birds-to-words

boolean-questions

c4_200m-synthetic-dataset-for-grammatical-error-correction

This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).

c4repset

C4RepSet: Representative Subset from C4 data for Training Pre-trained LMs

cats4ml-dataset

This dataset is a result of the CATS4ML (Crowdsourcing Adverse Test Sets for Machine Learning) Data Challenge, an adversarial test set sampling images and labels from the Open Images Dataset for state-of-the-art image classification models. The challenge invited participants to sample this existing publicly available dataset for images that are incorrectly classified by image classification models. It was announced at the HCOMP 2020 conference and ran for three months (Jan-Apr 2021), aiming at submissions by researchers and developers worldwide. The challenge is a first proof of concept for the approach using an existing AI dataset, and shows an immediate positive impact on improving evaluation datasets in AI research.

ccpe

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'. The 'assistant' elicits the 'user’s' preferences about movies following a Coached Conversational Preference Elicitation (CCPE) method. The assistant asks questions designed to minimize the bias in the terminology the 'user' employs to convey his or her preferences as much as possible, and to obtain these preferences in natural language. Each dialog is annotated with entity mentions, preferences expressed about entities, descriptions of entities provided, and other statements of entities.

chimera-painter-dataset

A rendered creature image dataset with part segmentation inputs for machine learning and neural rendering.

circa

The Circa (meaning ‘approximately’) dataset aims to help machine learning systems interpret indirect answers to polar questions. The dataset contains pairs of yes/no questions and indirect answers, together with annotations for the interpretation of the answer. The data was collected in 10 different social conversation situations (e.g., food preferences of a friend).

clang8

cLang-8 is a dataset for grammatical error correction.

clay

The dataset includes UI object type labels (e.g., BUTTON, IMAGE, CHECKBOX) that describe the semantic type of a UI object on Android app screenshots. It is used for training and evaluating screen layout denoising models (paper will be linked soon).

clse

The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by expert linguists. It includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.

coarse-discourse

A large corpus of discourse annotations and relations on ~10K forum threads.

common-crawl-domain-names

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

comparative-question-completion

Comparative Question Completion is a dataset of real user comparative questions gathered from question-answering websites. Each question compares two entities in one of three domains (animals, NBA players, and locations). In each example, one of the two entities is masked, and the goal is to find the masked entity.

conceptual-12m

Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

conceptual-captions

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

contrack

The Context Tracking dataset consists of annotated human-human social conversations. Each conversation contains annotations for the people and location entities mentioned, their properties and the relationships between them. The annotated data enables several subtasks like slot tagging, coreference resolution, resolving plural mentions and entity linking.

cpcd

The Conversational Playlist Creation Dataset (CPCD) contains 917 conversations between two people, in which users express preferences over sets of songs in natural language and wizards elicit preferences from users. The dataset includes per-song ratings and can be used to design and evaluate conversational recommendation systems.

crisscrossed-captions

Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
