openbiolink / thoughtsource

A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/

License: MIT License

dataset machine-learning natural-language-processing question-answering reasoning

thoughtsource's Introduction



OpenBioLink is a resource and evaluation framework for evaluating link prediction models on heterogeneous biomedical graph data. It contains benchmark datasets as well as tools for creating custom benchmarks and evaluating models.

Documentation

  • Paper preprint on arXiv
  • Peer-reviewed paper in the journal Bioinformatics (for citations)
  • Supplementary data

The OpenBioLink benchmark aims to meet the following criteria:

  • Openly available
  • Large-scale
  • Wide coverage of current biomedical knowledge and entity types
  • Standardized, balanced train-test split
  • Open-source code for benchmark dataset generation
  • Open-source code for evaluation (independent of model)
  • Integrating and differentiating multiple types of biological entities and relations (i.e., formalized as a heterogeneous graph)
  • Minimized information leakage between train and test sets (e.g., avoid inclusion of trivially inferable relations in the test set)
  • Coverage of true negative relations, where available
  • Differentiating high-quality data from noisy, low-quality data
  • Differentiating benchmarks for directed and undirected graphs in order to be applicable to a wide variety of link prediction methods
  • Clearly defined release cycle with versions of the benchmark and public leaderboard

Benchmark Dataset

The OpenBioLink2020 dataset is a highly challenging benchmark dataset containing over 5 million positive and negative edges. The test set does not contain trivially predictable inverse edges from the training set and contains all edge types, providing a more realistic link prediction scenario.

OpenBioLink2020: directed, high quality is the default dataset that should be used for benchmarking purposes. To allow analyzing the effect of data quality as well as of the directionality of the evaluation graph, four variants of OpenBioLink2020 are provided: directed and undirected, with and without quality cutoff.

Additionally, each graph is available in RDF N3 format (without train-validation-test splits).

OpenBioLink 2020 datasets

All datasets are hosted on zenodo.

Datasets summary

Dataset Train Test Valid Entities Relations
directed, high quality 8,503,580 401,901 397,066 184,732 28
undirected, high quality 7,559,921 372,877 357,297 184,722 28
directed, no quality cutoff 51,636,927 2,079,139 2,474,921 486,998 32
undirected, no quality cutoff 41,383,093 2,010,662 1,932,436 486,998 32
Previous versions of the Benchmark

OpenBioLink 2020 alpha-release

Please note that the OpenBioLink benchmark files contain data derived from external resources. Licensing terms of these external resources are detailed below.

Baseline results

Category Model MRR h@1 h@10
Latent RESCAL .320 .212 .544
Latent TransE .280 .175 .500
Latent DistMult .300 .193 .521
Latent ComplEx .319 .211 .547
Latent ConvE .288 .186 .510
Latent RotatE .286 .180 .511
Interpretable AnyBURL (Maximum) .277 .192 .457
Interpretable AnyBURL (Noisy-OR) .159 .098 .295
Interpretable SAFRAN* .306 .214 .501

Results are from the paper "LinkExplorer: Predicting, explaining and exploring links in large biomedical knowledge graphs" (Ott et al.). Embedding approaches were trained using LibKGE. The best hyperparameters found after an extensive hyperparameter search can be found in the supplementary material of the aforementioned paper.
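For reference, MRR is the mean reciprocal rank of the correct entity over all test triples, and h@k is the fraction of test triples for which the correct entity is ranked within the top k. A minimal, illustrative Python sketch with made-up ranks (not values from the benchmark):

# Illustrative only: ranks of the correct entity for five hypothetical test triples.
ranks = [1, 3, 12, 2, 50]

mrr = sum(1.0 / r for r in ranks) / len(ranks)
hits_at_1 = sum(r <= 1 for r in ranks) / len(ranks)
hits_at_10 = sum(r <= 10 for r in ranks) / len(ranks)

print(f"MRR: {mrr:.3f}, h@1: {hits_at_1:.3f}, h@10: {hits_at_10:.3f}")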

Installation

Pip

  1. Install a pytorch version suitable for your system https://pytorch.org/
  2. pip install openbiolink

Source

  1. Clone the git repository or download the project
  2. Create a new Python 3.7 or Python 3.6 virtual environment (note: under Windows, only Python 3.6 will work), e.g.: python3 -m venv my_venv
  3. Activate the virtual environment
    • Windows: my_venv\Scripts\activate
    • Linux/Mac: source my_venv/bin/activate
  4. Install a pytorch version suitable for your system https://pytorch.org/
  5. Install the requirements stated in requirements.txt, e.g. pip install -r requirements.txt

Manual

The OpenBioLink framework consists of three parts:

  1. Graph creation
  2. Dataset split
  3. Evaluation

The creation of the graph and the splitting of the created graph into training, test and an optional validation set can be performed either via the GUI or via the command line interface. The evaluation of a trained model is provided as part of the openbiolink library.

Graph creation & Dataset split

GUI

By calling openbiolink from the command line a graphical user interface is started, providing an interface to create a graph and perform a dataset split. Step by step instructions on how to use the GUI can be found in the wiki.

Command line interface

openbiolink -p WORKING_DIR_PATH [-action] [--options] ...
Graph Creation

To generate the default graph (with all edges of all quality levels) in the current directory, use:

openbiolink generate

For a list of arguments, use:

openbiolink generate --help
Dataset Split

To split the default graph using the random scheme, use:

openbiolink split rand --edges graph_files/edges.csv --tn-edges graph_files/TN_edges.csv --nodes graph_files/nodes.csv

For a list of arguments, use:

openbiolink split rand --help

Splitting can also be done by time with

openbiolink split time

More documentation will be provided later.

Evaluation

To ensure a standardized evaluation of different methods applied to the OpenBioLink dataset, an evaluator is provided in the openbiolink package. For examples of how to evaluate a model, see here.

Dataloader

All versions of the OpenBioLink datasets can be easily accessed within Python via the DataLoader, which downloads all required files automatically.

from openbiolink.evaluation.dataLoader import DataLoader

# Name of the Dataset, possible values HQ_DIR, HQ_UNDIR, ALL_DIR, ALL_UNDIR. Default: HQ_DIR
dl = DataLoader("HQ_DIR")

train = dl.training.mapped_triples
test = dl.testing.mapped_triples
valid = dl.validation.mapped_triples
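
The splits above are exposed through mapped_triples; assuming these behave like PyKEEN-style ID-mapped triple tensors of shape (num_triples, 3) (an assumption, not stated in this README), they can be inspected directly:

# Sketch, assuming mapped_triples is a (num_triples, 3) integer tensor of
# (head, relation, tail) IDs, as in PyKEEN-style triples factories.
print(train.shape)   # e.g. (8503580, 3) for the directed, high-quality split
print(train[:5])     # first five (head, relation, tail) ID triples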

File description

Graph Generation

TSV Writer

Default File Name Description Columns
ALL_nodes.csv All nodes present in the graph Node ID, Node type
edges.csv All true positive edges Node 1 ID, Edge type, Node 2 ID, Quality score, Source
edges_list.csv List of edge types present in edges.csv Edge type
nodes.csv All nodes present in edges.csv Node ID, Node type
nodes_list.csv List of node types present in nodes.csv Node type
TN_edges.csv All true negative edges Node 1 ID, Edge type, Node 2 ID, Quality score, Source
TN_edges_list.csv List of edge types present in TN_edges.csv Edge type
TN_nodes.csv All nodes present in TN_edges.csv Node ID, Node type
TN_nodes_list.csv List of node types present in TN_nodes.csv Node type
ids_no_mapping.tsv IDs of nodes that could not be mapped to other ontology systems Node ID, Node type
tn_ids_no_mapping.tsv IDs of nodes that could not be mapped to other ontology systems Node ID, Node type
stats.txt Statistics about edges.csv and nodes.csv (See column headers of file)
tn_stats.txt Statistics about TN_edges.csv and TN_nodes.csv (See column headers of file)

Biological Expression Language (BEL) Writer

The Biological Expression Language (BEL) is a domain specific language that enables the expression of biological relationships in a machine-readable format. It is supported by the PyBEL software ecosystem.

BEL can be exported with:

openbiolink generate --output-format BEL
Default File Name Description
positive.bel.gz All true positive edges in BEL Script format (gzipped) for usage in PyBEL or other BEL-aware applications
positive.bel.nodelink.json.gz All true positive edges in Nodelink JSON format (gzipped) for direct usage with PyBEL
negative.bel.gz All true negative edges in BEL Script format (gzipped)
negative.bel.nodelink.json.gz All true negative edges in Nodelink JSON format (gzipped)

Example opening BEL Script using pybel.from_bel_script():

import gzip
from pybel import from_bel_script
with gzip.open('positive.bel.gz') as file:
    graph = from_bel_script(file)

Example opening Nodelink JSON using pybel.from_nodelink_gz():

from pybel import from_nodelink_gz
graph = from_nodelink_gz('positive.bel.nodelink.json.gz')

There's an externally hosted copy of OpenBioLink here that contains the exports as BEL.

Train-test split creation

Default file name Description Column descriptions
train_sample.csv All positive samples from the training set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
test_sample.csv All positive samples from the test set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
val_sample.csv All positive samples from the validation set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_train_sample.csv All negative samples from the training set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_test_sample.csv All negative samples from the test set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_val_sample.csv All negative samples from the validation set Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
train_val_nodes.csv All nodes present in the training and validation set combined Node ID, Node type
test_nodes.csv All nodes present in the test set Node ID, Node type
removed_test_nodes.csv All nodes which were removed from the test set due to not being present in the training set Node ID
removed_val_nodes.csv All nodes which were removed from the validation set due to not being present in the training set Node ID

CURIEs

All node IDs in the graph are CURIEs, meaning entities can easily be looked up online by concatenating https://identifiers.org/ with the ID, e.g.:

CURIE Identifiers.org
GO:0006915 https://identifiers.org/GO:0006915
REACTOME:R-HSA-201451 https://identifiers.org/REACTOME:R-HSA-201451

Detailed information on how the identifiers are resolved can be found at https://registry.identifiers.org/
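
A minimal sketch of resolving a CURIE by this concatenation:

def curie_to_url(curie: str) -> str:
    # Build an identifiers.org resolver URL by concatenating the base URL and the CURIE.
    return "https://identifiers.org/" + curie

print(curie_to_url("GO:0006915"))             # https://identifiers.org/GO:0006915
print(curie_to_url("REACTOME:R-HSA-201451"))  # https://identifiers.org/REACTOME:R-HSA-201451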

Train-test-split creation

Random split

In the random split setting, first, negative sampling is performed. Afterwards, the whole dataset (containing positive and negative examples) is split randomly according to the defined ratio. Finally, post-processing steps are performed to facilitate training and to avoid information leakage.

Time-slice split

In the time-slice split setting, for both of the provided time slices, first, negative sampling is performed. Afterwards, the first time slice (t-1 graph) is used as training sample, while the difference between the first and the second time slice serves as the test set. Finally, post-processing steps are performed to facilitate training and to avoid information leakage.

Generally, the time-slice setting is trickier to implement than the random split strategy, as it requires more manual evaluation and knowledge of the data. One of the most difficult factors is the change of the source databases over time. For example, a database might change its quality score, or even its ID format. Also, the number of relationships stored might increase sharply due to new mapping files being used. This might also result in ‘vanishing edges’, where edges that were present in the t-1 graph no longer exist in the current graph. Although the OpenBioLink toolbox tries to assist the user with different kinds of warnings to identify such difficulties in the data, it is unfortunately not possible to automatically detect or solve all of these problems, making some manual pre- and post-processing of the data inevitable.

Negative sampling

First, the distribution of edges of different types is calculated to know how many samples are needed from each edge type. For now, this distribution corresponds to the original distribution (a uniform distribution could be a future extension). Then, subsamples are either – where possible – taken from existing true negative edges or are created using type-based sampling.

In type-based sampling, head and tail nodes are randomly sampled from a reduced pool of all nodes, which only includes nodes with types that are compatible with the corresponding head or tail role of the given relation type. E.g., for the relation type GENE_DRUG, one random node of type GENE is selected as the head node and one random node of type DRUG is selected as the tail node.

In most cases where true negative edges exist, however, their number is smaller than the number of positive examples. In these cases, all true negative samples are used for the negative set, which is then extended by samples created by type-based sampling.
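
A minimal sketch of the type-based sampling described above, with hypothetical node pools and relation typing (not the actual OpenBioLink implementation):

import random

# Hypothetical node pools grouped by type, and head/tail type constraints per relation type.
nodes_by_type = {
    "GENE": ["NCBIGENE:348", "NCBIGENE:1956", "NCBIGENE:7157"],
    "DRUG": ["PUBCHEM.COMPOUND:2244", "PUBCHEM.COMPOUND:3672"],
}
relation_types = {"GENE_DRUG": ("GENE", "DRUG")}

def sample_negative(relation, positives):
    # Sample a (head, relation, tail) edge whose node types match the relation,
    # rejecting candidates that are known positive edges.
    head_type, tail_type = relation_types[relation]
    while True:
        head = random.choice(nodes_by_type[head_type])
        tail = random.choice(nodes_by_type[tail_type])
        candidate = (head, relation, tail)
        if candidate not in positives:
            return candidate

positives = {("NCBIGENE:1956", "GENE_DRUG", "PUBCHEM.COMPOUND:2244")}
print(sample_negative("GENE_DRUG", positives))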

Train-test-set post-processing

To facilitate model application

  • Edges that contain nodes that are not present in the training set are dropped from the test set. This facilitates the use of embedding-based models, which usually cannot make predictions for nodes that were not embedded during training.

Avoiding train-test information leakage and trivial predictions in the test set

  • Removal of reverse edges: If the graph is directed, reverse edges are removed from the training set. The reason for this is that if the original edge a-b was undirected, both directions a→b and a←b are materialized in the directed graph. If one of these directed edges were present in the training set and the other in the test set, the prediction would be trivial. Therefore, in these cases, the reverse edges are removed from the training set. (Note that edges are removed from the training set instead of the test set because this is advantageous for maintaining the train-test-set ratio.) A minimal sketch of this step follows after this list.
  • Removal of super-properties: Some types of edges have sub-property characteristics, meaning that relationship x indicates a generic interaction between two entities (e.g., protein_interaction_protein), while relationship y describes this relationship in more detail (e.g., protein_activation_protein). This means that the presence of x between two nodes does not imply the existence of a relation y between those same entities, but the presence of y necessarily implies the existence of x. These kinds of relationships could cause information leakage in the datasets; therefore, super-relations of relations present in the training set are removed from the test set.
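
A minimal sketch of the reverse-edge removal mentioned above, assuming edges are represented as (head, relation, tail) tuples and the reverse edge keeps the same relation label (an assumption for illustration):

def remove_reverse_edges(train_edges, test_edges):
    # Drop training edges whose reverse (tail, relation, head) appears in the test set,
    # so that no trivially invertible pair is split across train and test.
    reversed_test = {(t, r, h) for (h, r, t) in test_edges}
    return train_edges - reversed_test

train = {("A", "GENE_GENE", "B"), ("C", "GENE_GENE", "D")}
test = {("B", "GENE_GENE", "A")}
print(remove_reverse_edges(train, test))  # {('C', 'GENE_GENE', 'D')}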

True Negative edges

As randomly sampled negative edges can produce noise or trivial examples, true negative edges (i.e., relationships that were explicitly stated not to exist) were used wherever possible. Specifically, for disease_drug and disease_phenotype edges, true negative examples were extracted directly from the data source, as they were explicitly stated. For gene-anatomy relationships, over-expression and under-expression data were used as contradicting data. For other relationship types, e.g., gene_activation_gene and drug_inhibition_gene, this indirect creation of true negative samples could not be applied, as the relationship alone does not hold all the necessary information (the same substance can have both activating and inhibiting effects, e.g. depending on dosage).

Source databases and their licenses

Source type Source name License True neg. Score
edge (gene-gene) STRING CC BY No Yes
edge (gene-go) GO CC BY No Yes
edge (gene-disease) DisGeNet CC BY-NC-SA No Yes
edge (gene-phenotype) HPO Custom: HPO No No
edge (gene-anatomy) Bgee CC 0 Yes Yes
edge (gene-drug) STITCH CC BY No Yes
edge (gene-pathway) CTD Custom: CTD No No
edge (disease-phenotype) HPO Custom: HPO Yes No
edge (disease-drug) DrugCentral CC BY-SA Yes No
edge (drug-phenotype) SIDER CC BY-NC-SA No No
ontology (genes) GO CC BY
ontology (diseases) DO CC 0
ontology (phenotype) HPO Custom: HPO
ontology (anatomy) UBERON CC BY
mapping (UMLS-DO) DisGeNet CC BY-NC-SA
mapping (STRING-NCBI) STRING CC BY
mapping (ENSEMBL/UNIPROT-NCBI) UniProt CC BY
id (genes) NCBI Public Domain
id (go) GO CC BY
id (anatomy) UBERON CC BY
id (disease) DO CC 0
id (drug) PubChem Public Domain
id (phenotype) HPO Custom: HPO
id (pathway) REACTOME CC BY
id (pathway) KEGG Custom: KEGG

(True neg.: whether the data contains true negative relations; Score: whether the data contains evidence quality scores for filtering relations)

The OpenBioLink benchmark files integrate data or identifiers from these sources. The provenance of data items is captured in the benchmark files, and licensing terms of source databases apply to these data items. Please mind these licensing terms when utilizing or redistributing the benchmark files or derivatives thereof.

All original data in the benchmark files created by the OpenBioLink project (not covered by the licenses of external data sources) are released as CC 0.

We offer the benchmark files as-is and make no representations or warranties of any kind concerning the benchmark files, express, implied, statutory or otherwise, including without limitation warranties of title, merchantability, fitness for a particular purpose, non-infringement, or the absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not discoverable, all to the greatest extent permissible under applicable law.

2021 Mapping of relations to external datasets

To aid in comparing the contents of OpenBioLink with other external knowledge graphs, we created an extensive mapping of relations. It is available as a Google Sheet here.

Citation

@article{10.1093/bioinformatics/btaa274,
    author = {Breit, Anna and Ott, Simon and Agibetov, Asan and Samwald, Matthias},
    title = "{OpenBioLink: a benchmarking framework for large-scale biomedical link prediction}",
    journal = {Bioinformatics},
    volume = {36},
    number = {13},
    pages = {4097-4098},
    year = {2020},
    month = {04},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa274},
    url = {https://doi.org/10.1093/bioinformatics/btaa274},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/13/4097/33458979/btaa274.pdf},
}

This project received funding from netidee.

thoughtsource's People

Contributors

asimo03, elmaestrobert, hwchase17, jas-ho, konstantinhebenstreit, llewi, louispk, matthias-samwald, mmoradi-iut, nomisto


thoughtsource's Issues

change saving of default template

When items are created, we do not have to save the template every time. We can just define it as 'default' or standard and save it somewhere, e.g. in the fragments.
In that way we can reduce the file size of the json outputs.

Create motivating demos of CoT streams

Create non-trivial, motivating demos of CoT streams for outreach and tool testing.

This should also include 1-3 examples of biomedical CoT streams with a few re-starts for each input to serve as a basis for the work of @Llewi

Dataset: AQuA

Homepage: https://github.com/deepmind/AQuA
Example:

{
"question": "A grocery sells a bag of ice for $1.25, and makes 20% profit. If it sells 500 bags of ice, how much total profit does it make?",
"options": ["A)125", "B)150", "C)225", "D)250", "E)275"],
"rationale": "Profit per bag = 1.25 * 0.20 = 0.25\nTotal profit = 500 * 0.25 = 125\nAnswer is A.",
"correct": "A"
}
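
A small sketch that parses the record above and re-checks the arithmetic in the rationale:

import json

# The example record, copied verbatim (note the escaped newlines in the rationale).
record = json.loads(
    '{"question": "A grocery sells a bag of ice for $1.25, and makes 20% profit. '
    'If it sells 500 bags of ice, how much total profit does it make?", '
    '"options": ["A)125", "B)150", "C)225", "D)250", "E)275"], '
    '"rationale": "Profit per bag = 1.25 * 0.20 = 0.25\\nTotal profit = 500 * 0.25 = 125\\nAnswer is A.", '
    '"correct": "A"}'
)

profit_per_bag = 1.25 * 0.20          # 0.25, as stated in the rationale
total_profit = 500 * profit_per_bag   # 125.0
print(total_profit, record["correct"])  # 125.0 A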

Dataset: StrategyQA

Contains explanations useful for CoT.
Steps are implicit in the question and should be inferred using a strategy. StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.

Figure: Questions in STRATEGYQA (Q1) require implicit decomposition into reasoning steps (D), for which we annotate supporting evidence from Wikipedia (E). This is in contrast to multi-step questions that explicitly specify the reasoning process (Q2).

Tutorial notebook does not open

I get the following error:
Invalid Notebook
Additional properties are not allowed ('id' was unexpected)
Using nbformat v5.8.0 and nbconvert v7.2.7

Dataset: CommonsenseQA

The CoTs for the CommonsenseQA dataset are derived from the explanations of the ECQA dataset.

Potential todos for post-processing the dataset to improve quality:

  • The first letter of each CoT entry is not consistently capitalized; this could be fixed easily


  • There is a substantial number of typos and grammatical errors that could be corrected. This is of course difficult given the vast size of the dataset. Maybe it could at least in part be done (semi-)automatically?

Dataset: GSM8K

Math CoT.
Contains 8.5K high-quality, linguistically diverse grade school math word problems that can be used for evaluating the ability of language models in multi-step mathematical reasoning. Every problem takes between 2 and 8 steps to solve, and a solution involves performing a sequence of simple calculations using basic arithmetic operations to reach the final answer.

Figure: Three example problems from GSM8K. Calculation annotations are highlighted in red.

save generated cots in case of error

The generate function only saves the generated CoTs once it has finished for all examples (see the data-loader script, which calls the function).
If any error happens, all previously generated CoTs are not saved and are therefore lost.

Generate takes very long time to finish

I have created the following application:

from cot import Collection
import os

os.environ["OPENAI_API_KEY"] = "<my_api_key>"


dataset = Collection(["med_qa"])
config = {
    "instruction_keys": ['qa-01'],
    "cot_trigger_keys": ['kojima-01'],
    "answer_extraction_keys": ['kojima-A-D'],
    "api_service": "openai",
    "engine": "text-davinci-003",
    "temperature": 0.35,
    "max_tokens": 512,
    "verbose": False,
    "warn": True
}

dataset_subset = dataset.select(split="train", number_samples=20, random_samples = True, seed = 0)

dataset_subset.generate(config = config)

Running the application, however, takes a very long time to finish (>> 30 minutes). During this time it progresses through the dataset, but each step takes a long time to finish and it seems like the application lags out at some point and doesn't continue generating.

After a while, the following message/warning appears:

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Fri, 26 May 2023 13:26:41 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7cd652355b51c2c3-VIE', 'alt-svc': 'h3=":443"; ma=86400'}.
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Fri, 26 May 2023 13:31:56 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7cd65a493845c2c3-VIE', 'alt-svc': 'h3=":443"; ma=86400'}.

Error during pip install

I got this error when doing pip install (in a fresh conda environment). Maybe a syntax error in one of the config files?

[Screenshot of the error]

Reorganize Github repo to accommodate multiple subcomponents

The current repo is focused on the data loaders. We want to also put other components into the repo, e.g. the App(s) from @Llewi or analysis notebooks. I guess we can simply move each component to a subfolder.

Should we have any specific tooling / workflows / CI to ensure coherence and compatibility of subcomponents (e.g. as described in https://monorepo.tools/)? Certainly any solution should be very lightweight.

`document_id` is redundant

As discussed in person with @matthias-samwald, we currently have three IDs in one sample that are the same for all examples in all datasets: id, document_id and question_id.

  • We agreed upon removing question_id, since it is too similar to id and, as a rule, is the same as id.
  • document_id is for referencing external documents, texts, etc. It should not have a value unless it actually references an external object.

Create script that generates an overview of our converted datasets and their contents

This is obviously still a bit underspecified, but we should be able to provide insights not only on particular data items and model predictions, but also datasets as a whole.

Once we have aggregated some HF datasets in the standardized ThoughtSource format, we need a script that iterates through the datasets and provides simple descriptions for each, e.g. (see the sketch after this list):

  • Number of dataset items
  • For each dataset feature, number of items where this feature is non-empty (to see which fields are filled in which datasets, many slots are optional)
  • For features that are arrays: min/max/median array length?
  • Anything more advanced? (probably not)
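
A rough sketch of such an overview script, assuming each converted dataset can be iterated as a list of dict-like items (the exact ThoughtSource format is not assumed here):

from collections import Counter

def summarize(name, items):
    # Print simple per-dataset statistics: item count, how often each feature is
    # non-empty, and min/max/median lengths of list-valued features.
    print(f"{name}: {len(items)} items")
    non_empty = Counter()
    list_lengths = {}
    for item in items:
        for key, value in item.items():
            if value not in (None, "", [], {}):
                non_empty[key] += 1
            if isinstance(value, list):
                list_lengths.setdefault(key, []).append(len(value))
    for key in sorted(non_empty):
        print(f"  {key}: non-empty in {non_empty[key]}/{len(items)} items")
    for key, lengths in sorted(list_lengths.items()):
        lengths.sort()
        print(f"  {key}: list length min={lengths[0]} max={lengths[-1]} "
              f"median={lengths[len(lengths) // 2]}")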

Add field 'generated_answer' to schema

One limitation of the current schema that occurred to me: we don't have a dedicated field for differentiating the "correct answer" from a "generated answer".
This is not yet a problem, but will become a problem once we also have datasets that capture (potentially wrong) answers from models and the follow-up feedback, corrections, etc.

I suggest we leave the field "answer" for the gold-standard answer, and add a field "generated_answer" for this future use-case. I hope the schema will be quite versatile and stable then.

evaluation for duplicated answer choices

Some datasets contain examples with 4 or 5 answer choices. I think what has been done is to simply duplicate one of the answer choices so that there are always 5 choices.
The evaluation script does not account for this.
Since we put letters in front of the choices (A, B, C, D, E), the model can also answer with a letter. But if the right choice appears in two places, it has two letters. This can lead to wrong evaluation scores when the scoring is based on the letters.

The first example is commonsense_qa, but there might be others.
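
One possible workaround, sketched below, is to score the answer text behind the predicted letter rather than the letter itself, so that duplicated choices with identical text count as equivalent (hypothetical helper, not the current evaluation script):

def is_correct(predicted_letter, choices, gold_letter):
    # Treat a prediction as correct if it points to the same choice text as the
    # gold letter, even when that text appears under several letters.
    letters = "ABCDE"
    predicted_text = choices[letters.index(predicted_letter)]
    gold_text = choices[letters.index(gold_letter)]
    return predicted_text == gold_text

# Example: one choice was duplicated to pad the options to five.
choices = ["garage", "dog house", "dog house", "kennel", "yard"]
print(is_correct("C", choices, "B"))  # True: both letters point to "dog house"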

Annotator sort by dataset

The annotator does not sort the examples by dataset.
This only becomes a problem when you load in a collection which incorporates examples from multiple datasets.
In that case it will just sort all of the examples by their id, no matter which dataset they belong to.

Annotator "star" feature does not work properly

The feature with the star (key: "preferred") does not really work. I think the primary problem is that it saves values as booleans, when they have to be saved as strings. (Apart from the minor problem that you cannot "unclick" a star once you have clicked it.)

The error is that you cannot load the data back into ThoughtSource (the "normal" cot lib we always use) after annotating it.

If you already have that problem, here is the quick fix:
You can replace the values in the JSON before loading it into ThoughtSource:
replace
"value": false
by
"value": "false"
and
"value": true
by
"value": "true"

Then you can load it. The annotator can afterwards still read the file too, but the star annotation is not shown.
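
The quick fix above can also be scripted; a minimal sketch that applies the two replacements to an annotated JSON file (the file name is a placeholder):

from pathlib import Path

path = Path("annotated_collection.json")  # placeholder file name
text = path.read_text(encoding="utf-8")

# Convert boolean "preferred" values to the strings ThoughtSource expects,
# exactly as described in the quick fix above.
text = text.replace('"value": false', '"value": "false"')
text = text.replace('"value": true', '"value": "true"')

path.write_text(text, encoding="utf-8")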

add conda

At the moment we use pip with venv.
Make it also work in conda with a conda.yml file.
