
context24's Introduction

Context24: Dataset repository for SDPROC Shared Task: Context24: Contextualizing Scientific Figures and Tables

This repository hosts the training/dev datasets and evaluation scripts for the 2024 Workshop on Scholarly Document Processing Shared Task: Context24: Contextualizing Scientific Figures and Tables

Submissions for the shared task will be evaluated on the eval.ai platform at this challenge URL: https://eval.ai/web/challenges/challenge-page/2306/overview (with a challenge period of May 31 10 am PST - June 6 5 pm PST).

Background and Problem

People read and use scientific claims both within the scientific process (e.g., in literature reviews, problem formulation, making sense of conflicting data) and outside of science (e.g., evidence-informed deliberation). When doing so, it is critical to contextualize claims with key supporting empirical evidence (e.g., figures of key results) and methodological details (e.g., measures, sample).

However, retrieving this contextual information when encountering and using claims in the moment (often far removed from the source materials and data) is difficult and time-consuming. Can we train AI models to help with this?

To assist with the development of such models, this dataset contains 474 examples of scientific claims actually in use in lab notes and discussions for synthesis and research planning, across domains of biology, computer science, and the social sciences. For all of these claims, the dataset includes “gold” annotations for figures/tables that ground the key results behind each claim (this is “Task 1” for the workshop: see below). For a subset of these claims, we have “gold” examples of text snippets that describe the key methodological details that ground each claim (this is “Task 2” for the workshop: see below).

Dataset and directory structure

The claims and papers from this task come from four separate datasets, each of which comes from a different set of research domains.

  1. akamatsulab: Cell biology
  2. BIOL403: Cell biology
  3. dg-social-media-polarization: Social sciences (political science, economics, HCI)
  4. megacoglab: Various (HCI, psychology, economics, CS, public health)

The directory structure is as follows:

task1-train-dev.json
task1-train-dev-2024-04-25-update.json
task2-train-dev.json
full_texts.json
full_texts-2024-04-25-update.json
figures-tables/
	citekey/
		FIG 1.png
		...
silver-data/
eval/
extracted_captions/
	citekey1.json
	...
task1-test.json
task2-test.json
test_figures-tables/
test_extracted_captions/
full_texts-test.json

The main training/dev datasets are in task1-train-dev.json and task2-train-dev.json.

Parsed figures, tables, and captions for each paper, as .png files, are in figures-tables/, organized into subfolders by paper citekey.

Caption texts extracted by running OCR via Nougat on corresponding caption .png files from figures-tables are provided in extracted_captions, organized into .json files per paper citekey. Since these are automatically extracted, some might be empty/incorrect due to extraction failures.

The current set of full-text parses for each paper is in full_texts-2024-04-25-update.json. The JSON is structured as a dictionary, where each key is a citekey (e.g., nomura2004human) and its value is a string containing the whole full-text parse for that paper.
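
For example, here is a minimal sketch (not part of the official tooling) of loading the full-text parses and pulling out one paper's text, assuming the dictionary structure described above:

import json

# Load the dictionary mapping each citekey to its full-text parse (one long string per paper).
with open("full_texts-2024-04-25-update.json") as f:
    full_texts = json.load(f)

# Look up the parse for one paper by its citekey.
text = full_texts["nomura2004human"]
print(text[:500])  # first 500 characters of the full-text parse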

Evaluation scripts for each task are in eval/

As an additional, possibly useful resource, silver-data/ contains full-text parses for 17,007 papers drawn from 1-2 hop in-bound and out-bound citations of the focal papers.

Test set claims are in task1-test.json and task2-test.json, respectively. The .png extracts of figures, tables, and captions for the corresponding citekeys are in test_figures-tables/, and the extracted caption texts are in test_extracted_captions/, with the same organizational structure as the training data. Full-texts are in full_texts-test.json.

Task 1: Evidence Identification

Task description

Given a scientific claim and a relevant research paper, predict a shortlist of key figures or tables from the paper that provide supporting evidence for the claim.

Here is an example claim with a Figure as its key supporting evidence.

{
	"id": "akamatsulab-WJvOy9Exn",
	"claim": "The density of free barbed ends increased as a function of growth stress",
	"citekey": "li2022molecular",
	"dataset": "akamatsulab",
	"findings": [
		"FIG 1D"
	]
}

And an example claim with a Figure and Table as its key supporting evidence.

{
	"id": "dg-social-media-polarization-tQy4sEF_R",	
	"claim": "Perceived polarization increased as a function of time spent reading a tweet, but only for Republican users",
	"citekey": "banksPolarizedFeedsThreeExperiments2021",
	"dataset": "dg-social-media-polarization",
	"findings": [
		"FIG 6",
		"TAB 2"
	]
}

Each claim corresponds to a paper via the citekey field. The figures, tables, and captions for that paper can be found in figures-tables/, under the subfolder with the same name as the citekey. The figures, tables, and captions are a set of .png files. Caption texts extracted via OCR can be found in a .json file with the same name as the citekey in extracted_captions/.
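
As an illustration, here is a minimal sketch (not part of the official tooling) that takes the first example claim above and gathers its candidate figure/table images and extracted captions; it assumes task1-train-dev.json is a JSON list of claim objects like the examples shown:

import json
from pathlib import Path

# Load the Task 1 training/dev claims.
with open("task1-train-dev.json") as f:
    claims = json.load(f)

# Use the example claim from above (akamatsulab-WJvOy9Exn, citekey li2022molecular).
claim = next(c for c in claims if c["id"] == "akamatsulab-WJvOy9Exn")
citekey = claim["citekey"]

# Candidate figures/tables are the .png files under figures-tables/<citekey>/.
candidates = sorted(p.stem for p in Path("figures-tables", citekey).glob("*.png"))
print(candidates)  # e.g., ["FIG 1", "FIG 1D", ...]

# OCR-extracted caption texts live in extracted_captions/<citekey>.json
# (some entries may be empty/incorrect due to extraction failures).
with open(Path("extracted_captions", f"{citekey}.json")) as f:
    captions = json.load(f)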

Scoring will be done using NDCG@5 and NDCG@10. See task1_eval.py in eval/ for more details.

Note

Many figures are compound figures, with labeled subfigures (e.g., FIG 1A, FIG 1B). Sometimes the relevant grounding figure is a subfigure, and sometimes it is the whole (parent) figure. We therefore provide figure parses for each parent figure as well as its subfigures (e.g., we provide both FIG 1 and FIG 1A, FIG 1B). And, accordingly, for NDCG scoring, predicted images that are parent/sub-figures of a gold figure label receive a relevance score of 0.5.
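
The official scoring lives in the eval scripts, but as a rough, illustrative sketch of the idea, NDCG@k with the graded relevance described above (1.0 for an exact match to a gold label, 0.5 for a parent/sub-figure of a gold label) could look like the following; is_parent_or_sub here is a simplified, hypothetical stand-in for the real matching logic:

import math

def is_parent_or_sub(pred, gold):
    # Simplified check for a "FIG 1" vs "FIG 1A" style parent/sub-figure relation.
    return pred != gold and (pred.startswith(gold) or gold.startswith(pred))

def relevance(pred, golds):
    if pred in golds:
        return 1.0
    if any(is_parent_or_sub(pred, g) for g in golds):
        return 0.5
    return 0.0

def ndcg_at_k(predicted, golds, k):
    rels = [relevance(p, golds) for p in predicted[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted([1.0] * len(golds) + [0.0] * max(0, k - len(golds)), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: two gold findings, three predictions, scored at k=5.
print(ndcg_at_k(["FIG 1", "TAB 2", "FIG 6"], ["FIG 6", "TAB 2"], k=5))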

Training and dev data description

There are currently 474 total scientific claims across the four datasets, in the following breakdown

Dataset N
akamatsulab 213
BIOL403 60
dg-social-media-polarization 78
megacoglab 123

393 claims were present in the initial release (in task1-train-dev.json), and 81 new claims were added on April 26. The full training dataset of 474 claims is in task1-train-dev-2024-04-25-update.json.

Test data description

The test set consists of 111 total scientific claims across two datasets, in the following breakdown

Dataset N
akamatsulab 51
megacoglab 60

Task 2: Grounding Context Identification

Task description

Given a scientific claim and a relevant research paper, identify all grounding context from the paper discussing methodological details of the experiment that resulted in this claim. For the purposes of this task, grounding context is restricted to quotes from the paper. These grounding context quotes are typically dispersed throughout the full text, often far from where the supporting evidence is presented.

For maximal coverage for this task, search for text snippets that cover the following key aspects of the empirical methods of the claim:

  1. What observable measures/data were collected
  2. How they were collected (with what methods, analyses, etc.)
  3. From who(m) (which participants, what dataset, what population, etc.)

NOTE: we will not be scoring the snippets separately by context "category" (e.g. who/how/what): we provide them here to clarify the requirements of the task.

Here is an example claim with quotes as its empirical methods context.

{
    "id": "megacoglab-W3sdOb60i",
    "claim": "US patents filed by inventors who were new to the patent's field tended to be more novel",
    "citekey": "artsParadiseNoveltyLoss2018a",
    "dataset": "megacoglab",
    "context": [
        "To assess patent novelty, we calculate new combinations (ln) as the logarithmic transformation of one plus the number of pairwise subclass combinations of a patent that appear for the first time in the US. patent database (Fleming et al. 2007, Jung and Jeongsik 2016). To do so, each pairwise combination of subclasses is compared with all pairwise combinations of all prior U.S. patents. (p. 5)",
        "we begin with the full population of inventors and collect all patents assigned to \ufb01rms but, by design, must restrict the sample to inventors who have at least two patents assigned to the same \ufb01rm. The advantage of this panel setup is that we can use inventor\u2013firm fixed effect models to control for unobserved heterogeneity among inventors and firms, which arguably have a strong effect on the novelty and value of creative output. This approach basically uses repeated patents of the same inventor within the same firm to identify whether the inventor creates more or less novel\u2014and more or less valuable\u2014patents when any subsequent patent is categorized in a new \ufb01eld. The sample includes 2,705,431 patent\u2013inventor observations assigned to 396,336 unique inventors and 46,880 unique firms, accounting for 473,419 unique inventor\u2013firm pairs. (p. 5)",
        "For each inventor-patent observation, we retrieve the three-digit technology classes of all prior patents of the focal inventor and identify whether there is any overlap between the three-digit technology classes of the focal patent and the three-digit technology classes linked o all prior patents of the same inventor. We rely on all classes assigned to a patent rather than just the primary class. Exploring new fields is a binary indicator that equals one in the absence of any overlapping class between all prior patents and the focal patent. (p. 6)",
        "we can use inventor\u2013\ufb01rm \ufb01xed effect models to control for unobserved heterogeneity among inventors and \ufb01rms, which arguably have a strong effect on the novelty and value of creative output (p. 5)",
        "we select the full population of inventors with U.S. patents assigned to \ufb01rms for 1975\u20132002 (p. 3)"
    ]
  },

In this example, the quotes fall into the following aspects of empirical methods:

What:

"To assess patent novelty, we calculate new combinations (ln) as the logarithmic transformation of one plus the number of pairwise subclass combinations of a patent that appear for the first time in the US. patent database (Fleming et al. 2007, Jung and Jeongsik 2016). To do so, each pairwise combination of subclasses is compared with all pairwise combinations of all prior U.S. patents. (p. 5)"

"For each inventor-patent observation, we retrieve the three-digit technology classes of all prior patents of the focal inventor and identify whether there is any overlap between the three-digit technology classes of the focal patent and the three-digit technology classes linked o all prior patents of the same inventor. We rely on all classes assigned to a patent rather than just the primary class. Exploring new fields is a binary indicator that equals one in the absence of any overlapping class between all prior patents and the focal patent. (p. 6)"

Who:

"we select the full population of inventors with U.S. patents assigned to \ufb01rms for 1975\u20132002 (p. 3)"

How:

"we begin with the full population of inventors and collect all patents assigned to \ufb01rms but, by design, must restrict the sample to inventors who have at least two patents assigned to the same \ufb01rm. The advantage of this panel setup is that we can use inventor\u2013firm fixed effect models to control for unobserved heterogeneity among inventors and firms, which arguably have a strong effect on the novelty and value of creative output. This approach basically uses repeated patents of the same inventor within the same firm to identify whether the inventor creates more or less novel\u2014and more or less valuable\u2014patents when any subsequent patent is categorized in a new \ufb01eld. The sample includes 2,705,431 patent\u2013inventor observations assigned to 396,336 unique inventors and 46,880 unique firms, accounting for 473,419 unique inventor\u2013firm pairs. (p. 5)"

"we can use inventor\u2013\ufb01rm \ufb01xed effect models to control for unobserved heterogeneity among inventors and \ufb01rms, which arguably have a strong effect on the novelty and value of creative output (p. 5)"

Scoring will be done using ROUGE and BERTScore similarity to the gold-standard quotes. See task2_eval.py in eval/ for more details.

Example test data

Task 2 is a "test-only" task. In liueu of training data, we are releasing a small (N=42) set of examples, which can be used to get an idea for the task, with the following breakdown across the akamatsulab and megacoglab datasets:

Dataset N
akamatsulab 28
megacoglab 14

Test data description

The test set consists of 109 total scientific claims across two datasets, in the following breakdown

Dataset N
akamatsulab 49
megacoglab 60

Evaluation and Submission

You can see how we will evaluate submissions for Tasks 1 and 2, both in terms of scoring and in terms of prediction file format and structure, by running the appropriate eval script on your predictions.

Submissions will be evaluated on the eval.ai platform at this challenge URL: https://eval.ai/web/challenges/challenge-page/2306/overview

The challenge is not yet live (pending some technical issues; it should be up in the next few days), but submissions will be accepted in the same format as expected by the eval scripts.

Task 1

Predictions for this task should be in a .csv file with two columns:

  1. The claim id (e.g., megacoglab-W3sdOb60i)
  2. The predicted figure/table ranking, a comma-separated string of figure/table names ordered from highest to lowest rank

Example:

claimid,predictions
megacoglab-W3sdOb60i,"FIG 1, TAB 1"

Warning

The script expects a header row, so make sure your csv has one; otherwise the first row of your predictions will be skipped. The names in the header row do not matter, because we don't use them to parse the predictions data.
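
For instance, here is a minimal sketch of writing predictions in this format with Python's csv module (the claim ids and rankings below are placeholders):

import csv

# Map each claim id to its ranked list of figure/table names, highest ranked first.
predictions = {
    "megacoglab-W3sdOb60i": ["FIG 1", "TAB 1"],
    "akamatsulab-WJvOy9Exn": ["FIG 1D", "FIG 1"],
}

with open("task1_predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["claimid", "predictions"])  # header row is required
    for claim_id, ranking in predictions.items():
        writer.writerow([claim_id, ", ".join(ranking)])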

To get scores for your predictions, inside the eval/ subdirectory, run task1_eval.py as follows:

python task1_eval.py --pred_file <path/to/predictionfilename>.csv --gold_file ../task1-train-dev.json --parse_folder ../figures-tables

You can optionally add --debug True if you want to dump scores for individual predictions for debugging/analysis.

Task 2

Predictions for this task should be in a .json file (similar in structure to the training-dev file) where each entry has the following fields:

  1. id (id of the claim)
  2. context (list of predicted snippets: order is not important)
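
A minimal sketch of writing a prediction file in this format (the snippet texts below are placeholders):

import json

# One entry per claim: the claim id plus a list of predicted context snippets (order does not matter).
predictions = [
    {
        "id": "megacoglab-W3sdOb60i",
        "context": [
            "placeholder snippet describing what measures/data were collected ...",
            "placeholder snippet describing the sample/dataset ...",
        ],
    },
]

with open("task2_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)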

Before running the eval script for Task 2, you will first need to install the required dependencies: bert-score and rouge-score.

bert-score: https://github.com/Tiiiger/bert_score

pip install bert-score

rouge-score:

pip install rouge-score
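
Once both packages are installed, you can sanity-check a predicted snippet against a gold quote with something like the sketch below (illustrative only; the official scoring is implemented in task2_eval.py):

from bert_score import score as bert_score
from rouge_score import rouge_scorer

gold = "we select the full population of inventors with U.S. patents assigned to firms for 1975-2002"
pred = "the sample covers all inventors with U.S. patents assigned to firms between 1975 and 2002"

# ROUGE-L F1 between the predicted snippet and the gold quote.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(scorer.score(gold, pred)["rougeL"].fmeasure)

# BERTScore precision/recall/F1 for the same pair (downloads a model on first use).
P, R, F1 = bert_score([pred], [gold], lang="en")
print(F1.item())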

Then run the task2_eval.py script in the following format:

python task2_eval.py --pred_file <path/to/predictionfilename>.json --gold_file ../task2-train-dev.json


context24's Issues

Test data release

When will you release the test data and the platform to submit results?
The organizers announced "Test set release: May 24th (Friday), 2024," but I cannot find it on the SDProc website or in this repository.

Some figures are missing

For your information, some figures seem to be missing from the figures-tables directory.

akamatsulab-Whvj4Tjxl - FIG 2A not found.
akamatsulab-Whvj4Tjxl - FIG 2B not found.
akamatsulab-Whvj4Tjxl - FIG 2C not found.
akamatsulab-Whvj4Tjxl - FIG 2D not found.
akamatsulab-Whvj4Tjxl - FIG 2E not found.
akamatsulab-Whvj4Tjxl - FIG 2F not found.
akamatsulab-Whvj4Tjxl - FIG 2G not found.
akamatsulab-B7UPmuzgg - FIG 5.4A not found.
dg-social-media-polarization-AQy-nbEwf - FIG 2 not found.

Ranking on EvalAI leaderboard

Hi!

It seems that the ranking on the leaderboard for Task 1 is based on the logic "the lower the nDCG@5 and nDCG@10 scores, the better". Shouldn't it be the opposite? Could you please elaborate a bit on the evaluation strategy?

Thanks in advance!

Report/paper submission

Hello 😊

Could you please elaborate on the paper/report that we can submit as participants in this shared task?

More specifically, we were wondering whether the same submission guidelines as for the general SDP workshop apply in terms of using the ACL template, and whether there are any restrictions on submitting a long vs. short paper.
Also, will these papers be published in the workshop proceedings?

Thanks a lot!

Original paper PDFs

My graduation project is to evaluate the results of the top 3 teams and then develop a new method of my own, but the quality of the figure/table .png files is not good. Do you have the original paper PDF files?

Task 1: Prediction file

Hello!

I am tackling the task as a binary classification problem. I am not sure whether the final prediction file should also contain rows for claim ids with empty predictions (none of the figs/tabs were predicted as providing evidence, i.e., assigned class 0). So far, I have generated files containing only claims with predictions (figs/tabs predicted as providing evidence, i.e., assigned class 1).

Are such empty rows considered during evaluation or is it fine to exclude them?

Thanks!

Edited: I have some concerns about the scores obtained. For some reason, when my prediction file contains only 8 rows (excluding the header) with predictions (no rows with empty predictions), the scores are considerably higher than when the file has 30 or more rows. Even if the predictions in the second scenario are worse, I would still expect the scores for the first case to be quite low, without such a large difference compared to the other.

Cannot find the training data

Hi,

Thanks for holding this interesting task.

I carefully read this repo, but cannot find the entry to download the training data.

Is this dataset released?
