ACM-CR: A Manually Annotated Test Collection for Citation Recommendation

This repository contains a test collection for (context-aware) citation recommendation, constructed from bibliographic records and open-access papers collected from the ACM Digital Library.

Requirements

Installing required Python modules

pip install -r requirements.txt 

Test collection

Document collection

Documents are bibliographic records of scientific papers (BibTeX entries) on IR-related topics collected from the ACM Digital Library. We use the conferences and journals sponsored by the ACM SIGs IR, KDD, CHI, WEB and MOD as a filter. BibTeX files are grouped by venue-year (e.g. 'sigir/sigir-1999' contains all records from SIGIR 1999). Details about the venues are presented here. An example document is given below:

@inproceedings{10.1145/3383583.3398517,
  author = {Gallina, Ygor and Boudin, Florian and Daille, B\'{e}atrice},
  title = {Large-Scale Evaluation of Keyphrase Extraction Models},
  year = {2020},
  isbn = {9781450375856},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3383583.3398517},
  doi = {10.1145/3383583.3398517},
  abstract = {Keyphrase extraction models are usually evaluated under different, not directly comparable, experimental setups. [...]},
  booktitle = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020},
  pages = {271–278},
  numpages = {8},
  keywords = {natural language processing, keyphrase generation, evaluation},
  location = {Virtual Event, China},
  series = {JCDL '20}
}

Some statistics of the document collection (2021-04-30): 114,882 records, of which 103,990 (91%) have abstracts and 83,517 (73%) have author-assigned keyphrases. We only consider 'article' and 'inproceedings' entries, and remove session papers (title starting with "Session" and an empty abstract).
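As an illustration, here is a minimal sketch of this filtering step, assuming the bibtexparser package (the file path is an example; this is not necessarily how the repository implements it):

import bibtexparser

# Load one venue-year BibTeX file (example path).
with open("sigir/sigir-1999.bib") as f:
    db = bibtexparser.load(f)

def keep(entry):
    # Only 'article' and 'inproceedings' entries are considered.
    if entry["ENTRYTYPE"] not in ("article", "inproceedings"):
        return False
    # Session papers: title starts with "Session" and the abstract is empty.
    if entry.get("title", "").startswith("Session") and not entry.get("abstract"):
        return False
    return True

records = [e for e in db.entries if keep(e)]
print(len(records), "records kept")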

Queries (citation contexts) and relevance judgments

Papers used for generating queries are in the data/topics+qrels directory. They are grouped by venue (e.g. sigir-2020), and each paper is represented by three separate files, e.g. for the paper with DOI 10.1145/3397271.3401032:

3397271.3401032.pdf   # pdf of the paper

3397271.3401032.dois  # manually curated list of dois for cited references
                      # the following precedence rules are used for mapping DOIs:
                      #   1. DOI from the ACM DL
                      #   2. DOI from another publisher (including ACL-anthology DOIs)
                      #   3. arxiv/pubmed/acl-anthology url
                      #   4. pdf url
                      #   5. None

3397271.3401032.xml   # manually extracted citation contexts

In total, there are 50 papers (the list of selected papers is in data/topics+qrels/papers/list.md); their manually extracted and curated citation contexts are stored in the following XML format:

<doc>
  <doi>10.1145/3397271.3401032</doi>
  <title>Measuring Recommendation [...]</title>
  <abstract>Explanations have a large effect on [...]</abstract>
  <contexts>
    <context id="01" section="introduction">
      <s>Recommendations are part of everyday life.</s>
      <s>Be they made by a person, or by [...]</s>
      <s cites="13,14,23,28">Explanations are known to strongly impact how the recipient of a recommendation responds [13, 14, 23, 28], yet the effect is still not well understood.</s>
    </context>
    [...]
  </contexts>
  <references>
    <reference id="1">10.1145/3173574.3174156</reference>
    <reference id="2">10.1145/3331184.3331211</reference>
    [...]
  </references>
</doc>
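To give a concrete idea of how these files can be consumed, here is a minimal parsing sketch using only the Python standard library (element and attribute names follow the example above):

import xml.etree.ElementTree as ET

doc = ET.parse("3397271.3401032.xml").getroot()

# Map reference ids to DOIs.
refs = {r.get("id"): r.text for r in doc.find("references")}

# Collect each citing sentence together with the DOIs it cites.
for context in doc.find("contexts"):
    for s in context.findall("s"):
        if s.get("cites"):
            dois = [refs.get(i) for i in s.get("cites").split(",")]
            print(context.get("id"), s.text, dois)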

Some statistics of the manually extracted citation contexts and relevance judgments, where (s) denotes sentence-level and (p) paragraph-level contexts:

python3 src/cc_stats.py --input data/topics+qrels/papers/ \
                        --collection data/topics+qrels/collection.txt

avg number of cited documents: 31.82 [8 - 71]
avg number of cited documents in collection: 15.86 [1 - 37]
avg coverage of collection: 0.5074 [0.0909 - 0.8333]
  • number of citation contexts (s): 837
  • number of 1+/all citation contexts (s): 552
        0     1+    All
    34.05  20.91  45.04
  • number of citation contexts (p): 341
  • number of 1+/all citation contexts (p): 269
        0     1+    All
    21.11  53.67  25.22

Document retrieval (TODO)

Installing anserini

Here, we use the open-source information retrieval toolkit Anserini, which is built on Lucene. Below are the installation steps for a Mac (tested on macOS 11.2.3), based on their Colab demo.

# install maven
brew install adoptopenjdk
brew install maven

# cloning / installing anserini
git clone https://github.com/castorini/anserini.git --recurse-submodules
cd anserini/
# for macOS 11.1 issues -> add the following to pom.xml
# <plugin>
#   <groupId>org.apache.maven.plugins</groupId>
#   <artifactId>maven-surefire-plugin</artifactId>
#   <configuration>
#     <forkCount>3</forkCount>
#     <reuseForks>true</reuseForks>
#     <argLine>-Xmx1024m -XX:MaxPermSize=256m</argLine>
#   </configuration>
# </plugin>
# for macOS 10.14 issues -> change jacoco from 0.8.2 to 0.8.3 in pom.xml to build correctly
# for macOS 10.13 issues -> see https://github.com/castorini/anserini/issues/648
mvn clean package appassembler:assemble

# compile evaluation tools and other scripts
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

Converting citations and building indexes

./src/0_create_data.sh
./src/1_create_indexes.sh
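The conversion step presumably produces Anserini's JsonCollection format, i.e. one JSON object per document with an id and a contents field. The sketch below shows what such a mapping could look like; concatenating title, abstract and keywords is an assumption suggested by the 't+a+k' in the run file names:

import json

# A single record, abbreviated from the BibTeX example above.
entry = {
    "doi": "10.1145/3383583.3398517",
    "title": "Large-Scale Evaluation of Keyphrase Extraction Models",
    "abstract": "Keyphrase extraction models are usually evaluated [...]",
    "keywords": "natural language processing, keyphrase generation, evaluation",
}

def to_json_doc(entry):
    # Concatenate title, abstract and keywords into the indexed contents.
    contents = " ".join(entry.get(k, "") for k in ("title", "abstract", "keywords"))
    return {"id": entry["doi"], "contents": contents}

with open("data/docs/collection.jsonl", "w") as out:
    out.write(json.dumps(to_json_doc(entry)) + "\n")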

Creating queries/qrels and retrieving citations using BM25

./src/2_retrieve.sh
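The run files produced here follow the standard TREC format consumed by trec_eval below (one line per retrieved document: query id, the literal Q0, document id, rank, score, run tag). A minimal sketch for reading such a run file:

from collections import defaultdict

# Parse a TREC-format run file: qid Q0 docid rank score tag
run = defaultdict(list)
with open("output/run.t+a+k.description.sentences.bm25.txt") as f:
    for line in f:
        qid, _q0, docid, rank, score, _tag = line.split()
        run[qid].append((docid, float(score)))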

Reranking using SciBERT

python3 src/scibert_reranker.py --collection data/docs/collection.jsonl \
                                --contexts data/topics+qrels/contexts.jsonl \
                                --input output/run.t+a+k.description.paragraphs.bm25.txt \
                                --output output/run.t+a+k.description.paragraphs.scibert.txt

python3 src/scibert_reranker.py --collection data/docs/collection.jsonl \
                                --contexts data/topics+qrels/sentences.jsonl \
                                --input output/run.t+a+k.description.sentences.bm25.txt \
                                --output output/run.t+a+k.description.sentences.scibert.txt
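The actual reranker is in src/scibert_reranker.py; as a rough illustration of the general idea only, one can score each (context, candidate) pair by the similarity of their SciBERT embeddings. allenai/scibert_scivocab_uncased is the standard SciBERT checkpoint; the pooling and scoring below are assumptions, not the repository's implementation:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

def embed(text):
    # Mean-pool the last hidden states into a single vector (an assumption).
    inputs = tok(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def score(context, candidate):
    # Higher cosine similarity -> higher rank after reranking.
    return torch.cosine_similarity(embed(context), embed(candidate), dim=0).item()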

python3 src/add-reranker.py output/run.t+a+k.description.paragraphs.bm25.txt \
                            output/run.t+a+k.description.paragraphs.scibert.txt \
                            output/run.t+a+k.description.paragraphs.bm25+scibert.txt
                            
python3 src/add-reranker.py output/run.t+a+k.description.sentences.bm25.txt \
                            output/run.t+a+k.description.sentences.scibert.txt \
                            output/run.t+a+k.description.sentences.bm25+scibert.txt
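src/add-reranker.py merges the BM25 and SciBERT runs into a single run; the exact combination scheme is not documented here. A common choice, sketched below under that assumption, is a linear interpolation of min-max-normalised scores (the weight alpha is also an assumption):

def minmax(scores):
    # Normalise a docid -> score mapping to [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / ((hi - lo) or 1.0) for d, s in scores.items()}

def combine(bm25, scibert, alpha=0.5):
    # bm25, scibert: docid -> score mappings for one query.
    bm25, scibert = minmax(bm25), minmax(scibert)
    return {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * scibert.get(d, 0.0)
            for d in set(bm25) | set(scibert)}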

Evaluating the retrieval models

./src/3_evaluate.sh

Evaluating output/run.t+a+k.description.paragraphs.bm25+scibert.txt
recall_10               all 0.3440
ndcg_cut_10             all 0.2937
Evaluating output/run.t+a+k.description.paragraphs.bm25.txt
recall_10               all 0.3358
ndcg_cut_10             all 0.2827
Evaluating output/run.t+a+k.description.paragraphs.scibert.txt
recall_10               all 0.3397
ndcg_cut_10             all 0.2307
Evaluating output/run.t+a+k.description.sentences.bm25+scibert.txt
recall_10               all 0.4030
ndcg_cut_10             all 0.3231
Evaluating output/run.t+a+k.description.sentences.bm25.txt
recall_10               all 0.3903
ndcg_cut_10             all 0.3156
Evaluating output/run.t+a+k.description.sentences.scibert.txt
recall_10               all 0.3340
ndcg_cut_10             all 0.1761
