Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering

License: MIT License

FiRA: Fine-grained Relevance Annotations for TREC-DL'19

Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering Sebastian Hofstätter, Markus Zlabinger, Mete Sertkan, Michael Schröder and Allan Hanbury. CIKM 2020 (Resource Track)

https://arxiv.org/abs/2008.05363

tl;dr We present FiRA: a novel dataset of Fine-Grained Relevance Annotations 🔍. We extend the ranked retrieval annotations of the Deep Learning Document Track of TREC 2019 with passage and word level graded relevance annotations for all relevant documents.

We split the documents into snippets and displayed query & document-snippet pairs to annotators. In the following figure, we show an example of an annotated document snippet. We ensure high quality by employing at least 3-way majority voting for every candidate and by continuously monitoring quality parameters, such as the time spent per annotation, during our annotation campaign. Furthermore, by requiring annotators to select relevant text spans, we reduce the probability of false-positive retrieval relevance labels.
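As a rough illustration of the majority voting mentioned above, the following sketch aggregates several annotator votes for one query & document-snippet pair. The integer grades, the `majority_vote` helper, and the tie-breaking rule are assumptions made for this example and are not taken from the dataset's own aggregation code.

```python
from collections import Counter


def majority_vote(labels, min_votes=3):
    """Return the most frequent label, or None if there are too few votes.

    Assumption: labels are integer relevance grades (hypothetical scale).
    Assumption: ties are broken in favour of the lower, more conservative grade.
    """
    if len(labels) < min_votes:
        return None
    counts = Counter(labels)
    top = max(counts.values())
    # Collect every label that reaches the top count, then pick the lowest.
    return min(label for label, n in counts.items() if n == top)


# Hypothetical votes from three annotators for a single pair
print(majority_vote([2, 2, 3]))
```

A different tie-breaking rule, for example requesting an additional judgement, would fit the same interface.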

In addition, we selected 10 pairs to be annotated by all our annotators, so that we can study annotator subjectivity with a dense distribution of judgements, as shown in the following figure:

Comparison of two completely judged document snippets. The background heatmap for each word displays the number of times this word was part of a relevant region selection. The darker the green the more often annotators selected this word as being relevant.
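The per-word counts behind such a heatmap could be computed roughly as follows. This is a minimal sketch that assumes each selection is a half-open `(start, end)` range of word indices; the `selection_counts` helper and this span format are hypothetical and may differ from how the raw annotations are stored.

```python
def selection_counts(num_words, selections):
    """Count, for every word position, how many selections cover it.

    Assumption: each selection is a half-open (start, end) word-index range.
    """
    counts = [0] * num_words
    for start, end in selections:
        # Clamp the range so out-of-bounds selections cannot raise an error.
        for i in range(max(start, 0), min(end, num_words)):
            counts[i] += 1
    return counts


# Two hypothetical selections over a five-word snippet
print(selection_counts(5, [(0, 2), (1, 4)]))  # prints [1, 2, 1, 1, 0]
```

Higher counts would correspond to darker cells in the heatmap.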

With FiRA annotations, we can observe the distribution of relevance inside long documents. The following figure shows a bias towards earlier regions of the documents; however, our results also show that a significant number of relevant snippets can be found later in the documents.
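One way to inspect such a position bias is to bin snippets by their relative position in the document and compute the share of relevant snippets per bin. The sketch below assumes each snippet is given as a hypothetical `(index, total_snippets, is_relevant)` tuple; it is an illustration, not the analysis code used for the paper.

```python
def relevance_by_position(snippets, num_bins=4):
    """Return the share of relevant snippets in each relative-position bin.

    Assumption: snippets are (index, total_snippets, is_relevant) tuples,
    with index counted from 0 within the snippet's document.
    """
    relevant = [0] * num_bins
    total = [0] * num_bins
    for index, n_snippets, is_relevant in snippets:
        # Map the snippet's relative position onto one of num_bins bins.
        b = min(index * num_bins // n_snippets, num_bins - 1)
        total[b] += 1
        if is_relevant:
            relevant[b] += 1
    # Bins without any snippet report a share of 0.0.
    return [r / t if t else 0.0 for r, t in zip(relevant, total)]


# Hypothetical snippets from a four-snippet and a two-snippet document
example = [
    (0, 4, True), (1, 4, True), (2, 4, False), (3, 4, True),
    (0, 2, True), (1, 2, False),
]
print(relevance_by_position(example))
```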

Please cite us as:

@inproceedings{Hofstaetter2020_fira,
 author = {Hofst{\"a}tter, Sebastian and Zlabinger, Markus and Sertkan, Mete and Schr{\"o}der, Michael and Hanbury, Allan},
 title = {Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering},
 booktitle = {Proc. of CIKM},
 year = {2020},
}

Usage Scenarios

The FiRA-TREC'19 dataset can be used for any approach concerned with long-document ranking and the selection of snippets or answers from these documents.

As an example, we used it to better understand and evaluate our TKL document ranking model.

Dataset

The FiRA dataset contains 24,199 query & document-snippet pairs of all 1,990 relevant documents for the 43 queries of TREC-DL.

The dataset folder contains:

  • Our processed document-snippets and annotation task list
  • Raw (anonymized) annotations
  • Majority voted judgements for document-snippet and document levels

Acknowledgements

We want to thank our students of the Advanced Information Retrieval course in the summer term of 2020 for annotating the data and being so patient and motivated in the process.
