This project forked from adobe-research/deft_corpus

The Definition Extraction From Text corpus and relevant formatting scripts

License: Other

Welcome to the DEFT corpus!

Welcome to the largest expertly annotated corpus for complex definition extraction in free text. Pardon our dust: this data is associated with SemEval 2020 Task 6 (DeftEval), and we are releasing the full dataset on the SemEval conference schedule. Train and dev data are available now, and test data will become available after the SemEval evaluation period ends on 2 Feb 2020. You can source the complete text from the corresponding textbooks at https://cnx.org.

The most recent version of the corpus was updated on 16 JAN 2020.

For more information regarding the annotation, schema, or general characteristics of the corpus, please see our paper, listed in the Citation section below.

Data Format

We are currently releasing annotated data using a CoNLL 2003-like format with the following structure:

TOKEN TXT_SOURCE_FILE START_CHAR END_CHAR TAG TAG_ID ROOT_ID RELATION

Character indices are derived from the brat standoff format. Tags follow a BIO format with the tag schema outlined in the paper.
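
As a rough illustration of how the eight-column layout above might be read, here is a minimal Python sketch. It is not part of the official formatting scripts. It assumes that columns are separated by whitespace (tabs or spaces), that a blank line ends a sentence, and that tags look like `B-Term` / `I-Term` / `O`. The names `DeftToken`, `parse_deft_lines`, `bio_spans`, and all values in `SAMPLE` are hypothetical and invented for this example.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class DeftToken(NamedTuple):
    """One row of the CoNLL-like file (field names are our own choice)."""
    token: str
    source_file: str
    start_char: int
    end_char: int
    tag: str
    tag_id: str
    root_id: str
    relation: str


def parse_deft_lines(lines: Iterable[str]) -> Iterator[List[DeftToken]]:
    """Yield one list of DeftToken per sentence.

    Assumption: fields are whitespace-separated and blank lines
    separate sentences. Rows without exactly 8 fields raise ValueError.
    """
    sentence: List[DeftToken] = []
    for line_no, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:  # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(fields) != 8:
            raise ValueError(f"line {line_no}: expected 8 fields, got {len(fields)}")
        tok, src, start, end, tag, tag_id, root_id, relation = fields
        sentence.append(
            DeftToken(tok, src, int(start), int(end), tag, tag_id, root_id, relation)
        )
    if sentence:  # file may not end with a blank line
        yield sentence


def bio_spans(sentence: List[DeftToken]) -> List[Tuple[str, int, int, str]]:
    """Merge B-/I- tagged tokens into (label, start_char, end_char, text) spans."""
    spans = []
    current = None  # [label, start_char, end_char, [tokens]]
    for tok in sentence:
        prefix, _, label = tok.tag.partition("-")
        if prefix == "B" and label:
            if current:
                spans.append(current)
            current = [label, tok.start_char, tok.end_char, [tok.token]]
        elif prefix == "I" and current and current[0] == label:
            current[2] = tok.end_char
            current[3].append(tok.token)
        else:  # "O" or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(lab, start, end, " ".join(toks)) for lab, start, end, toks in spans]


# Hypothetical rows for illustration only; not taken from the corpus.
SAMPLE = (
    "An\tsample.deft\t0\t2\tO\t-1\t-1\t0\n"
    "enzyme\tsample.deft\t3\t9\tB-Term\tT1\t-1\t0\n"
    "is\tsample.deft\t10\t12\tO\t-1\t-1\t0\n"
    "a\tsample.deft\t13\t14\tB-Definition\tT2\tT1\tDirect-Defines\n"
    "catalyst\tsample.deft\t15\t23\tI-Definition\tT2\tT1\tDirect-Defines\n"
    "\n"
)

if __name__ == "__main__":
    for sent in parse_deft_lines(SAMPLE.splitlines()):
        print(bio_spans(sent))
```

To read a real file, pass an open file handle to `parse_deft_lines` instead of `SAMPLE.splitlines()`. If the released files use a different delimiter or placeholder values than assumed here, the splitting logic would need to be adjusted.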

DeftEval Results

Results for SemEval 2020 Task 6 - DeftEval are included below:

Subtask 1 Results

Subtask 2 Results

Subtask 3 Results

We will continue to update the official leaderboard as the final evaluation period closes.

Licensing Information

The entire dataset of textbook sentences with annotations is available for use under the CC BY-NC-SA 4.0 license. Contact the authors for information on commercial use.

Acknowledgements

We would like to acknowledge the contributions of the annotation team, without which we would not have a corpus to share. Many thanks to Lucino Chiafullo, Danyi Huang, Micaela Kaplan, Roger LaCroix, Molly Moran, Jennifer Pei-Hsuan Lee, Harper Pollio-Barbee, and Keren Sun for their annotations and contributions.

Citation

If you use the DEFT corpus in your publication, please cite this paper:

@inproceedings{spala-etal-2019-deft,
    title = "{DEFT}: A corpus for definition extraction in free- and semi-structured text",
    author = "Spala, Sasha  and
      Miller, Nicholas A.  and
      Yang, Yiming  and
      Dernoncourt, Franck  and
      Dockhorn, Carl",
    booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4015",
    pages = "124--131",
    abstract = "Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.",
}

Contributors

franck-dernoncourt, marchbnr, sashaspala
