transformer-conll's Introduction

CoNLL-2003 English Named Entity Recognition.

NER challenge for CoNLL-2003 English. Annotations were taken from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus, RCV1.

Format of the train set

The train set has just two columns separated by TABs:

  • the expected BIO labels,
  • the document.

Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!

Preprocessing snippet located here

End-of-lines inside documents were replaced with the '' tag.
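As an illustration, the training file could be read in Python roughly as follows. This is a minimal sketch, not part of the challenge tooling: the helper names `parse_train_line` and `read_train` are hypothetical, and it assumes that the labels and the tokens within each column are separated by single spaces, with one label per token. Each line is split on the first TAB only and double quotes are left untouched, in line with the TSV note above.

```python
import lzma


def parse_train_line(line):
    """Split one training line into (labels, tokens).

    Assumption: both columns are space-separated sequences.
    """
    # Split on the first TAB only; quotes get no special treatment.
    labels, document = line.rstrip("\n").split("\t", 1)
    return labels.split(" "), document.split(" ")


def read_train(path="train/train.tsv.xz"):
    """Yield (labels, tokens) pairs from the xz-compressed train set."""
    # newline="\n" keeps any stray carriage returns inside the data as-is.
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_train_line(line)
```

For example, `next(read_train())` would return the labels and tokens of the first training item, provided the file exists at the default path.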

The train set is compressed with the xz compressor. To see a random sample of 10 training items, run:

xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The -S option disables line wrapping; press "q" to exit less.)

Format of the test sets

For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency but actually they do not contain TABs.)

To see the first 5 test items run:

cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5
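The same pairing can be sketched in Python, which may be handier when the items are needed inside a script. The function names `pair_lines` and `head_dev` are hypothetical; the sketch assumes only that line N of expected.tsv corresponds to line N of in.tsv, as the paste command above implies.

```python
from itertools import islice


def pair_lines(expected_lines, input_lines):
    """Yield (expected_labels, text) pairs, one per test item."""
    for expected, text in zip(expected_lines, input_lines):
        yield expected.rstrip("\n"), text.rstrip("\n")


def head_dev(directory="dev-0", count=5):
    """Return the first `count` items of a test set directory."""
    with open(f"{directory}/expected.tsv", encoding="utf-8") as expected, \
            open(f"{directory}/in.tsv", encoding="utf-8") as inputs:
        return list(islice(pair_lines(expected, inputs), count))
```

Calling `head_dev("dev-0")` mirrors the shell pipeline; it does not work for a test set whose expected.tsv is unavailable.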

The expected.tsv file for the test-A test set is hidden and is not available in the master branch.

Evaluation metrics

One evaluation metric is used:

  • BIO-F1
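The authoritative definition of BIO-F1 is whatever GEval implements. As a rough illustration only, an entity-level F1 over BIO labels could be computed as below. The sketch rests on assumptions that are not stated in this README: an entity counts as correct only when its boundaries and type both match exactly, counts are pooled over all items (micro-averaging), and a stray `I-` tag without a matching preceding tag opens a new span. The names `extract_spans` and `bio_f1` are hypothetical, and GEval may treat edge cases differently.

```python
def extract_spans(labels):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for index, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and kind == label[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((start, index, kind))
        if label.startswith(("B-", "I-")):
            # Assumption: a stray I- tag is treated as a span start.
            start, kind = index, label[2:]
        else:
            start, kind = None, None
    return spans


def bio_f1(expected_sequences, predicted_sequences):
    """Micro-averaged span-level F1 over parallel lists of label sequences."""
    matched = expected_total = predicted_total = 0
    for expected, predicted in zip(expected_sequences, predicted_sequences):
        gold = extract_spans(expected)
        guess = extract_spans(predicted)
        matched += len(gold & guess)
        expected_total += len(gold)
        predicted_total += len(guess)
    if matched == 0:
        return 0.0
    precision = matched / predicted_total
    recall = matched / expected_total
    return 2 * precision * recall / (precision + recall)
```

Scores from this sketch are meant for quick sanity checks during development and should not be expected to match GEval output exactly.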

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data
  • train/train.tsv.xz — train set
  • dev-0/ — directory with dev (test) data (split preserved from CoNLL-2003)
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden from the developers, not available in the master branch)

Usually, there is no reason to change these files.
