codait / identifying-incorrect-labels-in-conll-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.

License: Apache License 2.0

Languages: Jupyter Notebook 98.90%, Python 1.10%

identifying-incorrect-labels-in-conll-2003's People

Contributors

bryancutler, frreiss, kmh4321, xuhdev, zacheichen

Forkers

frreiss katrin24

identifying-incorrect-labels-in-conll-2003's Issues

Code for Identifying Incorrect Labels

Dear authors, thank you for sharing your work. I was wondering whether you could also provide the code for identifying the incorrect labels. From what I understand, the label corrections needed to produce a corrected version are provided, but the code to reproduce them is not available.

Best regards,
Georgios

Correct root cause of combined correction patches in annotated files

In #26, the combined corrections file all_conll_corrections_combined.csv was patched to fix errors that occurred when applying corrections to the corpus. We need to fix the root causes in the files under corrected_labels/human_labels_audited and run scripts/Label_Stats.ipynb again to generate a new all_conll_corrections_combined.csv that contains those corrections.

About 10 lines are tagged as "I-O"

"I-O" is not a valid tag. It should be "O". We corrected this by hand in our experiments, but we should fix this in the script.

eng.testa
11276:Kloof NNP I-NP I-O
11277:Gold NNP I-NP I-O
11278:Mining NNP I-NP I-O
11279:Co NNP I-NP I-O
42162:first JJ I-NP I-O
42163:division NN I-NP I-O
42217:first JJ I-NP I-O
42218:division NN I-NP I-O

eng.testb
12669:Zywiec NNP I-NP I-O
12670:Full NNP I-NP I-O
12671:Light NNP I-NP I-O
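
A minimal sketch of how such a fix might look as a post-processing step, assuming the four-column, space-separated layout shown above (token, POS tag, chunk tag, entity tag) and lines without trailing newlines or the grep-style line-number prefixes. The function name fix_invalid_o_tags is hypothetical and is not part of the repository's scripts.

```python
def fix_invalid_o_tags(lines):
    """Replace the invalid entity tag "I-O" with "O".

    Assumes each token line has four space-separated columns with the
    entity tag last. Hypothetical helper, not the repository's fix.
    """
    fixed = []
    for line in lines:
        fields = line.split(" ")
        # Only touch the entity-tag column; other lines pass through unchanged.
        if len(fields) == 4 and fields[3] == "I-O":
            fields[3] = "O"
        fixed.append(" ".join(fields))
    return fixed


sample = [
    "Kloof NNP I-NP I-O",
    "Gold NNP I-NP I-O",
    "this DT B-NP O",
    "",
]
fixed_sample = fix_invalid_o_tags(sample)
print(fixed_sample)
```

This only hides the symptom after the fact; the underlying cause in the correction script would still need to be addressed.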

download_and_correct_corpus.py does not handle "Missing" errors that overlap with "Span"

In document 42 of the "dev" fold, the lines:

at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O I-ORG
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP I-ORG
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O

have the following three corrections applied to them:

dev,42,"[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",ORG,Span,"[476, 500): 'Driefontein Consolidated'",ORG,
dev,42,,,Missing,"[505, 516): 'Gold Fields'",ORG,
dev,42,,,Missing,"[519, 539): 'Kloof Gold Mining Co'",ORG,

After these corrections, the lines should be tagged as follows:

at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O O
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP O
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O

Instead, download_and_correct_corpus.py produces this output:

at IN I-PP O
Driefontein NNP I-NP O
Consolidated NNP I-NP O
and CC O O
Gold NNP I-NP O
Fields NNP I-NP O
' POS B-NP O
Kloof NNP I-NP I-O
Gold NNP I-NP I-O
Mining NNP I-NP I-O
Co NNP I-NP I-O
this DT B-NP O

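The tokens that should be tagged I-ORG as a result of the two "Missing" type corrections are instead tagged "O" ('Gold Fields') or the invalid "I-O" ('Kloof Gold Mining Co'). The tokens of the corrected span 'Driefontein Consolidated' are also tagged "O" instead of I-ORG.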
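
One way to get the expected tagging is to apply the corrections in two passes: first clear every span that a "Span" correction replaces, then tag every corrected span, including those from "Missing" rows. The sketch below illustrates that idea only and is not the logic of download_and_correct_corpus.py. The helper names, the tuple layout for corrections, and the start offset 473 for "at" (inferred from the span offsets, assuming tokens are separated by single spaces) are all assumptions.

```python
import re

SPAN_RE = re.compile(r"\[(\d+), (\d+)\): '(.*)'")


def parse_span(text):
    """Parse a span string such as "[505, 516): 'Gold Fields'" into (begin, end)."""
    match = SPAN_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"Unrecognized span: {text!r}")
    return int(match.group(1)), int(match.group(2))


def apply_corrections(tokens, tags, start_offset, corrections):
    """Return new entity tags after applying (old_span, new_span, label) rows.

    Token offsets are rebuilt by assuming single spaces between tokens.
    Every corrected token gets an "I-" tag, so adjacent entities of the
    same type (which would need "B-") are not handled by this sketch.
    """
    offsets = []
    pos = start_offset
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1

    def inside(tok_span, span):
        return span[0] <= tok_span[0] and tok_span[1] <= span[1]

    new_tags = list(tags)
    # Pass 1: clear every span that a correction replaces.
    for old_span, _, _ in corrections:
        if old_span is not None:
            for i, tok_span in enumerate(offsets):
                if inside(tok_span, old_span):
                    new_tags[i] = "O"
    # Pass 2: tag every corrected span, so "Missing" rows that overlap
    # a replaced span are not wiped out afterwards.
    for _, new_span, label in corrections:
        for i, tok_span in enumerate(offsets):
            if inside(tok_span, new_span):
                new_tags[i] = "I-" + label
    return new_tags


tokens = ["at", "Driefontein", "Consolidated", "and", "Gold", "Fields",
          "'", "Kloof", "Gold", "Mining", "Co", "this"]
tags = ["O"] + ["I-ORG"] * 10 + ["O"]
old = "[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'"
corrections = [
    (parse_span(old), parse_span("[476, 500): 'Driefontein Consolidated'"), "ORG"),
    (None, parse_span("[505, 516): 'Gold Fields'"), "ORG"),
    (None, parse_span("[519, 539): 'Kloof Gold Mining Co'"), "ORG"),
]
result = apply_corrections(tokens, tags, 473, corrections)
for tok, tag in zip(tokens, result):
    print(tok, tag)
```

Separating the clearing pass from the tagging pass makes the result independent of the order in which the "Span" and "Missing" rows appear in the corrections file.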

Sports teams in test split

Again, thanks for sharing this work. The annotations look much improved over the original CoNLL. In particular, sports teams that were not well labeled originally are now much better.

However, it seems that some sports teams in the test split are still labeled as LOC, but I think they should be ORG. For example, in the first few sentences we see a couple of instances like this:

SOCCER NN I-NP O
- : O O
JAPAN NNP I-NP I-ORG
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O
, , O O
CHINA NNP I-NP I-LOC
IN IN I-PP O
SURPRISE DT I-NP O
DEFEAT NN I-NP O
. . O O

But CC O O
China NNP I-NP I-ORG
saw VBD I-VP O
their PRP$ I-NP O
luck NN I-NP O
desert VB I-VP O
them PRP I-NP O
in IN I-PP O
the DT I-NP O
second NN I-NP O
match NN I-NP O
of IN I-PP O
the DT I-NP O
group NN I-NP O
, , O O
crashing VBG I-VP O
to TO I-PP O
a DT I-NP O
surprise NN I-NP O
2-0 CD I-NP O
defeat NN I-NP O
to TO I-PP O
newcomers NNS I-NP O
Uzbekistan NNP I-NP I-LOC
. . O O

Japan NNP I-NP I-LOC
coach NN I-NP O
Shu NNP I-NP I-PER
Kamo NNP I-NP I-PER
said VBD I-VP O
...

In these examples "CHINA", "Uzbekistan" and "Japan" should be ORG (the other teams "JAPAN" and "China" in this example are labeled as ORG).

All corrected datasets are single-line files

After running python scripts/download_corpus_and_correct_labels.py, every generated dataset file consists of a single line, which is incorrect. We have to fix this before publishing the new integration script.

Reproduce results

Dear authors, I am trying to reproduce the results you reported in Table 3 for the case of "Akbik et al., 2019" trained and tested on the original corpus. Unfortunately, my results differ from the ones you reported. Furthermore, the results I get when using the author-provided tagger (model) also differ from the reported ones. Do you have any insights on this? Thank you in advance!

One entity marked as "MIC" instead of "MISC"

Hi all, thanks for making this work available!

I ran the scripts and got one entity marked as "MIC" instead of "MISC" in the training split:

Czech JJ I-NP I-MIC
midfielder NN I-NP O
Pavel NNP I-NP I-PER
Nedved NNP I-NP I-PER
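
A check along the following lines could catch tags such as "I-MIC" (and the "I-O" tags reported in other issues) before the corrected files are published. This is a sketch only: the function name find_invalid_tags is hypothetical, and it assumes four whitespace-separated columns with the entity tag last and the usual CoNLL-2003 entity types PER, ORG, LOC and MISC.

```python
VALID_TYPES = {"PER", "ORG", "LOC", "MISC"}


def find_invalid_tags(lines):
    """Return (line_number, tag) pairs for entity tags outside the expected set."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 4:
            continue  # blank lines and anything not in four-column form
        tag = fields[3]
        if tag == "O":
            continue
        # A valid tag looks like "I-ORG" or "B-ORG".
        prefix, _, entity_type = tag.partition("-")
        if prefix not in ("I", "B") or entity_type not in VALID_TYPES:
            problems.append((number, tag))
    return problems


sample = [
    "Czech JJ I-NP I-MIC",
    "midfielder NN I-NP O",
    "Pavel NNP I-NP I-PER",
    "Nedved NNP I-NP I-PER",
]
invalid = find_invalid_tags(sample)
print(invalid)
```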

Some spots require correction by hand

Test set:

Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.

Dev set:

There are 8 lines ending up with I-O, which is an invalid tag.

Changed label of national team names from LOC -> ORG incompatible with MUC guidelines

Hi,

Thank you for identifying these errors and releasing them, the explanations file justifying the corrections has been particularly helpful!

I am curious about the changes that affect national team mentions, the labels of which have been changed from LOC to ORG.
While this change makes sense to me, it conflicts with the MUC guidelines which state:

A.2.2 Miscellaneous ORG-type Entity-Expressions
Miscellaneous types of proper names that are to be tagged as ORGANIZATION include stock exchanges, multinational organizations, political parties, orchestras, unions, non-generic governmental entity names such as "Congress" or "Chamber of Deputies", sports teams and armies (unless designated only by country names, which are tagged as LOCATION),

They also have an example:

"In hockey action, Russia defeated France by a score of 7 to 3."
...Russia defeated France

Were these changes made in agreement with an updated guideline / agreement between annotators?

Thanks!

Update sentence boundaries file

After submitting the camera-ready, we found some minor omissions in the file of sentence boundary corrections that we used for the paper.

Once we've finished cleaning up a version of the corrected data set to go with the paper, we should update the sentence boundary corrections file with some additional corrections.

The script to regenerate the sentence corrections is scripts/sentence_correction_preprocessing.ipynb.

Span marked incorrect in dev fold document 7 may actually be correct

In document 7 of the dev fold, the corpus contains the entity

[1004, 1022): 'Boxing Association' / ORG

We currently mark this as a "Span" type error, with the corrected span being

[993, 1022): 'Panamanian Boxing Association'

This correction appears to be incorrect, based on a search through Wikipedia. There does not appear to be any organization called the "Panamanian Boxing Association". The organization that the document refers to looks to be the World Boxing Association, which is based in Panama; see https://en.wikipedia.org/wiki/World_Boxing_Association

Broken with Pandas 1.2

When I install the requirements via pip and then run python scripts/download_and_correct_corpus.py, I get

AttributeError: module 'pandas.core.ops' has no attribute '_get_op_name'

I am running on Windows with Python 3.9.2. I saw a similar issue in hgrecco/pint-pandas#51.

Fix correction script with span [16, 22): 'S Minn'

Currently download_and_correct_corpus.py has the following warning:

[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16
[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16

Automate token corrections

Go through the manual fixes (for token corrections) that were applied towards the end of the paper-writing process and either automate or semi-automate those fixes.
