codait / identifying-incorrect-labels-in-conll-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.

License: Apache License 2.0
Currently the script doesn't apply sentence boundary corrections, but it should. This can be done after the camera-ready copy is submitted; for now we apply sentence boundary corrections by hand.
Dear authors, thank you for sharing your work. I was wondering if you could also provide the code for identifying the incorrect labels. From what I understand, the label corrections needed to produce a corrected version are provided; however, the code to reproduce them is not available.
Best regards,
Georgios
In #26, the combined corrections file all_conll_corrections_combined.csv was patched to correct errors when applying corrections to the corpus. We need to fix the root causes in the files under corrected_labels/human_labels_audited and run scripts/Label_Stats.ipynb again to generate a new all_conll_corrections_combined.csv with those corrections.
"I-O" is not a valid tag. It should be "O". We corrected this by hand in our experiments, but we should fix this in the script.
eng.testa
11276:Kloof NNP I-NP I-O
11277:Gold NNP I-NP I-O
11278:Mining NNP I-NP I-O
11279:Co NNP I-NP I-O
42162:first JJ I-NP I-O
42163:division NN I-NP I-O
42217:first JJ I-NP I-O
42218:division NN I-NP I-O
eng.testb
12669:Zywiec NNP I-NP I-O
12670:Full NNP I-NP I-O
12671:Light NNP I-NP I-O
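Until the root cause in the script is fixed, a small post-processing pass could normalize these invalid tags. The following is a hypothetical sketch (the function name is an invention; it assumes whitespace-separated CoNLL columns with the NER tag last):

```python
def fix_invalid_io_tags(lines):
    """Replace the invalid NER tag 'I-O' with 'O' in CoNLL-formatted lines."""
    fixed = []
    for line in lines:
        parts = line.rstrip("\n").split()
        if parts and parts[-1] == "I-O":
            parts[-1] = "O"
            fixed.append(" ".join(parts))
        else:
            # Blank sentence separators and valid lines pass through unchanged.
            fixed.append(line.rstrip("\n"))
    return fixed
```

This only papers over the symptom; the real fix belongs in the correction-application script.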
In document 42 of the "dev" fold, the lines:
at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O I-ORG
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP I-ORG
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O
have the following three corrections applied to them:
dev,42,"[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",ORG,Span,"[476, 500): 'Driefontein Consolidated'",ORG,
dev,42,,,Missing,"[505, 516): 'Gold Fields'",ORG,
dev,42,,,Missing,"[519, 539): 'Kloof Gold Mining Co'",ORG,
After these corrections, the lines should be tagged as follows:
at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O O
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP O
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O
Instead, download_and_correct_corpus.py produces this output:
at IN I-PP O
Driefontein NNP I-NP O
Consolidated NNP I-NP O
and CC O O
Gold NNP I-NP O
Fields NNP I-NP O
' POS B-NP O
Kloof NNP I-NP I-O
Gold NNP I-NP I-O
Mining NNP I-NP I-O
Co NNP I-NP I-O
this DT B-NP O
The tokens that should be tagged I-ORG as a result of the two "Missing" type corrections are instead tagged "O".
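The bug looks like a failure to map the corrections' character spans back onto tokens. As a sketch of the mapping step (the real script's offset conventions are unknown; this assumes tokens joined by single spaces starting at offset 0), a [begin, end) span should select every token it overlaps:

```python
def tokens_in_span(tokens, begin, end):
    """Return indices of tokens overlapping the character span [begin, end)."""
    covered, offset = [], 0
    for i, tok in enumerate(tokens):
        tok_begin, tok_end = offset, offset + len(tok)
        # Overlap test: the token and the span share at least one character.
        if tok_begin < end and tok_end > begin:
            covered.append(i)
        offset = tok_end + 1  # skip the single joining space
    return covered
```

Every token selected this way for a "Missing" ORG correction should receive an I-ORG tag rather than "O".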
Fix any outdated READMEs and spruce them up a bit
Again, thanks for sharing this work. The annotations look much improved over the original CoNLL. Sports teams in particular, which were not well labeled originally, are now much better.
However, it seems that some sports teams in the test split are still labeled as LOC when I think they should be ORG. For example, in the first few sentences we see a couple of instances like this:
SOCCER NN I-NP O
- : O O
JAPAN NNP I-NP I-ORG
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O
, , O O
CHINA NNP I-NP I-LOC
IN IN I-PP O
SURPRISE DT I-NP O
DEFEAT NN I-NP O
. . O O
But CC O O
China NNP I-NP I-ORG
saw VBD I-VP O
their PRP$ I-NP O
luck NN I-NP O
desert VB I-VP O
them PRP I-NP O
in IN I-PP O
the DT I-NP O
second NN I-NP O
match NN I-NP O
of IN I-PP O
the DT I-NP O
group NN I-NP O
, , O O
crashing VBG I-VP O
to TO I-PP O
a DT I-NP O
surprise NN I-NP O
2-0 CD I-NP O
defeat NN I-NP O
to TO I-PP O
newcomers NNS I-NP O
Uzbekistan NNP I-NP I-LOC
. . O O
Japan NNP I-NP I-LOC
coach NN I-NP O
Shu NNP I-NP I-PER
Kamo NNP I-NP I-PER
said VBD I-VP O
...
In these examples, "CHINA", "Uzbekistan" and "Japan" should be ORG (the other team mentions in this example, "JAPAN" and "China", are already labeled ORG).
There is currently an error when inferring the correct entity type for CoNLL-3_train_in_gold.
After running python scripts/download_corpus_and_correct_labels.py, all generated dataset files are single-line files, which is not correct. We have to fix this before publishing the new integration script.
Dear authors, I am trying to reproduce the results you reported in Table 3 for the case of "Akbik et al., 2019" trained and tested on the original corpus. Unfortunately, my results differ from the ones you reported. Furthermore, the results I get when using the author-provided tagger (model) are still different from the ones reported. Do you have any insights on this? Thank you in advance!
Need to fix two warnings emitted by the script on the training dataset:
[WARNING] Could not find [224, 257): 'OCASEK GOVERNMENT OFFICE BUILDING'
[WARNING] Could not find [21, 24): 'T&N'
When verifying this notebook in #21, I got a divide-by-zero error (see #21 (comment)) and could not get the correct output. Need to check whether this is a real problem with the data.
Hi all, thanks for making this work available!
I ran the scripts and got one entity marked as "MIC" instead of "MISC" in the training split:
Czech JJ I-NP I-MIC
midfielder NN I-NP O
Pavel NNP I-NP I-PER
Nedved NNP I-NP I-PER
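Truncated tags like "I-MIC" (and the "I-O" cases reported elsewhere) could be caught before release with a validation pass over the generated files. A hypothetical sketch, assuming the standard CoNLL-2003 tag inventory and the NER tag in the last column:

```python
# The CoNLL-2003 tag set: "O" plus B-/I- prefixed entity types.
VALID_TAGS = {"O"} | {
    f"{prefix}-{etype}"
    for prefix in ("B", "I")
    for etype in ("PER", "ORG", "LOC", "MISC")
}

def find_invalid_tags(lines):
    """Return (line_number, tag) pairs whose NER tag is outside VALID_TAGS."""
    bad = []
    for i, line in enumerate(lines, start=1):
        parts = line.split()
        if parts and parts[-1] not in VALID_TAGS:
            bad.append((i, parts[-1]))
    return bad
```

Running such a check in CI on every regenerated corpus would flag both the "I-MIC" and "I-O" defects automatically.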
Test set:
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Dev set:
There are 8 lines ending up with I-O, which is an invalid tag.
It shouldn't end with a space, but it seems the current version we have has that space. Need to double-check this when finalizing the published dataset.
Hi,
Thank you for identifying these errors and releasing them, the explanations file justifying the corrections has been particularly helpful!
I am curious about the changes that affect national team mentions, the labels of which have been changed from LOC to ORG.
While this change makes sense to me, it conflicts with the MUC guidelines, which state:
A.2.2 Miscellaneous ORG-type Entity-Expressions
Miscellaneous types of proper names that are to be tagged as ORGANIZATION include stock exchanges, multinational organizations, political parties, orchestras, unions, non-generic governmental entity names such as "Congress" or "Chamber of Deputies", sports teams and armies (unless designated only by country names, which are tagged as LOCATION),
They also have an example:
"In hockey action, Russia defeated France by a score of 7 to 3."
...Russia defeated France
Were these changes made in agreement with an updated guideline / agreement between annotators?
Thanks!
After submitting the camera-ready, we found some minor omissions in the file of sentence boundary corrections that we used for the paper.
Once we've finished cleaning up a version of the corrected data set to go with the paper, we should update the sentence boundary corrections file with some additional corrections.
The script to regenerate the sentence corrections is scripts/sentence_correction_preprocessing.ipynb.
In document 7 of the dev fold, the corpus contains the entity
[1004, 1022): 'Boxing Association' / ORG
We currently mark this as a "Span" type error, with the corrected span being
[993, 1022): 'Panamanian Boxing Association'
This correction appears to be incorrect, based on looking through Wikipedia. There does not appear to be any organization called the "Panamanian Boxing Association". The organization the document refers to looks to be the World Boxing Association, which is based in Panama; see https://en.wikipedia.org/wiki/World_Boxing_Association
All should run without errors, and any warnings should be accounted for.
When I install the requirements via pip and then run python scripts/download_and_correct_corpus.py, I get
AttributeError: module 'pandas.core.ops' has no attribute '_get_op_name'
I am running on Windows with Python 3.9.2. I saw a similar issue in hgrecco/pint-pandas#51.
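The traceback suggests some dependency reaches into a private pandas helper, pandas.core.ops._get_op_name, that newer pandas releases no longer provide; whether pinning an older pandas helps is version-dependent. A small diagnostic sketch (the function name is an invention) can at least confirm which situation an environment is in:

```python
import importlib

def has_legacy_get_op_name():
    """Check whether the installed pandas still exposes the private helper.

    Returns True if pandas.core.ops._get_op_name exists, False if it was
    removed, and None if pandas is not installed at all.
    """
    try:
        ops = importlib.import_module("pandas.core.ops")
    except ImportError:
        return None
    return hasattr(ops, "_get_op_name")
```

If this returns False, the longer-term fix is updating the dependency that uses the private API rather than pinning pandas forever.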
Currently download_and_correct_corpus.py produces the following warnings:
[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16
[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16
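These "No span begins with 16" warnings look like offset drift between the corrections file and the rebuilt text. One hedged way to handle it (a sketch; the real script's data structures are unknown) is to fall back to a plain-text search when the recorded offset no longer matches:

```python
def locate(text, target, begin):
    """Find `target` in `text`, preferring the recorded offset `begin`.

    If the text at `begin` no longer matches (e.g. after sentence-boundary
    shifts), fall back to searching the whole document. Returns the offset
    where `target` was found, or None if it is absent entirely.
    """
    if text[begin:begin + len(target)] == target:
        return begin
    found = text.find(target)
    return found if found != -1 else None
```

A fallback match should probably still be logged, since silently accepting a shifted offset could attach a correction to the wrong mention.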
Go through the manual fixes (for token corrections) that were applied towards the end of the paper-writing process and either automate or semi-automate those fixes.