codait / identifying-incorrect-labels-in-conll-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.

License: Apache License 2.0
Currently the script doesn't apply sentence boundary corrections, but it should. This can be done after the camera-ready copy is submitted; for now we apply sentence boundary corrections by hand.
Dear authors, thank you for sharing your work. I was wondering if you could also provide the code for identifying the incorrect labels. From what I understand, the label corrections needed to produce a corrected version are provided; however, the code to reproduce them is not available.
Best regards,
Georgios
In #26, the combined corrections file all_conll_corrections_combined.csv was patched to correct errors when applying corrections to the corpus. We need to fix the root causes in the files under corrected_labels/human_labels_audited and run scripts/Label_Stats.ipynb again to generate a new all_conll_corrections_combined.csv with those corrections.
"I-O" is not a valid tag. It should be "O". We corrected this by hand in our experiments, but we should fix this in the script.
eng.testa
11276:Kloof NNP I-NP I-O
11277:Gold NNP I-NP I-O
11278:Mining NNP I-NP I-O
11279:Co NNP I-NP I-O
42162:first JJ I-NP I-O
42163:division NN I-NP I-O
42217:first JJ I-NP I-O
42218:division NN I-NP I-O
eng.testb
12669:Zywiec NNP I-NP I-O
12670:Full NNP I-NP I-O
12671:Light NNP I-NP I-O
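Until the root cause in the script is fixed, a small post-processing pass could normalize these invalid tags. The following is a hypothetical sketch (the function name is an invention; it assumes whitespace-separated CoNLL columns with the NER tag last):

```python
def fix_invalid_io_tags(lines):
    """Replace the invalid NER tag 'I-O' with 'O' in CoNLL-formatted lines."""
    fixed = []
    for line in lines:
        parts = line.rstrip("\n").split()
        if parts and parts[-1] == "I-O":
            parts[-1] = "O"
            fixed.append(" ".join(parts))
        else:
            # Blank sentence separators and valid lines pass through unchanged.
            fixed.append(line.rstrip("\n"))
    return fixed
```

This only papers over the symptom; the real fix belongs in the correction-application script.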
In document 42 of the "dev" fold, the lines:
at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O I-ORG
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP I-ORG
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O
have the following three corrections applied to them:
dev,42,"[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",ORG,Span,"[476, 500): 'Driefontein Consolidated'",ORG,
dev,42,,,Missing,"[505, 516): 'Gold Fields'",ORG,
dev,42,,,Missing,"[519, 539): 'Kloof Gold Mining Co'",ORG,
After these corrections, the lines should be tagged as follows:
at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O O
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP O
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O
Instead, download_and_correct_corpus.py produces this output:
at IN I-PP O
Driefontein NNP I-NP O
Consolidated NNP I-NP O
and CC O O
Gold NNP I-NP O
Fields NNP I-NP O
' POS B-NP O
Kloof NNP I-NP I-O
Gold NNP I-NP I-O
Mining NNP I-NP I-O
Co NNP I-NP I-O
this DT B-NP O
The tokens that should be tagged I-ORG as a result of the two "Missing" type corrections are instead tagged "O".
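The bug looks like a failure to map the corrections' character spans back onto tokens. As a sketch of the mapping step (the real script's offset conventions are unknown; this assumes tokens joined by single spaces starting at offset 0), a [begin, end) span should select every token it overlaps:

```python
def tokens_in_span(tokens, begin, end):
    """Return indices of tokens overlapping the character span [begin, end)."""
    covered, offset = [], 0
    for i, tok in enumerate(tokens):
        tok_begin, tok_end = offset, offset + len(tok)
        # Overlap test: the token and the span share at least one character.
        if tok_begin < end and tok_end > begin:
            covered.append(i)
        offset = tok_end + 1  # skip the single joining space
    return covered
```

Every token selected this way for a "Missing" ORG correction should receive an I-ORG tag rather than "O".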
Fix any outdated READMEs and spruce them up a bit
Again, thanks for sharing this work. The annotations look much improved over the original CoNLL. Sports teams in particular, which were not well labeled originally, are now much better.
However, it seems that some sports teams in the test split are still labeled as LOC when I think they should be ORG. For example, in the first few sentences we see a couple of instances like this:
SOCCER NN I-NP O
- : O O
JAPAN NNP I-NP I-ORG
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O
, , O O
CHINA NNP I-NP I-LOC
IN IN I-PP O
SURPRISE DT I-NP O
DEFEAT NN I-NP O
. . O O
But CC O O
China NNP I-NP I-ORG
saw VBD I-VP O
their PRP$ I-NP O
luck NN I-NP O
desert VB I-VP O
them PRP I-NP O
in IN I-PP O
the DT I-NP O
second NN I-NP O
match NN I-NP O
of IN I-PP O
the DT I-NP O
group NN I-NP O
, , O O
crashing VBG I-VP O
to TO I-PP O
a DT I-NP O
surprise NN I-NP O
2-0 CD I-NP O
defeat NN I-NP O
to TO I-PP O
newcomers NNS I-NP O
Uzbekistan NNP I-NP I-LOC
. . O O
Japan NNP I-NP I-LOC
coach NN I-NP O
Shu NNP I-NP I-PER
Kamo NNP I-NP I-PER
said VBD I-VP O
...
In these examples, "CHINA", "Uzbekistan" and "Japan" should be ORG (the other team mentions in this example, "JAPAN" and "China", are already labeled ORG).
There is currently an error when inferring the correct entity type for CoNLL-3_train_in_gold.
After running python scripts/download_corpus_and_correct_labels.py, all generated dataset files are single-line files, which is not correct. We have to fix this before publishing the new integration script.
Dear authors, I am trying to reproduce the results you reported in Table 3 for the case of "Akbik et al., 2019" trained and tested on the original corpus. Unfortunately, my results differ from the ones you reported. Furthermore, the results I get when using the author-provided tagger (model) are still different from the ones reported. Do you have any insights on this? Thank you in advance!
Need to fix two warnings emitted by the script on the training dataset:
[WARNING] Could not find [224, 257): 'OCASEK GOVERNMENT OFFICE BUILDING'
[WARNING] Could not find [21, 24): 'T&N'
When verifying this notebook in #21, I got a divide-by-zero error (see #21 (comment)) and could not get the correct output. Need to check whether this is a real problem with the data.
Hi all, thanks for making this work available!
I ran the scripts and got one entity marked as "MIC" instead of "MISC" in the training split:
Czech JJ I-NP I-MIC
midfielder NN I-NP O
Pavel NNP I-NP I-PER
Nedved NNP I-NP I-PER
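Truncated tags like "I-MIC" (and the "I-O" cases reported elsewhere) could be caught before release with a validation pass over the generated files. A hypothetical sketch, assuming the standard CoNLL-2003 tag inventory and the NER tag in the last column:

```python
# The CoNLL-2003 tag set: "O" plus B-/I- prefixed entity types.
VALID_TAGS = {"O"} | {
    f"{prefix}-{etype}"
    for prefix in ("B", "I")
    for etype in ("PER", "ORG", "LOC", "MISC")
}

def find_invalid_tags(lines):
    """Return (line_number, tag) pairs whose NER tag is outside VALID_TAGS."""
    bad = []
    for i, line in enumerate(lines, start=1):
        parts = line.split()
        if parts and parts[-1] not in VALID_TAGS:
            bad.append((i, parts[-1]))
    return bad
```

Running such a check in CI on every regenerated corpus would flag both the "I-MIC" and "I-O" defects automatically.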
Test set:
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Dev set:
There are 8 lines ending up with I-O, which is an invalid tag.
It shouldn't end with a space, but it seems the current version we have has that space. Need to double-check this when finalizing the published dataset.
Hi,
Thank you for identifying these errors and releasing them, the explanations file justifying the corrections has been particularly helpful!
I am curious about the changes that affect national team mentions, the labels of which have been changed from LOC to ORG.
While this change makes sense to me, it conflicts with the MUC guidelines, which state:
A.2.2 Miscellaneous ORG-type Entity-Expressions
Miscellaneous types of proper names that are to be tagged as ORGANIZATION include stock exchanges, multinational organizations, political parties, orchestras, unions, non-generic governmental entity names such as "Congress" or "Chamber of Deputies", sports teams and armies (unless designated only by country names, which are tagged as LOCATION),
They also have an example:
"In hockey action, Russia defeated France by a score of 7 to 3."
...Russia defeated France
Were these changes made in agreement with an updated guideline / agreement between annotators?
Thanks!
After submitting the camera-ready, we found some minor omissions in the file of sentence boundary corrections that we used for the paper.
Once we've finished cleaning up a version of the corrected data set to go with the paper, we should update the sentence boundary corrections file with some additional corrections.
The script to regenerate the sentence corrections is scripts/sentence_correction_preprocessing.ipynb.
In document 7 of the dev fold, the corpus contains the entity
[1004, 1022): 'Boxing Association' / ORG
We currently mark this as a "Span" type error, with the corrected span being
[993, 1022): 'Panamanian Boxing Association'
This correction appears to be incorrect, based on looking through Wikipedia. There does not appear to be any organization called the "Panamanian Boxing Association". The organization the document refers to looks to be the World Boxing Association, which is based in Panama; see https://en.wikipedia.org/wiki/World_Boxing_Association
All should run without errors, and any warnings should be accounted for.
When I install the requirements via pip and then run python scripts/download_and_correct_corpus.py, I get
AttributeError: module 'pandas.core.ops' has no attribute '_get_op_name'
I am running on Windows with Python 3.9.2. I saw a similar issue in hgrecco/pint-pandas#51.
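The traceback suggests some dependency reaches into a private pandas helper, pandas.core.ops._get_op_name, that newer pandas releases no longer provide; whether pinning an older pandas helps is version-dependent. A small diagnostic sketch (the function name is an invention) can at least confirm which situation an environment is in:

```python
import importlib

def has_legacy_get_op_name():
    """Check whether the installed pandas still exposes the private helper.

    Returns True if pandas.core.ops._get_op_name exists, False if it was
    removed, and None if pandas is not installed at all.
    """
    try:
        ops = importlib.import_module("pandas.core.ops")
    except ImportError:
        return None
    return hasattr(ops, "_get_op_name")
```

If this returns False, the longer-term fix is updating the dependency that uses the private API rather than pinning pandas forever.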
Currently download_and_correct_corpus.py produces the following warnings:
[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16
[WARNING] Could not find [16, 22): 'S Minn': No span begins with 16
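These "No span begins with 16" warnings look like offset drift between the corrections file and the rebuilt text. One hedged way to handle it (a sketch; the real script's data structures are unknown) is to fall back to a plain-text search when the recorded offset no longer matches:

```python
def locate(text, target, begin):
    """Find `target` in `text`, preferring the recorded offset `begin`.

    If the text at `begin` no longer matches (e.g. after sentence-boundary
    shifts), fall back to searching the whole document. Returns the offset
    where `target` was found, or None if it is absent entirely.
    """
    if text[begin:begin + len(target)] == target:
        return begin
    found = text.find(target)
    return found if found != -1 else None
```

A fallback match should probably still be logged, since silently accepting a shifted offset could attach a correction to the wrong mention.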
Go through the manual fixes (for token corrections) that were applied towards the end of the paper-writing process and either automate or semi-automate those fixes.