deft_corpus's People

Contributors

franck-dernoncourt, marchbnr, sashaspala

deft_corpus's Issues

CSV Parser in the evaluation script is not handling quotes correctly

The CSV parsers in the evaluation script do not configure "quote_char" or "quoting", so some examples are parsed incorrectly.
One example is the quote character at offset 773, near the beginning of the file "data/deft_files/dev/t3_physics_2_101.deft":

2951	 data/source_txt/t3_physics_2_101.deft	 759	 763	 O	 -1	 -1	 0
.	 data/source_txt/t3_physics_2_101.deft	 763	 764	 O	 -1	 -1	 0

3	 data/source_txt/t3_physics_2_101.deft	 765	 766	 O	 -1	 -1	 0
times	 data/source_txt/t3_physics_2_101.deft	 767	 772	 O	 -1	 -1	 0
"	 data/source_txt/t3_physics_2_101.deft	 773	 774	 O	 -1	 -1	 0
10	 data/source_txt/t3_physics_2_101.deft	 774	 776	 O	 -1	 -1	 0
"	 data/source_txt/t3_physics_2_101.deft	 776	 777	 O	 -1	 -1	 0
rSup	 data/source_txt/t3_physics_2_101.deft	 778	 782	 O	 -1	 -1	 0
{	 data/source_txt/t3_physics_2_101.deft	 783	 784	 O	 -1	 -1	 0
size	 data/source_txt/t3_physics_2_101.deft	 785	 789	 O	 -1	 -1	 0
...

This results in some of the failed assertions reported in the forums.
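A minimal sketch of the failure mode, assuming the evaluation script reads the files with Python's built-in csv module (the thread does not show the actual parser call). The sample rows and the shortened path are made up for illustration; only `csv.reader`, `delimiter`, `quoting` and `csv.QUOTE_NONE` are real API names.

```python
import csv
import io

# Three rows in the style of the excerpt above: a tab-separated file in which
# a bare double quote is itself a token. The path is a hypothetical placeholder.
SAMPLE = (
    '"\t data/source_txt/example.deft\t 773\t 774\t O\t -1\t -1\t 0\n'
    '10\t data/source_txt/example.deft\t 774\t 776\t O\t -1\t -1\t 0\n'
    '"\t data/source_txt/example.deft\t 776\t 777\t O\t -1\t -1\t 0\n'
)


def read_rows(text, **reader_kwargs):
    """Parse tab-separated text and return the rows as lists of fields."""
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, delimiter="\t", **reader_kwargs))


# Default dialect: the first '"' opens a quoted field that runs across line
# breaks until the next '"', so several physical lines collapse into one row.
default_rows = read_rows(SAMPLE)

# QUOTE_NONE: quote characters are treated as ordinary data, one row per line.
literal_rows = read_rows(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(literal_rows))
```

With `quoting=csv.QUOTE_NONE` every physical line stays its own row and the `"` token survives as the first field, which is what the offsets in the file expect.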

How to construct the dataset for Subtask 1

Subtask 1: Sentence Classification

Given a sentence, classify whether or not it contains a definition. This is the traditional definition extraction task.

Does this mean that a sentence contains no definition only when every token in the sentence is tagged “O”?
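To make the two possible readings concrete, here is a small sketch. It assumes (this is not confirmed anywhere in this thread) that a sentence is labeled positive when at least one token tag is of type "Definition"; the function name and the `positive_types` parameter are hypothetical. Under that assumption a sentence that contains only Term tags is not all-"O", yet is still labeled negative.

```python
def sentence_has_definition(tags, positive_types=("Definition",)):
    """Return True if any token tag is of one of the positive types.

    ASSUMPTION: tags look like "O", "B-Definition", "I-Term", ... and a
    sentence is positive when at least one tag has a positive type. The
    official Subtask 1 rule is not stated in this thread.
    """
    for tag in tags:
        if tag == "O":
            continue
        # Drop the "B-" / "I-" prefix to obtain the tag type.
        _, _, tag_type = tag.partition("-")
        if tag_type in positive_types:
            return True
    return False


all_outside = ["O", "O", "O"]
term_only = ["O", "B-Term", "I-Term", "O"]
with_definition = ["B-Term", "O", "B-Definition", "I-Definition"]

print(sentence_has_definition(all_outside))
print(sentence_has_definition(term_only))
print(sentence_has_definition(with_definition))
```

Passing a different `positive_types` tuple switches to another reading of the task without changing the rest of the code.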

[TOKENIZATION] #3

Update: the examples were originally taken from old data; they are now taken from the current repository data.

Filepath

train/t1_biology_2_202.deft

Content

As	 data/source_txt/t1_biology_rlacroix_202.txt	 13106	 13108	 O	 -1	 -1	 0
shown	 data/source_txt/t1_biology_rlacroix_202.txt	 13109	 13114	 O	 -1	 -1	 0
in	 data/source_txt/t1_biology_rlacroix_202.txt	 13115	 13117	 O	 -1	 -1	 0
[	 data/source_txt/t1_biology_rlacroix_202.txt	 13118	 13119	 O	 -1	 -1	 0
link]a	 data/source_txt/t1_biology_rlacroix_202.txt	 13119	 13125	 O	 -1	 -1	 0
,	 data/source_txt/t1_biology_rlacroix_202.txt	 13125	 13126	 O	 -1	 -1	 0
some	 data/source_txt/t1_biology_rlacroix_202.txt	 13127	 13131	 B-Definition	 T77	 0	 Refers-To
individual	 data/source_txt/t1_biology_rlacroix_202.txt	 13132	 13142	 I-Definition	 T77	 0	 Refers-To
prokaryotes	 data/source_txt/t1_biology_rlacroix_202.txt	 13143	 13154	 I-Definition	 T77	 0	 Refers-To
were	 data/source_txt/t1_biology_rlacroix_202.txt	 13155	 13159	 I-Definition	 T77	 0	 Refers-To
responsible	 data/source_txt/t1_biology_rlacroix_202.txt	 13160	 13171	 I-Definition	 T77	 0	 Refers-To
for	 data/source_txt/t1_biology_rlacroix_202.txt	 13172	 13175	 I-Definition	 T77	 0	 Refers-To
transferring	 data/source_txt/t1_biology_rlacroix_202.txt	 13176	 13188	 I-Definition	 T77	 0	 Refers-To
the	 data/source_txt/t1_biology_rlacroix_202.txt	 13189	 13192	 I-Definition	 T77	 0	 Refers-To
bacteria	 data/source_txt/t1_biology_rlacroix_202.txt	 13193	 13201	 I-Definition	 T77	 0	 Refers-To
that	 data/source_txt/t1_biology_rlacroix_202.txt	 13202	 13206	 I-Definition	 T77	 0	 Refers-To
caused	 data/source_txt/t1_biology_rlacroix_202.txt	 13207	 13213	 I-Definition	 T77	 0	 Refers-To
mitochondrial	 data/source_txt/t1_biology_rlacroix_202.txt	 13214	 13227	 I-Definition	 T77	 0	 Refers-To
development	 data/source_txt/t1_biology_rlacroix_202.txt	 13228	 13239	 I-Definition	 T77	 0	 Refers-To
to	 data/source_txt/t1_biology_rlacroix_202.txt	 13240	 13242	 I-Definition	 T77	 0	 Refers-To
the	 data/source_txt/t1_biology_rlacroix_202.txt	 13243	 13246	 I-Definition	 T77	 0	 Refers-To
new	 data/source_txt/t1_biology_rlacroix_202.txt	 13247	 13250	 I-Definition	 T77	 0	 Refers-To
eukaryotes	 data/source_txt/t1_biology_rlacroix_202.txt	 13251	 13261	 I-Definition	 T77	 0	 Refers-To
,	 data/source_txt/t1_biology_rlacroix_202.txt	 13261	 13262	 I-Definition	 T77	 0	 Refers-To
whereas	 data/source_txt/t1_biology_rlacroix_202.txt	 13263	 13270	 I-Definition	 T77	 0	 Refers-To
other	 data/source_txt/t1_biology_rlacroix_202.txt	 13271	 13276	 I-Definition	 T77	 0	 Refers-To
species	 data/source_txt/t1_biology_rlacroix_202.txt	 13277	 13284	 I-Definition	 T77	 0	 Refers-To
transferred	 data/source_txt/t1_biology_rlacroix_202.txt	 13285	 13296	 I-Definition	 T77	 0	 Refers-To
the	 data/source_txt/t1_biology_rlacroix_202.txt	 13297	 13300	 I-Definition	 T77	 0	 Refers-To
bacteria	 data/source_txt/t1_biology_rlacroix_202.txt	 13301	 13309	 I-Definition	 T77	 0	 Refers-To
that	 data/source_txt/t1_biology_rlacroix_202.txt	 13310	 13314	 I-Definition	 T77	 0	 Refers-To
gave	 data/source_txt/t1_biology_rlacroix_202.txt	 13315	 13319	 I-Definition	 T77	 0	 Refers-To
rise	 data/source_txt/t1_biology_rlacroix_202.txt	 13320	 13324	 I-Definition	 T77	 0	 Refers-To
to	 data/source_txt/t1_biology_rlacroix_202.txt	 13325	 13327	 I-Definition	 T77	 0	 Refers-To
chloroplasts	 data/source_txt/t1_biology_rlacroix_202.txt	 13328	 13340	 I-Definition	 T77	 0	 Refers-To
.	 data/source_txt/t1_biology_rlacroix_202.txt	 13340	 13341	 O	 -1	 -1	 0

Lines 2402-2437. Error in line 2406.

The	 data/source_txt/t1_biology_rlacroix_202.txt	 12685	 12688	 B-Term	 T72	 0	 Direct-Defines
nucleus	 data/source_txt/t1_biology_rlacroix_202.txt	 12689	 12696	 I-Term	 T72	 0	 Direct-Defines
-	 data/source_txt/t1_biology_rlacroix_202.txt	 12696	 12697	 I-Term	 T72	 0	 Direct-Defines
first	 data/source_txt/t1_biology_rlacroix_202.txt	 12697	 12702	 I-Term	 T72	 0	 Direct-Defines
hypothesis	 data/source_txt/t1_biology_rlacroix_202.txt	 12703	 12713	 I-Term	 T72	 0	 Direct-Defines
proposes	 data/source_txt/t1_biology_rlacroix_202.txt	 12714	 12722	 B-Definition	 T73	 T72	 Direct-Defines
that	 data/source_txt/t1_biology_rlacroix_202.txt	 12723	 12727	 I-Definition	 T73	 T72	 Direct-Defines
the	 data/source_txt/t1_biology_rlacroix_202.txt	 12728	 12731	 I-Definition	 T73	 T72	 Direct-Defines
nucleus	 data/source_txt/t1_biology_rlacroix_202.txt	 12732	 12739	 I-Definition	 T73	 T72	 Direct-Defines
evolved	 data/source_txt/t1_biology_rlacroix_202.txt	 12740	 12747	 I-Definition	 T73	 T72	 Direct-Defines
in	 data/source_txt/t1_biology_rlacroix_202.txt	 12748	 12750	 I-Definition	 T73	 T72	 Direct-Defines
prokaryotes	 data/source_txt/t1_biology_rlacroix_202.txt	 12751	 12762	 I-Definition	 T73	 T72	 Direct-Defines
first	 data/source_txt/t1_biology_rlacroix_202.txt	 12763	 12768	 I-Definition	 T73	 T72	 Direct-Defines
(	 data/source_txt/t1_biology_rlacroix_202.txt	 12769	 12770	 I-Definition	 T73	 T72	 Direct-Defines
[	 data/source_txt/t1_biology_rlacroix_202.txt	 12770	 12771	 I-Definition	 T73	 T72	 Direct-Defines
link]a	 data/source_txt/t1_biology_rlacroix_202.txt	 12771	 12777	 I-Definition	 T73	 T72	 Direct-Defines
)	 data/source_txt/t1_biology_rlacroix_202.txt	 12777	 12778	 I-Definition	 T73	 T72	 Direct-Defines
,	 data/source_txt/t1_biology_rlacroix_202.txt	 12778	 12779	 I-Definition	 T73	 T72	 Direct-Defines
followed	 data/source_txt/t1_biology_rlacroix_202.txt	 12780	 12788	 I-Definition	 T73	 T72	 Direct-Defines
by	 data/source_txt/t1_biology_rlacroix_202.txt	 12789	 12791	 I-Definition	 T73	 T72	 Direct-Defines
a	 data/source_txt/t1_biology_rlacroix_202.txt	 12792	 12793	 I-Definition	 T73	 T72	 Direct-Defines
later	 data/source_txt/t1_biology_rlacroix_202.txt	 12794	 12799	 I-Definition	 T73	 T72	 Direct-Defines
fusion	 data/source_txt/t1_biology_rlacroix_202.txt	 12800	 12806	 I-Definition	 T73	 T72	 Direct-Defines
of	 data/source_txt/t1_biology_rlacroix_202.txt	 12807	 12809	 I-Definition	 T73	 T72	 Direct-Defines
the	 data/source_txt/t1_biology_rlacroix_202.txt	 12810	 12813	 I-Definition	 T73	 T72	 Direct-Defines
new	 data/source_txt/t1_biology_rlacroix_202.txt	 12814	 12817	 I-Definition	 T73	 T72	 Direct-Defines
eukaryote	 data/source_txt/t1_biology_rlacroix_202.txt	 12818	 12827	 I-Definition	 T73	 T72	 Direct-Defines
with	 data/source_txt/t1_biology_rlacroix_202.txt	 12828	 12832	 I-Definition	 T73	 T72	 Direct-Defines
bacteria	 data/source_txt/t1_biology_rlacroix_202.txt	 12833	 12841	 I-Definition	 T73	 T72	 Direct-Defines
that	 data/source_txt/t1_biology_rlacroix_202.txt	 12842	 12846	 I-Definition	 T73	 T72	 Direct-Defines
became	 data/source_txt/t1_biology_rlacroix_202.txt	 12847	 12853	 I-Definition	 T73	 T72	 Direct-Defines
mitochondria	 data/source_txt/t1_biology_rlacroix_202.txt	 12854	 12866	 I-Definition	 T73	 T72	 Direct-Defines
.	 data/source_txt/t1_biology_rlacroix_202.txt	 12866	 12867	 O	 -1	 -1	 0

Lines 2322-2345. Error in line 2337.
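Both errors above are the token `link]a`, where the closing bracket is fused with its neighbours. A rough way to find similar cases is sketched below. The heuristic and the function name are my own assumptions, not part of the corpus tooling: a token is flagged when a bracket is glued to at least one other character.

```python
import re

# Heuristic (an assumption, not part of the corpus tooling): a bracket or
# parenthesis glued to another non-space character inside one token.
FUSED_BRACKET = re.compile(r"\S[\[\]()]|[\[\]()]\S")


def fused_bracket_tokens(lines):
    """Yield (1-based line index, token) for tokens with a fused bracket."""
    for index, line in enumerate(lines, start=1):
        # The token is the first tab-separated column of a .deft line.
        token = line.split("\t", 1)[0].strip()
        if token and FUSED_BRACKET.search(token):
            yield index, token


sample = [
    "in\t data/source_txt/example.txt\t 13115\t 13117\t O\t -1\t -1\t 0",
    "[\t data/source_txt/example.txt\t 13118\t 13119\t O\t -1\t -1\t 0",
    "link]a\t data/source_txt/example.txt\t 13119\t 13125\t O\t -1\t -1\t 0",
    ",\t data/source_txt/example.txt\t 13125\t 13126\t O\t -1\t -1\t 0",
]
flagged = list(fused_bracket_tokens(sample))
print(flagged)
```

A bracket that forms a token on its own is not flagged, so correctly split brackets pass through.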

[TOKENIZATION] #6

Update: the examples were originally taken from old data; they are now taken from the current repository data.

Filepath

train/t7_government_2_202.deft

Content

Someone	 data/source_txt/t7_government_rlacroix_202.txt	 23368	 23375	 O	 -1	 -1	 0
concerned	 data/source_txt/t7_government_rlacroix_202.txt	 23376	 23385	 O	 -1	 -1	 0
about	 data/source_txt/t7_government_rlacroix_202.txt	 23386	 23391	 O	 -1	 -1	 0
protecting	 data/source_txt/t7_government_rlacroix_202.txt	 23392	 23402	 O	 -1	 -1	 0
individual	 data/source_txt/t7_government_rlacroix_202.txt	 23403	 23413	 O	 -1	 -1	 0
rights	 data/source_txt/t7_government_rlacroix_202.txt	 23414	 23420	 O	 -1	 -1	 0
might	 data/source_txt/t7_government_rlacroix_202.txt	 23421	 23426	 O	 -1	 -1	 0
join	 data/source_txt/t7_government_rlacroix_202.txt	 23427	 23431	 O	 -1	 -1	 0
a	 data/source_txt/t7_government_rlacroix_202.txt	 23432	 23433	 O	 -1	 -1	 0
group	 data/source_txt/t7_government_rlacroix_202.txt	 23434	 23439	 O	 -1	 -1	 0
like	 data/source_txt/t7_government_rlacroix_202.txt	 23440	 23444	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 23445	 23448	 B-Term	 T36	 0	 AKA
American	 data/source_txt/t7_government_rlacroix_202.txt	 23449	 23457	 I-Term	 T36	 0	 AKA
Civil	 data/source_txt/t7_government_rlacroix_202.txt	 23458	 23463	 I-Term	 T36	 0	 AKA
Liberties	 data/source_txt/t7_government_rlacroix_202.txt	 23464	 23473	 I-Term	 T36	 0	 AKA
Union	 data/source_txt/t7_government_rlacroix_202.txt	 23474	 23479	 I-Term	 T36	 0	 AKA
(	 data/source_txt/t7_government_rlacroix_202.txt	 23480	 23481	 O	 -1	 -1	 0
ACLU	 data/source_txt/t7_government_rlacroix_202.txt	 23481	 23485	 B-Alias-Term	 T37	 T36	 AKA
)	 data/source_txt/t7_government_rlacroix_202.txt	 23485	 23486	 O	 -1	 -1	 0
because	 data/source_txt/t7_government_rlacroix_202.txt	 23487	 23494	 O	 -1	 -1	 0
it	 data/source_txt/t7_government_rlacroix_202.txt	 23495	 23497	 O	 -1	 -1	 0
supports	 data/source_txt/t7_government_rlacroix_202.txt	 23498	 23506	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 23507	 23510	 O	 -1	 -1	 0
liberties	 data/source_txt/t7_government_rlacroix_202.txt	 23511	 23520	 O	 -1	 -1	 0
guaranteed	 data/source_txt/t7_government_rlacroix_202.txt	 23521	 23531	 O	 -1	 -1	 0
in	 data/source_txt/t7_government_rlacroix_202.txt	 23532	 23534	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 23535	 23538	 O	 -1	 -1	 0
U.S.	 data/source_txt/t7_government_rlacroix_202.txt	 23539	 23543	 O	 -1	 -1	 0
Constitution	 data/source_txt/t7_government_rlacroix_202.txt	 23544	 23556	 O	 -1	 -1	 0
,	 data/source_txt/t7_government_rlacroix_202.txt	 23556	 23557	 O	 -1	 -1	 0
even	 data/source_txt/t7_government_rlacroix_202.txt	 23558	 23562	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 23563	 23566	 O	 -1	 -1	 0
free	 data/source_txt/t7_government_rlacroix_202.txt	 23567	 23571	 O	 -1	 -1	 0
expression	 data/source_txt/t7_government_rlacroix_202.txt	 23572	 23582	 O	 -1	 -1	 0
of	 data/source_txt/t7_government_rlacroix_202.txt	 23583	 23585	 O	 -1	 -1	 0
unpopular	 data/source_txt/t7_government_rlacroix_202.txt	 23586	 23595	 O	 -1	 -1	 0
views.https://www.aclu.org/	 data/source_txt/t7_government_rlacroix_202.txt	 23596	 23623	 O	 -1	 -1	 0
(	 data/source_txt/t7_government_rlacroix_202.txt	 23624	 23625	 O	 -1	 -1	 0
March	 data/source_txt/t7_government_rlacroix_202.txt	 23625	 23630	 O	 -1	 -1	 0
1	 data/source_txt/t7_government_rlacroix_202.txt	 23631	 23632	 O	 -1	 -1	 0
,	 data/source_txt/t7_government_rlacroix_202.txt	 23632	 23633	 O	 -1	 -1	 0
2016	 data/source_txt/t7_government_rlacroix_202.txt	 23634	 23638	 O	 -1	 -1	 0
)	 data/source_txt/t7_government_rlacroix_202.txt	 23638	 23639	 O	 -1	 -1	 0
.	 data/source_txt/t7_government_rlacroix_202.txt	 23639	 23640	 O	 -1	 -1	 0

Lines 3908-3951. Error in line 3944.

The	 data/source_txt/t7_government_rlacroix_202.txt	 47189	 47192	 O	 -1	 -1	 0
Republican	 data/source_txt/t7_government_rlacroix_202.txt	 47193	 47203	 O	 -1	 -1	 0
Senate	 data/source_txt/t7_government_rlacroix_202.txt	 47204	 47210	 O	 -1	 -1	 0
and	 data/source_txt/t7_government_rlacroix_202.txt	 47211	 47214	 O	 -1	 -1	 0
Judiciary	 data/source_txt/t7_government_rlacroix_202.txt	 47215	 47224	 O	 -1	 -1	 0
Committee	 data/source_txt/t7_government_rlacroix_202.txt	 47225	 47234	 O	 -1	 -1	 0
will	 data/source_txt/t7_government_rlacroix_202.txt	 47235	 47239	 O	 -1	 -1	 0
welcome	 data/source_txt/t7_government_rlacroix_202.txt	 47240	 47247	 O	 -1	 -1	 0
a	 data/source_txt/t7_government_rlacroix_202.txt	 47248	 47249	 O	 -1	 -1	 0
Trump	 data/source_txt/t7_government_rlacroix_202.txt	 47250	 47255	 O	 -1	 -1	 0
nominee	 data/source_txt/t7_government_rlacroix_202.txt	 47256	 47263	 O	 -1	 -1	 0
in	 data/source_txt/t7_government_rlacroix_202.txt	 47264	 47266	 O	 -1	 -1	 0
early	 data/source_txt/t7_government_rlacroix_202.txt	 47267	 47272	 O	 -1	 -1	 0
2017.Other	 data/source_txt/t7_government_rlacroix_202.txt	 47273	 47283	 O	 -1	 -1	 0
presidential	 data/source_txt/t7_government_rlacroix_202.txt	 47284	 47296	 O	 -1	 -1	 0
selections	 data/source_txt/t7_government_rlacroix_202.txt	 47297	 47307	 O	 -1	 -1	 0
are	 data/source_txt/t7_government_rlacroix_202.txt	 47308	 47311	 O	 -1	 -1	 0
not	 data/source_txt/t7_government_rlacroix_202.txt	 47312	 47315	 O	 -1	 -1	 0
subject	 data/source_txt/t7_government_rlacroix_202.txt	 47316	 47323	 O	 -1	 -1	 0
to	 data/source_txt/t7_government_rlacroix_202.txt	 47324	 47326	 O	 -1	 -1	 0
Senate	 data/source_txt/t7_government_rlacroix_202.txt	 47327	 47333	 O	 -1	 -1	 0
approval	 data/source_txt/t7_government_rlacroix_202.txt	 47334	 47342	 O	 -1	 -1	 0
,	 data/source_txt/t7_government_rlacroix_202.txt	 47342	 47343	 O	 -1	 -1	 0
including	 data/source_txt/t7_government_rlacroix_202.txt	 47344	 47353	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 47354	 47357	 O	 -1	 -1	 0
president	 data/source_txt/t7_government_rlacroix_202.txt	 47358	 47367	 O	 -1	 -1	 0
’s	 data/source_txt/t7_government_rlacroix_202.txt	 47367	 47369	 O	 -1	 -1	 0
personal	 data/source_txt/t7_government_rlacroix_202.txt	 47370	 47378	 O	 -1	 -1	 0
staff	 data/source_txt/t7_government_rlacroix_202.txt	 47379	 47384	 O	 -1	 -1	 0
(	 data/source_txt/t7_government_rlacroix_202.txt	 47385	 47386	 O	 -1	 -1	 0
whose	 data/source_txt/t7_government_rlacroix_202.txt	 47386	 47391	 O	 -1	 -1	 0
most	 data/source_txt/t7_government_rlacroix_202.txt	 47392	 47396	 O	 -1	 -1	 0
important	 data/source_txt/t7_government_rlacroix_202.txt	 47397	 47406	 O	 -1	 -1	 0
member	 data/source_txt/t7_government_rlacroix_202.txt	 47407	 47413	 O	 -1	 -1	 0
is	 data/source_txt/t7_government_rlacroix_202.txt	 47414	 47416	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 47417	 47420	 O	 -1	 -1	 0
White	 data/source_txt/t7_government_rlacroix_202.txt	 47421	 47426	 O	 -1	 -1	 0
House	 data/source_txt/t7_government_rlacroix_202.txt	 47427	 47432	 O	 -1	 -1	 0
chief	 data/source_txt/t7_government_rlacroix_202.txt	 47433	 47438	 O	 -1	 -1	 0
of	 data/source_txt/t7_government_rlacroix_202.txt	 47439	 47441	 O	 -1	 -1	 0
staff	 data/source_txt/t7_government_rlacroix_202.txt	 47442	 47447	 O	 -1	 -1	 0
)	 data/source_txt/t7_government_rlacroix_202.txt	 47447	 47448	 O	 -1	 -1	 0
and	 data/source_txt/t7_government_rlacroix_202.txt	 47449	 47452	 O	 -1	 -1	 0
various	 data/source_txt/t7_government_rlacroix_202.txt	 47453	 47460	 O	 -1	 -1	 0
advisers	 data/source_txt/t7_government_rlacroix_202.txt	 47461	 47469	 O	 -1	 -1	 0
(	 data/source_txt/t7_government_rlacroix_202.txt	 47470	 47471	 O	 -1	 -1	 0
most	 data/source_txt/t7_government_rlacroix_202.txt	 47471	 47475	 O	 -1	 -1	 0
notably	 data/source_txt/t7_government_rlacroix_202.txt	 47476	 47483	 O	 -1	 -1	 0
the	 data/source_txt/t7_government_rlacroix_202.txt	 47484	 47487	 O	 -1	 -1	 0
national	 data/source_txt/t7_government_rlacroix_202.txt	 47488	 47496	 O	 -1	 -1	 0
security	 data/source_txt/t7_government_rlacroix_202.txt	 47497	 47505	 O	 -1	 -1	 0
adviser	 data/source_txt/t7_government_rlacroix_202.txt	 47506	 47513	 O	 -1	 -1	 0
)	 data/source_txt/t7_government_rlacroix_202.txt	 47513	 47514	 O	 -1	 -1	 0
.	 data/source_txt/t7_government_rlacroix_202.txt	 47514	 47515	 O	 -1	 -1	 0

Lines 8260-8313. Error in line 8273.
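The two errors here, `views.https://www.aclu.org/` and `2017.Other`, are sentence-final periods fused with the following text. A possible heuristic check is sketched below; the pattern is an assumption of mine and will also flag genuine URLs or abbreviations such as "e.g.", so treat its output as candidates for manual review only.

```python
import re

# Heuristic (an assumption): a lowercase letter or digit, then a period, then
# a letter, all inside one token. "U.S." is not matched because the character
# before each period is uppercase.
FUSED_PERIOD = re.compile(r"[a-z0-9]\.[A-Za-z]")


def fused_period_tokens(tokens):
    """Return the tokens that look like two sentences glued at a period."""
    return [token for token in tokens if FUSED_PERIOD.search(token)]


tokens = ["unpopular", "views.https://www.aclu.org/", "U.S.", "2017.Other", "."]
suspects = fused_period_tokens(tokens)
print(suspects)
```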

Add documentation mapping IOB tags to the DEFT paper's Tables 2 and 3

It would be useful to have documentation mapping the IOB tags to Tables 2 and 3 of the DEFT paper. For example, the line

DNA /Users/sspala/dev/definition_extraction/textbook_sentences/adjudication_files_082219_FINAL/ksun/biology/t1_biology_jlee_0.txt 17742 17745 B-Definiti-frag T123-frag T123 fragment
has the IOB tag B-Definiti-frag, which is not obviously linked to any entry in the DEFT paper's Table 2.

[TOKENIZATION] #9

Update: the examples were originally taken from old data; they are now taken from the current repository data.

Filepath

train/t4_psychology_1_0.deft

Content

In	 data/source_txt/t4_psychology_mkaplan_0.txt	 31686	 31688	 O	 -1	 -1	 0
central	 data/source_txt/t4_psychology_mkaplan_0.txt	 31689	 31696	 B-Term	 T189	 0	 Direct-Defines
sleep	 data/source_txt/t4_psychology_mkaplan_0.txt	 31697	 31702	 I-Term	 T189	 0	 Direct-Defines
apnea	 data/source_txt/t4_psychology_mkaplan_0.txt	 31703	 31708	 I-Term	 T189	 0	 Direct-Defines
,	 data/source_txt/t4_psychology_mkaplan_0.txt	 31708	 31709	 O	 -1	 -1	 0
disruption	 data/source_txt/t4_psychology_mkaplan_0.txt	 31710	 31720	 B-Definition	 T190	 T189	 Direct-Defines
in	 data/source_txt/t4_psychology_mkaplan_0.txt	 31721	 31723	 I-Definition	 T190	 T189	 Direct-Defines
signals	 data/source_txt/t4_psychology_mkaplan_0.txt	 31724	 31731	 I-Definition	 T190	 T189	 Direct-Defines
sent	 data/source_txt/t4_psychology_mkaplan_0.txt	 31732	 31736	 I-Definition	 T190	 T189	 Direct-Defines
from	 data/source_txt/t4_psychology_mkaplan_0.txt	 31737	 31741	 I-Definition	 T190	 T189	 Direct-Defines
the	 data/source_txt/t4_psychology_mkaplan_0.txt	 31742	 31745	 I-Definition	 T190	 T189	 Direct-Defines
brain	 data/source_txt/t4_psychology_mkaplan_0.txt	 31746	 31751	 I-Definition	 T190	 T189	 Direct-Defines
that	 data/source_txt/t4_psychology_mkaplan_0.txt	 31752	 31756	 I-Definition	 T190	 T189	 Direct-Defines
regulate	 data/source_txt/t4_psychology_mkaplan_0.txt	 31757	 31765	 I-Definition	 T190	 T189	 Direct-Defines
breathing	 data/source_txt/t4_psychology_mkaplan_0.txt	 31766	 31775	 I-Definition	 T190	 T189	 Direct-Defines
cause	 data/source_txt/t4_psychology_mkaplan_0.txt	 31776	 31781	 I-Definition	 T190	 T189	 Direct-Defines
periods	 data/source_txt/t4_psychology_mkaplan_0.txt	 31782	 31789	 I-Definition	 T190	 T189	 Direct-Defines
of	 data/source_txt/t4_psychology_mkaplan_0.txt	 31790	 31792	 I-Definition	 T190	 T189	 Direct-Defines
interrupted	 data/source_txt/t4_psychology_mkaplan_0.txt	 31793	 31804	 I-Definition	 T190	 T189	 Direct-Defines
breathing	 data/source_txt/t4_psychology_mkaplan_0.txt	 31805	 31814	 I-Definition	 T190	 T189	 Direct-Defines
(	 data/source_txt/t4_psychology_mkaplan_0.txt	 31815	 31816	 O	 -1	 -1	 0
White	 data/source_txt/t4_psychology_mkaplan_0.txt	 31816	 31821	 O	 -1	 -1	 0
,	 data/source_txt/t4_psychology_mkaplan_0.txt	 31821	 31822	 O	 -1	 -1	 0
2005)	 data/source_txt/t4_psychology_mkaplan_0.txt	 31823	 31828	 O	 -1	 -1	 0
.	 data/source_txt/t4_psychology_mkaplan_0.txt	 31828	 31829	 O	 -1	 -1	 0

One	 data/source_txt/t4_psychology_mkaplan_0.txt	 31829	 31832	 O	 -1	 -1	 0
of	 data/source_txt/t4_psychology_mkaplan_0.txt	 31833	 31835	 O	 -1	 -1	 0
the	 data/source_txt/t4_psychology_mkaplan_0.txt	 31836	 31839	 O	 -1	 -1	 0
most	 data/source_txt/t4_psychology_mkaplan_0.txt	 31840	 31844	 O	 -1	 -1	 0
common	 data/source_txt/t4_psychology_mkaplan_0.txt	 31845	 31851	 O	 -1	 -1	 0
treatments	 data/source_txt/t4_psychology_mkaplan_0.txt	 31852	 31862	 O	 -1	 -1	 0
for	 data/source_txt/t4_psychology_mkaplan_0.txt	 31863	 31866	 O	 -1	 -1	 0
sleep	 data/source_txt/t4_psychology_mkaplan_0.txt	 31867	 31872	 O	 -1	 -1	 0
apnea	 data/source_txt/t4_psychology_mkaplan_0.txt	 31873	 31878	 O	 -1	 -1	 0
involves	 data/source_txt/t4_psychology_mkaplan_0.txt	 31879	 31887	 O	 -1	 -1	 0
the	 data/source_txt/t4_psychology_mkaplan_0.txt	 31888	 31891	 O	 -1	 -1	 0
use	 data/source_txt/t4_psychology_mkaplan_0.txt	 31892	 31895	 O	 -1	 -1	 0
of	 data/source_txt/t4_psychology_mkaplan_0.txt	 31896	 31898	 O	 -1	 -1	 0
a	 data/source_txt/t4_psychology_mkaplan_0.txt	 31899	 31900	 O	 -1	 -1	 0
special	 data/source_txt/t4_psychology_mkaplan_0.txt	 31901	 31908	 O	 -1	 -1	 0
device	 data/source_txt/t4_psychology_mkaplan_0.txt	 31909	 31915	 O	 -1	 -1	 0
during	 data/source_txt/t4_psychology_mkaplan_0.txt	 31916	 31922	 O	 -1	 -1	 0
sleep	 data/source_txt/t4_psychology_mkaplan_0.txt	 31923	 31928	 O	 -1	 -1	 0
.	 data/source_txt/t4_psychology_mkaplan_0.txt	 31928	 31929	 O	 -1	 -1	 0

Lines 5794-5838. Error in line 5817.

[TOKENIZATION] #10

Update: the examples were originally taken from old data; they are now taken from the current repository data.

Filepath

train/t1_biology_2_505.deft

Content

As	 data/source_txt/t1_biology_rlacroix_505.txt	 23587	 23589	 O	 -1	 -1	 0
illustrated	 data/source_txt/t1_biology_rlacroix_505.txt	 23590	 23601	 O	 -1	 -1	 0
in	 data/source_txt/t1_biology_rlacroix_505.txt	 23602	 23604	 O	 -1	 -1	 0
[	 data/source_txt/t1_biology_rlacroix_505.txt	 23605	 23606	 O	 -1	 -1	 0
link]a	 data/source_txt/t1_biology_rlacroix_505.txt	 23606	 23612	 O	 -1	 -1	 0
Fish	 data/source_txt/t1_biology_rlacroix_505.txt	 23613	 23617	 O	 -1	 -1	 0
have	 data/source_txt/t1_biology_rlacroix_505.txt	 23618	 23622	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_rlacroix_505.txt	 23623	 23624	 O	 -1	 -1	 0
single	 data/source_txt/t1_biology_rlacroix_505.txt	 23625	 23631	 O	 -1	 -1	 0
circuit	 data/source_txt/t1_biology_rlacroix_505.txt	 23632	 23639	 O	 -1	 -1	 0
for	 data/source_txt/t1_biology_rlacroix_505.txt	 23640	 23643	 O	 -1	 -1	 0
blood	 data/source_txt/t1_biology_rlacroix_505.txt	 23644	 23649	 O	 -1	 -1	 0
flow	 data/source_txt/t1_biology_rlacroix_505.txt	 23650	 23654	 O	 -1	 -1	 0
and	 data/source_txt/t1_biology_rlacroix_505.txt	 23655	 23658	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_rlacroix_505.txt	 23659	 23660	 O	 -1	 -1	 0
two	 data/source_txt/t1_biology_rlacroix_505.txt	 23661	 23664	 O	 -1	 -1	 0
-	 data/source_txt/t1_biology_rlacroix_505.txt	 23664	 23665	 O	 -1	 -1	 0
chambered	 data/source_txt/t1_biology_rlacroix_505.txt	 23665	 23674	 O	 -1	 -1	 0
heart	 data/source_txt/t1_biology_rlacroix_505.txt	 23675	 23680	 O	 -1	 -1	 0
that	 data/source_txt/t1_biology_rlacroix_505.txt	 23681	 23685	 O	 -1	 -1	 0
has	 data/source_txt/t1_biology_rlacroix_505.txt	 23686	 23689	 O	 -1	 -1	 0
only	 data/source_txt/t1_biology_rlacroix_505.txt	 23690	 23694	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_rlacroix_505.txt	 23695	 23696	 O	 -1	 -1	 0
single	 data/source_txt/t1_biology_rlacroix_505.txt	 23697	 23703	 O	 -1	 -1	 0
atrium	 data/source_txt/t1_biology_rlacroix_505.txt	 23704	 23710	 O	 -1	 -1	 0
and	 data/source_txt/t1_biology_rlacroix_505.txt	 23711	 23714	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_rlacroix_505.txt	 23715	 23716	 O	 -1	 -1	 0
single	 data/source_txt/t1_biology_rlacroix_505.txt	 23717	 23723	 O	 -1	 -1	 0
ventricle	 data/source_txt/t1_biology_rlacroix_505.txt	 23724	 23733	 O	 -1	 -1	 0
.	 data/source_txt/t1_biology_rlacroix_505.txt	 23733	 23734	 O	 -1	 -1	 0

Lines 4248-4277. Error in line 4252.

Specify the list of the evaluated classes in the evaluation documentation

https://github.com/adobe-research/deft_corpus/blob/be215fecfa4c51e88498e8572e0c29bdb2246c3a/evaluation/README.md says:

Subtask 2: Sequence labeling. We will report P/R/F1 for each evaluated class, as well as macro- and micro-averaged F1 for the evaluated classes. The official score will be based on the macro-averaged F1 of the evaluated classes.

Subtask 3: Relation extraction. We will report P/R/F1 for each evaluated relation, as well as macro- and micro-averaged F1 for the evaluated relations. The official score will be based on the macro-averaged F1 of the evaluated relations.

We should specify the list of evaluated classes and relations. Are they all the classes in Tables 2 and 3 of https://sigann.github.io/LAW-XIII-2019/pdf/W19-4015.pdf?
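To show why the list of evaluated classes matters, here is a small sketch of macro- versus micro-averaged F1 computed from per-class counts. It uses the common definitions (macro = unweighted mean of per-class F1, micro = F1 over pooled counts); whether the official scorer follows exactly these conventions is an assumption, and the class names and counts below are invented for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_micro_f1(counts):
    """counts maps class name -> (tp, fp, fn); returns (macro_f1, micro_f1)."""
    per_class_f1 = [prf(*c)[2] for c in counts.values()]
    # Macro: every evaluated class contributes equally, however rare it is.
    macro = sum(per_class_f1) / len(per_class_f1)
    # Micro: pool the counts over all evaluated classes, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    return macro, micro


# Hypothetical counts, not taken from any real system output.
counts = {"Term": (8, 2, 2), "Definition": (1, 1, 3)}
macro, micro = macro_micro_f1(counts)
print(round(macro, 4), round(micro, 4))
```

Adding or removing a class from `counts` changes the macro average directly, which is why the official score depends on exactly which classes are evaluated.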

Data file formatting inconsistencies

While attempting to parse the corpus, I ran into a number of inconsistencies in terms of how the context windows are separated, etc. Here's the output of my code:

Training set

Extra sentence on line 4617 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_101.deft
Malformed context window separator on line 4877 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_101.deft
Malformed context window separator on line 5227 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_101.deft
Malformed context window separator on line 5322 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_101.deft
Extra sentence on line 3110 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_202.deft
Potential missing line-break on line 110 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_404.deft
Extra sentence on line 191 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_0_404.deft
Extra sentence on line 4818 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_101.deft
Potential missing line-break on line 2075 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_202.deft
Extra sentence on line 2174 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_202.deft
Suspiciously short sentence on line 4346 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_202.deft
Malformed context window separator on line 4352 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_202.deft
Potential missing line-break on line 134 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_303.deft
Extra sentence on line 213 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_303.deft
Suspiciously short sentence on line 629 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_1_606.deft
Malformed context window separator on line 1804 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_0.deft
Malformed context window separator on line 4688 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_101.deft
Malformed context window separator on line 5471 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_101.deft
Extra sentence on line 4113 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_303.deft
Suspiciously short sentence on line 4110 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_303.deft
Potential missing line-break on line 556 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_606.deft
Extra sentence on line 644 in file ..\..\deft_corpus\data\deft_files\train\t1_biology_2_606.deft
Potential missing line-break on line 251 in file ..\..\deft_corpus\data\deft_files\train\t2_history_1_0.deft
Malformed context window separator on line 6017 in file ..\..\deft_corpus\data\deft_files\train\t2_history_1_101.deft
Potential missing line-break on line 171 in file ..\..\deft_corpus\data\deft_files\train\t2_history_2_0.deft
Extra sentence on line 262 in file ..\..\deft_corpus\data\deft_files\train\t2_history_2_0.deft
Malformed context window separator on line 7322 in file ..\..\deft_corpus\data\deft_files\train\t2_history_2_0.deft
Extra sentence on line 2959 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_0.deft
Extra sentence on line 3769 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_0.deft
Extra sentence on line 5945 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_0.deft
Extra sentence on line 643 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_101.deft
Potential missing line-break on line 1033 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_101.deft
Extra sentence on line 1566 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_101.deft
Malformed context window separator on line 1935 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_101.deft
Malformed context window separator on line 4028 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_101.deft
Extra sentence on line 119 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_202.deft
Extra sentence on line 546 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_0_202.deft
Extra sentence on line 1336 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_0.deft
Malformed context window separator on line 3600 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_0.deft
Extra sentence on line 5856 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_0.deft
Potential missing line-break on line 370 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Extra sentence on line 1755 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Suspiciously short sentence on line 1818 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Malformed context window separator on line 2210 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Suspiciously short sentence on line 2250 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Malformed context window separator on line 3674 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Extra sentence on line 4650 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Suspiciously short sentence on line 4756 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Extra sentence on line 4852 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Suspiciously short sentence on line 5335 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_101.deft
Extra sentence on line 529 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_202.deft
Extra sentence on line 999 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_1_202.deft
Extra sentence on line 1456 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_0.deft
Extra sentence on line 1540 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_0.deft
Malformed context window separator on line 2374 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_0.deft
Malformed context window separator on line 3849 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_0.deft
Extra sentence on line 835 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_101.deft
Malformed context window separator on line 4355 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_101.deft
Extra sentence on line 4759 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_101.deft
Extra sentence on line 740 in file ..\..\deft_corpus\data\deft_files\train\t3_physics_2_202.deft
Extra sentence on line 101 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_0.deft
Extra sentence on line 1153 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_0.deft
Potential missing line-break on line 440 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_101.deft
Extra sentence on line 507 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_101.deft
Potential missing line-break on line 645 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_202.deft
Extra sentence on line 702 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_202.deft
Potential missing line-break on line 345 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_303.deft
Extra sentence on line 451 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_0_303.deft
Extra sentence on line 254 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_0.deft
Potential missing line-break on line 789 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_0.deft
Suspiciously short sentence on line 2442 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_0.deft
Extra sentence on line 3449 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_0.deft
Suspiciously short sentence on line 1136 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_101.deft
Suspiciously short sentence on line 5649 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_101.deft
Malformed context window separator on line 446 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_1_303.deft
Extra sentence on line 2644 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_2_101.deft
Extra sentence on line 2669 in file ..\..\deft_corpus\data\deft_files\train\t4_psychology_2_202.deft
Potential missing line-break on line 56 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_0_101.deft
Malformed context window separator on line 3823 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_0_101.deft
Malformed context window separator on line 3 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_0_202.deft
Extra sentence on line 5649 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_0_202.deft
Suspiciously short sentence on line 1604 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_1_0.deft
Potential missing line-break on line 4976 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_1_101.deft
Extra sentence on line 2050 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_1_202.deft
Malformed context window separator on line 2263 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_1_202.deft
Extra sentence on line 2798 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_2_0.deft
Extra sentence on line 175 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_2_101.deft
Extra sentence on line 3742 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_2_101.deft
Extra sentence on line 343 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_2_202.deft
Extra sentence on line 1682 in file ..\..\deft_corpus\data\deft_files\train\t5_economic_2_202.deft
Potential missing line-break on line 983 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_0_0.deft
Extra sentence on line 1091 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_0_0.deft
Malformed context window separator on line 1299 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_0_0.deft
Malformed context window separator on line 2425 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_0_0.deft
Malformed context window separator on line 391 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_1_0.deft
Extra sentence on line 811 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_1_0.deft
Malformed context window separator on line 2300 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_1_0.deft
Extra sentence on line 110 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_1_101.deft
Suspiciously short sentence on line 5941 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_2_0.deft
Malformed context window separator on line 4347 in file ..\..\deft_corpus\data\deft_files\train\t6_sociology_2_101.deft
Potential missing line-break on line 470 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 534 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 1124 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Potential missing line-break on line 3057 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 3566 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 4029 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Potential missing line-break on line 4181 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Potential missing line-break on line 4775 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 5387 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Potential missing line-break on line 5473 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 6515 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 7021 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_0.deft
Extra sentence on line 957 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Suspiciously short sentence on line 955 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 1312 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 1735 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 1772 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 2127 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 2259 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 2430 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 2484 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 2522 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 2550 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 2691 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 3022 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 3390 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 3677 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 3761 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 3834 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 4235 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 4337 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Malformed context window separator on line 4389 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 4589 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 4661 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 4716 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 4780 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 4828 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Potential missing line-break on line 5216 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 5384 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 5481 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 5505 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 5544 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 5737 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 6321 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 6667 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 7738 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 7796 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_101.deft
Extra sentence on line 108 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 314 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Suspiciously short sentence on line 836 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 895 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 1176 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 1434 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 1868 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 1905 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 2825 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 3085 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 3128 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 3452 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 3656 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 3754 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 3785 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 4389 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 4586 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 4672 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 5418 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Suspiciously short sentence on line 5498 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 6113 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Potential missing line-break on line 6593 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 6731 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 7265 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 7566 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 7592 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_202.deft
Extra sentence on line 674 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 2627 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 3097 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 3641 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 3867 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 3956 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Potential missing line-break on line 4290 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_303.deft
Extra sentence on line 223 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_404.deft
Extra sentence on line 1546 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_404.deft
Extra sentence on line 1786 in file ..\..\deft_corpus\data\deft_files\train\t7_government_0_404.deft
Extra sentence on line 669 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 863 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Potential missing line-break on line 914 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Suspiciously short sentence on line 2193 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Potential missing line-break on line 2754 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Potential missing line-break on line 3177 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 3773 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Suspiciously short sentence on line 4868 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 5227 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Potential missing line-break on line 5318 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 5551 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 6378 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 6889 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 7652 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 7801 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 8062 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 8106 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_0.deft
Extra sentence on line 263 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 417 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Malformed context window separator on line 512 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 1404 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 1469 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 1525 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 1966 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 2649 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 3780 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 3805 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 4050 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 6265 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Potential missing line-break on line 6384 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 6474 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Potential missing line-break on line 6548 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Potential missing line-break on line 6891 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 7107 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 7602 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 7752 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 7778 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 7913 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 8082 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 8110 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 8160 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 9342 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_101.deft
Extra sentence on line 506 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1025 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1068 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1372 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1561 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1598 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 2510 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 2959 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Potential missing line-break on line 3204 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 3479 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 3800 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 4025 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 4249 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 4713 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Potential missing line-break on line 4772 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 4999 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 5027 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 6853 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 7425 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Potential missing line-break on line 7934 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 8147 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Potential missing line-break on line 8327 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 8732 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_202.deft
Extra sentence on line 1325 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 1637 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 2225 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 2701 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 3027 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 3036 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 3049 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 4627 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Potential missing line-break on line 5160 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 5489 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 5632 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 6227 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 7134 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Suspiciously short sentence on line 7170 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Extra sentence on line 7581 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_303.deft
Malformed context window separator on line 60 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Extra sentence on line 276 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Extra sentence on line 304 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Potential missing line-break on line 1292 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Potential missing line-break on line 1349 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Extra sentence on line 1697 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Potential missing line-break on line 1827 in file ..\..\deft_corpus\data\deft_files\train\t7_government_1_404.deft
Extra sentence on line 809 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Potential missing line-break on line 2091 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 3422 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 3729 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 3862 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 3901 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 4276 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 4642 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 4841 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 4987 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Potential missing line-break on line 6109 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 6333 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 6366 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 6391 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Potential missing line-break on line 6807 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 7020 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 8427 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 8860 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 8894 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_0.deft
Extra sentence on line 240 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 362 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 403 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 1247 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 2004 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 3496 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Potential missing line-break on line 3580 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 3898 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Potential missing line-break on line 4079 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Potential missing line-break on line 5286 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 5362 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 5870 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 5891 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 6158 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 6906 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 6927 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 7156 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_101.deft
Extra sentence on line 223 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 252 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 441 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 792 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 828 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1036 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1294 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 1365 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1472 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1492 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1978 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1981 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Suspiciously short sentence on line 1978 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 2942 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 3363 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 3718 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 4227 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 4449 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 5209 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 5272 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 5463 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 5967 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 6147 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 6207 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 6689 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 7132 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 7537 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Potential missing line-break on line 7621 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_202.deft
Extra sentence on line 1066 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 1255 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 1852 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Potential missing line-break on line 2299 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Potential missing line-break on line 2334 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Malformed context window separator on line 3512 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 3882 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 4087 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 4684 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Potential missing line-break on line 5191 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 5322 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Potential missing line-break on line 5496 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Potential missing line-break on line 6558 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 6790 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 7237 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_303.deft
Extra sentence on line 266 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_404.deft
Potential missing line-break on line 1504 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_404.deft
Extra sentence on line 1567 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_404.deft
Extra sentence on line 1806 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_404.deft
Extra sentence on line 2095 in file ..\..\deft_corpus\data\deft_files\train\t7_government_2_404.deft

Dev set:

Extra sentence on line 260 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_0.deft
Potential missing line-break on line 420 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_0.deft
Extra sentence on line 540 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_0.deft
Potential missing line-break on line 688 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_0.deft
Extra sentence on line 768 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_0.deft
Extra sentence on line 369 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_101.deft
Extra sentence on line 425 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_101.deft
Extra sentence on line 540 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_101.deft
Extra sentence on line 882 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_0_101.deft
Extra sentence on line 128 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_0.deft
Extra sentence on line 146 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_0.deft
Extra sentence on line 164 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_0.deft
Potential missing line-break on line 256 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_0.deft
Extra sentence on line 408 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_0.deft
Suspiciously short sentence on line 415 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_303.deft
Extra sentence on line 458 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_1_303.deft
Extra sentence on line 212 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_101.deft
Potential missing line-break on line 401 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_101.deft
Potential missing line-break on line 281 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_202.deft
Potential missing line-break on line 337 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_202.deft
Potential missing line-break on line 25 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_303.deft
Extra sentence on line 470 in file ..\..\deft_corpus\data\deft_files\dev\t7_government_2_303.deft

Strange sentences in the parsed files for task 1

While working on the first task, I noticed that some sentences consist of fewer than 5 words.
After checking the parsing script, I couldn't really understand the idea behind splitting the CoNLL file into sentences using these regexps.

if re.match('^\s+$', line) and len(new_sentence) > 0 and not re.match(r'^\s*\d+\s*\.$', new_sentence):

One of the strange sentences is line 45 in the parsed t7_government_1_404.deft file:
" 1993 . 7073 ." "0"

From the task's point of view, these sentences aren't definitions. But I am not sure whether this was done on purpose or not.
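
To see how a fragment such as " 1993 . 7073 ." can arise, the quoted condition can be isolated into a small function. This is only a sketch: `is_sentence_boundary` is a hypothetical name, and it assumes `new_sentence` holds the accumulated tokens without trailing whitespace, which may differ from what the actual script does.

```python
import re

def is_sentence_boundary(line, new_sentence):
    """Mirror the quoted condition: a blank line closes the sentence unless
    the accumulated text is only a number followed by a period."""
    return bool(
        re.match(r'^\s+$', line)
        and len(new_sentence) > 0
        and not re.match(r'^\s*\d+\s*\.$', new_sentence)
    )

# A lone "1993 ." is kept open, so the next fragment gets appended to it.
print(is_sentence_boundary("\n", " 1993 ."))
# Once "7073 ." has been appended, the text no longer matches the
# number-plus-period pattern, so the blank line closes the sentence.
print(is_sentence_boundary("\n", " 1993 . 7073 ."))
```

Under these assumptions the first call returns False and the second returns True, which would explain how two numbered fragments end up merged into one very short "sentence".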

Thanks a lot.

Deft files seem to refer to source txt files that don't exist

E.g., the first line of https://github.com/adobe-research/deft_corpus/blob/master/data/deft_files/dev/t1_biology_0_0.deft#L1:

2	 /Users/sspala/dev/definition_extraction/textbook_sentences/adjudication_files_082219_FINAL/ksun/biology/t1_biology_jlee_0.txt	 0	 1	 O	 -1	 -1	 0

-> it refers to the file t1_biology_jlee_0.txt

However, the file names at https://github.com/adobe-research/deft_corpus/tree/master/data/source_txt/dev don't contain the annotator names.
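
A quick way to list every referenced source file that is missing is sketched below. It assumes the source path is the second tab-separated column, as in the sample line above; `referenced_sources`, `missing_sources` and the `repo_root` parameter are hypothetical names used only for illustration.

```python
import os

def referenced_sources(deft_lines):
    """Collect the distinct source paths named in the second column."""
    sources = set()
    for line in deft_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[1].strip():
            sources.add(cols[1].strip())
    return sources

def missing_sources(deft_lines, repo_root="."):
    """Return the referenced paths that do not exist under repo_root."""
    return sorted(
        path for path in referenced_sources(deft_lines)
        if not os.path.exists(os.path.join(repo_root, path))
    )
```

Calling `missing_sources(open(path, encoding="utf-8"))` on each .deft file would then report the paths that cannot be resolved on the local checkout.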

[TOKENIZATION] #5

Update: the examples originally came from old data; they now come from the current repository data.

Filepath

train/t5_economic_0_0.deft

Content

In	 data/source_txt/t5_economic_jlee_0.txt	 9719	 9721	 O	 -1	 -1	 0
this	 data/source_txt/t5_economic_jlee_0.txt	 9722	 9726	 O	 -1	 -1	 0
case	 data/source_txt/t5_economic_jlee_0.txt	 9727	 9731	 O	 -1	 -1	 0
,	 data/source_txt/t5_economic_jlee_0.txt	 9731	 9732	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 9733	 9736	 O	 -1	 -1	 0
addition	 data/source_txt/t5_economic_jlee_0.txt	 9737	 9745	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 9746	 9748	 O	 -1	 -1	 0
still	 data/source_txt/t5_economic_jlee_0.txt	 9749	 9754	 O	 -1	 -1	 0
more	 data/source_txt/t5_economic_jlee_0.txt	 9755	 9759	 O	 -1	 -1	 0
barbers	 data/source_txt/t5_economic_jlee_0.txt	 9760	 9767	 O	 -1	 -1	 0
would	 data/source_txt/t5_economic_jlee_0.txt	 9768	 9773	 O	 -1	 -1	 0
actually	 data/source_txt/t5_economic_jlee_0.txt	 9774	 9782	 O	 -1	 -1	 0
cause	 data/source_txt/t5_economic_jlee_0.txt	 9783	 9788	 O	 -1	 -1	 0
output	 data/source_txt/t5_economic_jlee_0.txt	 9789	 9795	 O	 -1	 -1	 0
to	 data/source_txt/t5_economic_jlee_0.txt	 9796	 9798	 O	 -1	 -1	 0
decrease	 data/source_txt/t5_economic_jlee_0.txt	 9799	 9807	 O	 -1	 -1	 0
,	 data/source_txt/t5_economic_jlee_0.txt	 9807	 9808	 O	 -1	 -1	 0
as	 data/source_txt/t5_economic_jlee_0.txt	 9809	 9811	 O	 -1	 -1	 0
shown	 data/source_txt/t5_economic_jlee_0.txt	 9812	 9817	 O	 -1	 -1	 0
in	 data/source_txt/t5_economic_jlee_0.txt	 9818	 9820	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 9821	 9824	 O	 -1	 -1	 0
last	 data/source_txt/t5_economic_jlee_0.txt	 9825	 9829	 O	 -1	 -1	 0
row	 data/source_txt/t5_economic_jlee_0.txt	 9830	 9833	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 9834	 9836	 O	 -1	 -1	 0
[	 data/source_txt/t5_economic_jlee_0.txt	 9837	 9838	 O	 -1	 -1	 0
link].This	 data/source_txt/t5_economic_jlee_0.txt	 9838	 9848	 O	 -1	 -1	 0
pattern	 data/source_txt/t5_economic_jlee_0.txt	 9849	 9856	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 9857	 9859	 O	 -1	 -1	 0
diminishing	 data/source_txt/t5_economic_jlee_0.txt	 9860	 9871	 O	 -1	 -1	 0
marginal	 data/source_txt/t5_economic_jlee_0.txt	 9872	 9880	 O	 -1	 -1	 0
returns	 data/source_txt/t5_economic_jlee_0.txt	 9881	 9888	 O	 -1	 -1	 0
is	 data/source_txt/t5_economic_jlee_0.txt	 9889	 9891	 O	 -1	 -1	 0
common	 data/source_txt/t5_economic_jlee_0.txt	 9892	 9898	 O	 -1	 -1	 0
in	 data/source_txt/t5_economic_jlee_0.txt	 9899	 9901	 O	 -1	 -1	 0
production	 data/source_txt/t5_economic_jlee_0.txt	 9902	 9912	 O	 -1	 -1	 0
.	 data/source_txt/t5_economic_jlee_0.txt	 9912	 9913	 O	 -1	 -1	 0

Lines 1456-1491. Error in line 1481.

This	 data/source_txt/t5_economic_jlee_0.txt	 10467	 10471	 O	 -1	 -1	 0
pattern	 data/source_txt/t5_economic_jlee_0.txt	 10472	 10479	 O	 -1	 -1	 0
was	 data/source_txt/t5_economic_jlee_0.txt	 10480	 10483	 O	 -1	 -1	 0
illustrated	 data/source_txt/t5_economic_jlee_0.txt	 10484	 10495	 O	 -1	 -1	 0
earlier	 data/source_txt/t5_economic_jlee_0.txt	 10496	 10503	 O	 -1	 -1	 0
in	 data/source_txt/t5_economic_jlee_0.txt	 10504	 10506	 O	 -1	 -1	 0
[	 data/source_txt/t5_economic_jlee_0.txt	 10507	 10508	 O	 -1	 -1	 0
link].In	 data/source_txt/t5_economic_jlee_0.txt	 10508	 10516	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 10517	 10520	 O	 -1	 -1	 0
middle	 data/source_txt/t5_economic_jlee_0.txt	 10521	 10527	 O	 -1	 -1	 0
portion	 data/source_txt/t5_economic_jlee_0.txt	 10528	 10535	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 10536	 10538	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 10539	 10542	 O	 -1	 -1	 0
long	 data/source_txt/t5_economic_jlee_0.txt	 10543	 10547	 O	 -1	 -1	 0
-	 data/source_txt/t5_economic_jlee_0.txt	 10547	 10548	 O	 -1	 -1	 0
run	 data/source_txt/t5_economic_jlee_0.txt	 10548	 10551	 O	 -1	 -1	 0
average	 data/source_txt/t5_economic_jlee_0.txt	 10552	 10559	 O	 -1	 -1	 0
cost	 data/source_txt/t5_economic_jlee_0.txt	 10560	 10564	 O	 -1	 -1	 0
curve	 data/source_txt/t5_economic_jlee_0.txt	 10565	 10570	 O	 -1	 -1	 0
,	 data/source_txt/t5_economic_jlee_0.txt	 10570	 10571	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 10572	 10575	 O	 -1	 -1	 0
flat	 data/source_txt/t5_economic_jlee_0.txt	 10576	 10580	 O	 -1	 -1	 0
portion	 data/source_txt/t5_economic_jlee_0.txt	 10581	 10588	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 10589	 10591	 O	 -1	 -1	 0
the	 data/source_txt/t5_economic_jlee_0.txt	 10592	 10595	 O	 -1	 -1	 0
curve	 data/source_txt/t5_economic_jlee_0.txt	 10596	 10601	 O	 -1	 -1	 0
around	 data/source_txt/t5_economic_jlee_0.txt	 10602	 10608	 O	 -1	 -1	 0
Q3	 data/source_txt/t5_economic_jlee_0.txt	 10609	 10611	 O	 -1	 -1	 0
,	 data/source_txt/t5_economic_jlee_0.txt	 10611	 10612	 O	 -1	 -1	 0
economies	 data/source_txt/t5_economic_jlee_0.txt	 10613	 10622	 O	 -1	 -1	 0
of	 data/source_txt/t5_economic_jlee_0.txt	 10623	 10625	 O	 -1	 -1	 0
scale	 data/source_txt/t5_economic_jlee_0.txt	 10626	 10631	 O	 -1	 -1	 0
have	 data/source_txt/t5_economic_jlee_0.txt	 10632	 10636	 O	 -1	 -1	 0
been	 data/source_txt/t5_economic_jlee_0.txt	 10637	 10641	 O	 -1	 -1	 0
exhausted	 data/source_txt/t5_economic_jlee_0.txt	 10642	 10651	 O	 -1	 -1	 0
.	 data/source_txt/t5_economic_jlee_0.txt	 10651	 10652	 O	 -1	 -1	 0

Lines 1567-1602. Error in line 1574.

[TOKENIZATION] #2

Update: the examples originally came from old data; they now come from the current repository data.

Filepath

train/t1_biology_1_101.deft

Content

Sturtevant	 data/source_txt/t1_biology_mkaplan_101.txt	 20151	 20161	 O	 -1	 -1	 0
divided	 data/source_txt/t1_biology_mkaplan_101.txt	 20162	 20169	 O	 -1	 -1	 0
his	 data/source_txt/t1_biology_mkaplan_101.txt	 20170	 20173	 O	 -1	 -1	 0
genetic	 data/source_txt/t1_biology_mkaplan_101.txt	 20174	 20181	 O	 -1	 -1	 0
map	 data/source_txt/t1_biology_mkaplan_101.txt	 20182	 20185	 O	 -1	 -1	 0
into	 data/source_txt/t1_biology_mkaplan_101.txt	 20186	 20190	 O	 -1	 -1	 0
map	 data/source_txt/t1_biology_mkaplan_101.txt	 20191	 20194	 B-Qualifier	 T151	 T150	 Supplements
units	 data/source_txt/t1_biology_mkaplan_101.txt	 20195	 20200	 I-Qualifier	 T151	 T150	 Supplements
,	 data/source_txt/t1_biology_mkaplan_101.txt	 20200	 20201	 O	 -1	 -1	 0
or	 data/source_txt/t1_biology_mkaplan_101.txt	 20202	 20204	 O	 -1	 -1	 0
centimorgans	 data/source_txt/t1_biology_mkaplan_101.txt	 20205	 20217	 B-Alias-Term	 T148	 T149	 AKA
(	 data/source_txt/t1_biology_mkaplan_101.txt	 20218	 20219	 O	 -1	 -1	 0
cM	 data/source_txt/t1_biology_mkaplan_101.txt	 20219	 20221	 B-Term	 T149	 0	 AKA
)	 data/source_txt/t1_biology_mkaplan_101.txt	 20221	 20222	 O	 -1	 -1	 0
,	 data/source_txt/t1_biology_mkaplan_101.txt	 20222	 20223	 O	 -1	 -1	 0
in	 data/source_txt/t1_biology_mkaplan_101.txt	 20224	 20226	 O	 -1	 -1	 0
which	 data/source_txt/t1_biology_mkaplan_101.txt	 20227	 20232	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_mkaplan_101.txt	 20233	 20234	 B-Definition	 T150	 T149	 Direct-Defines
recombination	 data/source_txt/t1_biology_mkaplan_101.txt	 20235	 20248	 I-Definition	 T150	 T149	 Direct-Defines
frequency	 data/source_txt/t1_biology_mkaplan_101.txt	 20249	 20258	 I-Definition	 T150	 T149	 Direct-Defines
of	 data/source_txt/t1_biology_mkaplan_101.txt	 20259	 20261	 I-Definition	 T150	 T149	 Direct-Defines
0.01	 data/source_txt/t1_biology_mkaplan_101.txt	 20262	 20266	 I-Definition	 T150	 T149	 Direct-Defines
corresponds	 data/source_txt/t1_biology_mkaplan_101.txt	 20267	 20278	 I-Definition	 T150	 T149	 Direct-Defines
to	 data/source_txt/t1_biology_mkaplan_101.txt	 20279	 20281	 I-Definition	 T150	 T149	 Direct-Defines
1	 data/source_txt/t1_biology_mkaplan_101.txt	 20282	 20283	 I-Definition	 T150	 T149	 Direct-Defines
cM.By	 data/source_txt/t1_biology_mkaplan_101.txt	 20284	 20289	 Definition	 T150	 T149	 Direct-Defines
representing	 data/source_txt/t1_biology_mkaplan_101.txt	 20290	 20302	 O	 -1	 -1	 0
alleles	 data/source_txt/t1_biology_mkaplan_101.txt	 20303	 20310	 O	 -1	 -1	 0
in	 data/source_txt/t1_biology_mkaplan_101.txt	 20311	 20313	 O	 -1	 -1	 0
a	 data/source_txt/t1_biology_mkaplan_101.txt	 20314	 20315	 O	 -1	 -1	 0
linear	 data/source_txt/t1_biology_mkaplan_101.txt	 20316	 20322	 O	 -1	 -1	 0
map	 data/source_txt/t1_biology_mkaplan_101.txt	 20323	 20326	 O	 -1	 -1	 0
,	 data/source_txt/t1_biology_mkaplan_101.txt	 20326	 20327	 O	 -1	 -1	 0
Sturtevant	 data/source_txt/t1_biology_mkaplan_101.txt	 20328	 20338	 O	 -1	 -1	 0
suggested	 data/source_txt/t1_biology_mkaplan_101.txt	 20339	 20348	 O	 -1	 -1	 0
that	 data/source_txt/t1_biology_mkaplan_101.txt	 20349	 20353	 O	 -1	 -1	 0
genes	 data/source_txt/t1_biology_mkaplan_101.txt	 20354	 20359	 O	 -1	 -1	 0
can	 data/source_txt/t1_biology_mkaplan_101.txt	 20360	 20363	 O	 -1	 -1	 0
range	 data/source_txt/t1_biology_mkaplan_101.txt	 20364	 20369	 O	 -1	 -1	 0
from	 data/source_txt/t1_biology_mkaplan_101.txt	 20370	 20374	 O	 -1	 -1	 0
being	 data/source_txt/t1_biology_mkaplan_101.txt	 20375	 20380	 O	 -1	 -1	 0
perfectly	 data/source_txt/t1_biology_mkaplan_101.txt	 20381	 20390	 O	 -1	 -1	 0
linked	 data/source_txt/t1_biology_mkaplan_101.txt	 20391	 20397	 O	 -1	 -1	 0
(	 data/source_txt/t1_biology_mkaplan_101.txt	 20398	 20399	 O	 -1	 -1	 0
recombination	 data/source_txt/t1_biology_mkaplan_101.txt	 20399	 20412	 O	 -1	 -1	 0
frequency	 data/source_txt/t1_biology_mkaplan_101.txt	 20413	 20422	 O	 -1	 -1	 0
=	 data/source_txt/t1_biology_mkaplan_101.txt	 20423	 20424	 O	 -1	 -1	 0
0	 data/source_txt/t1_biology_mkaplan_101.txt	 20425	 20426	 O	 -1	 -1	 0
)	 data/source_txt/t1_biology_mkaplan_101.txt	 20426	 20427	 O	 -1	 -1	 0
to	 data/source_txt/t1_biology_mkaplan_101.txt	 20428	 20430	 O	 -1	 -1	 0
being	 data/source_txt/t1_biology_mkaplan_101.txt	 20431	 20436	 O	 -1	 -1	 0
perfectly	 data/source_txt/t1_biology_mkaplan_101.txt	 20437	 20446	 O	 -1	 -1	 0
unlinked	 data/source_txt/t1_biology_mkaplan_101.txt	 20447	 20455	 O	 -1	 -1	 0
(	 data/source_txt/t1_biology_mkaplan_101.txt	 20456	 20457	 O	 -1	 -1	 0
recombination	 data/source_txt/t1_biology_mkaplan_101.txt	 20457	 20470	 O	 -1	 -1	 0
frequency	 data/source_txt/t1_biology_mkaplan_101.txt	 20471	 20480	 O	 -1	 -1	 0
=	 data/source_txt/t1_biology_mkaplan_101.txt	 20481	 20482	 O	 -1	 -1	 0
0.5	 data/source_txt/t1_biology_mkaplan_101.txt	 20483	 20486	 O	 -1	 -1	 0
)	 data/source_txt/t1_biology_mkaplan_101.txt	 20486	 20487	 O	 -1	 -1	 0
when	 data/source_txt/t1_biology_mkaplan_101.txt	 20488	 20492	 O	 -1	 -1	 0
genes	 data/source_txt/t1_biology_mkaplan_101.txt	 20493	 20498	 O	 -1	 -1	 0
are	 data/source_txt/t1_biology_mkaplan_101.txt	 20499	 20502	 O	 -1	 -1	 0
on	 data/source_txt/t1_biology_mkaplan_101.txt	 20503	 20505	 O	 -1	 -1	 0
different	 data/source_txt/t1_biology_mkaplan_101.txt	 20506	 20515	 O	 -1	 -1	 0
chromosomes	 data/source_txt/t1_biology_mkaplan_101.txt	 20516	 20527	 O	 -1	 -1	 0
or	 data/source_txt/t1_biology_mkaplan_101.txt	 20528	 20530	 O	 -1	 -1	 0
genes	 data/source_txt/t1_biology_mkaplan_101.txt	 20531	 20536	 O	 -1	 -1	 0
are	 data/source_txt/t1_biology_mkaplan_101.txt	 20537	 20540	 O	 -1	 -1	 0
separated	 data/source_txt/t1_biology_mkaplan_101.txt	 20541	 20550	 O	 -1	 -1	 0
very	 data/source_txt/t1_biology_mkaplan_101.txt	 20551	 20555	 O	 -1	 -1	 0
far	 data/source_txt/t1_biology_mkaplan_101.txt	 20556	 20559	 O	 -1	 -1	 0
apart	 data/source_txt/t1_biology_mkaplan_101.txt	 20560	 20565	 O	 -1	 -1	 0
on	 data/source_txt/t1_biology_mkaplan_101.txt	 20566	 20568	 O	 -1	 -1	 0
the	 data/source_txt/t1_biology_mkaplan_101.txt	 20569	 20572	 O	 -1	 -1	 0
same	 data/source_txt/t1_biology_mkaplan_101.txt	 20573	 20577	 O	 -1	 -1	 0
chromosome	 data/source_txt/t1_biology_mkaplan_101.txt	 20578	 20588	 O	 -1	 -1	 0
.	 data/source_txt/t1_biology_mkaplan_101.txt	 20588	 20589	 O	 -1	 -1	 0

Lines 3764-3840. Error in line 3789

Bug - handling last sentence in task1_converter.py

Hi

The conversion script task1_converter.py does not handle the last sentence in any of the deft_files that don't end in a blank line (which is all of those in the train subdirectory). After going through all the lines, the code doesn't check whether a new sentence is still waiting to be concatenated.
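
The usual fix is to flush the pending sentence once the loop ends. The sketch below is not the code from task1_converter.py: `split_sentences` is a hypothetical, simplified splitter that treats every blank line as a boundary and takes the token from the first tab-separated column.

```python
def split_sentences(lines):
    """Group token lines into sentences separated by blank lines."""
    sentences = []
    current = []
    for line in lines:
        if line.strip() == "":
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(line.split("\t")[0].strip())
    # Flush the last sentence when the file does not end in a blank line.
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Without the final `if current:` block, the last sentence of any file lacking a trailing blank line would be silently dropped.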

This brings up another question: the corpus size in terms of sentences. I've not been able to match the figures in the paper against any of the sets of files in this repo, so I wanted to check how many sentences there should in fact be in total.

Thanks

Tony

[TOKENIZATION] Tokens with strange points and brackets #1

This is the first report of the 47 tokenization problems I found (in the train data only).

Filepath

train/t1_biology_1_606.deft

Content

When	data/source_txt/train/t1_biology_1_606.txt	 21268	 21272	 O	 -1	 -1	 0
the	data/source_txt/train/t1_biology_1_606.txt	 21273	 21276	 O	 -1	 -1	 0
population	data/source_txt/train/t1_biology_1_606.txt	 21277	 21287	 O	 -1	 -1	 0
size	data/source_txt/train/t1_biology_1_606.txt	 21288	 21292	 O	 -1	 -1	 0
,	data/source_txt/train/t1_biology_1_606.txt	 21292	 21293	 O	 -1	 -1	 0
N	data/source_txt/train/t1_biology_1_606.txt	 21294	 21295	 O	 -1	 -1	 0
,	data/source_txt/train/t1_biology_1_606.txt	 21295	 21296	 O	 -1	 -1	 0
is	data/source_txt/train/t1_biology_1_606.txt	 21297	 21299	 O	 -1	 -1	 0
plotted	data/source_txt/train/t1_biology_1_606.txt	 21300	 21307	 O	 -1	 -1	 0
over	data/source_txt/train/t1_biology_1_606.txt	 21308	 21312	 O	 -1	 -1	 0
time	data/source_txt/train/t1_biology_1_606.txt	 21313	 21317	 O	 -1	 -1	 0
,	data/source_txt/train/t1_biology_1_606.txt	 21317	 21318	 O	 -1	 -1	 0
a	data/source_txt/train/t1_biology_1_606.txt	 21319	 21320	 O	 -1	 -1	 0
J	data/source_txt/train/t1_biology_1_606.txt	 21321	 21322	 O	 -1	 -1	 0
-	data/source_txt/train/t1_biology_1_606.txt	 21322	 21323	 O	 -1	 -1	 0
shaped	data/source_txt/train/t1_biology_1_606.txt	 21323	 21329	 O	 -1	 -1	 0
growth	data/source_txt/train/t1_biology_1_606.txt	 21330	 21336	 O	 -1	 -1	 0
curve	data/source_txt/train/t1_biology_1_606.txt	 21337	 21342	 O	 -1	 -1	 0
is	data/source_txt/train/t1_biology_1_606.txt	 21343	 21345	 O	 -1	 -1	 0
produced	data/source_txt/train/t1_biology_1_606.txt	 21346	 21354	 O	 -1	 -1	 0
(	data/source_txt/train/t1_biology_1_606.txt	 21355	 21356	 O	 -1	 -1	 0
[	data/source_txt/train/t1_biology_1_606.txt	 21356	 21357	 O	 -1	 -1	 0
link]).The	data/source_txt/train/t1_biology_1_606.txt	 21357	 21367	 O	 -1	 -1	 0
bacteria	data/source_txt/train/t1_biology_1_606.txt	 21368	 21376	 O	 -1	 -1	 0
example	data/source_txt/train/t1_biology_1_606.txt	 21377	 21384	 O	 -1	 -1	 0
is	data/source_txt/train/t1_biology_1_606.txt	 21385	 21387	 O	 -1	 -1	 0
not	data/source_txt/train/t1_biology_1_606.txt	 21388	 21391	 O	 -1	 -1	 0
representative	data/source_txt/train/t1_biology_1_606.txt	 21392	 21406	 O	 -1	 -1	 0
of	data/source_txt/train/t1_biology_1_606.txt	 21407	 21409	 O	 -1	 -1	 0
the	data/source_txt/train/t1_biology_1_606.txt	 21410	 21413	 O	 -1	 -1	 0
real	data/source_txt/train/t1_biology_1_606.txt	 21414	 21418	 O	 -1	 -1	 0
world	data/source_txt/train/t1_biology_1_606.txt	 21419	 21424	 O	 -1	 -1	 0
where	data/source_txt/train/t1_biology_1_606.txt	 21425	 21430	 O	 -1	 -1	 0
resources	data/source_txt/train/t1_biology_1_606.txt	 21431	 21440	 O	 -1	 -1	 0
are	data/source_txt/train/t1_biology_1_606.txt	 21441	 21444	 O	 -1	 -1	 0
limited	data/source_txt/train/t1_biology_1_606.txt	 21445	 21452	 O	 -1	 -1	 0
.	data/source_txt/train/t1_biology_1_606.txt	 21452	 21453	 O	 -1	 -1	 0

Lines 3522-3558, error in 3544

Additional Information

This mistake merges two sentences.
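
Tokens of this kind can be searched for with a simple pattern. The sketch below flags tokens that end in a period immediately followed by a capitalised word, such as "link]).The". The pattern and the names `FUSED_TOKEN` and `find_fused_tokens` are assumptions made for illustration, and the heuristic can produce false positives (for instance "St.Louis").

```python
import re

# A period directly followed by a capitalised word at the end of the token.
FUSED_TOKEN = re.compile(r'\.[A-Z][a-z]+$')

def find_fused_tokens(deft_lines):
    """Return (line_number, token) pairs for tokens that look like two
    sentences glued together. Line numbers are 1-based."""
    hits = []
    for number, line in enumerate(deft_lines, start=1):
        token = line.split("\t")[0].strip()
        if FUSED_TOKEN.search(token):
            hits.append((number, token))
    return hits
```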

[TOKENIZATION] #7

Filepath

train/t4_psychology_2_303.deft

Content

In	data/source_txt/train/t4_psychology_2_303.txt	 12007	 12009	 O	 -1	 -1	 0
this	data/source_txt/train/t4_psychology_2_303.txt	 12010	 12014	 O	 -1	 -1	 0
dimension	data/source_txt/train/t4_psychology_2_303.txt	 12015	 12024	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12024	 12025	 O	 -1	 -1	 0
people	data/source_txt/train/t4_psychology_2_303.txt	 12026	 12032	 O	 -1	 -1	 0
who	data/source_txt/train/t4_psychology_2_303.txt	 12033	 12036	 O	 -1	 -1	 0
are	data/source_txt/train/t4_psychology_2_303.txt	 12037	 12040	 O	 -1	 -1	 0
high	data/source_txt/train/t4_psychology_2_303.txt	 12041	 12045	 O	 -1	 -1	 0
on	data/source_txt/train/t4_psychology_2_303.txt	 12046	 12048	 O	 -1	 -1	 0
psychoticism	data/source_txt/train/t4_psychology_2_303.txt	 12049	 12061	 O	 -1	 -1	 0
tend	data/source_txt/train/t4_psychology_2_303.txt	 12062	 12066	 O	 -1	 -1	 0
to	data/source_txt/train/t4_psychology_2_303.txt	 12067	 12069	 O	 -1	 -1	 0
be	data/source_txt/train/t4_psychology_2_303.txt	 12070	 12072	 O	 -1	 -1	 0
independent	data/source_txt/train/t4_psychology_2_303.txt	 12073	 12084	 O	 -1	 -1	 0
thinkers	data/source_txt/train/t4_psychology_2_303.txt	 12085	 12093	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12093	 12094	 O	 -1	 -1	 0
cold	data/source_txt/train/t4_psychology_2_303.txt	 12095	 12099	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12099	 12100	 O	 -1	 -1	 0
nonconformists	data/source_txt/train/t4_psychology_2_303.txt	 12101	 12115	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12115	 12116	 O	 -1	 -1	 0
impulsive	data/source_txt/train/t4_psychology_2_303.txt	 12117	 12126	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12126	 12127	 O	 -1	 -1	 0
antisocial	data/source_txt/train/t4_psychology_2_303.txt	 12128	 12138	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12138	 12139	 O	 -1	 -1	 0
and	data/source_txt/train/t4_psychology_2_303.txt	 12140	 12143	 O	 -1	 -1	 0
hostile	data/source_txt/train/t4_psychology_2_303.txt	 12144	 12151	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12151	 12152	 O	 -1	 -1	 0
whereas	data/source_txt/train/t4_psychology_2_303.txt	 12153	 12160	 O	 -1	 -1	 0
people	data/source_txt/train/t4_psychology_2_303.txt	 12161	 12167	 O	 -1	 -1	 0
who	data/source_txt/train/t4_psychology_2_303.txt	 12168	 12171	 O	 -1	 -1	 0
are	data/source_txt/train/t4_psychology_2_303.txt	 12172	 12175	 O	 -1	 -1	 0
high	data/source_txt/train/t4_psychology_2_303.txt	 12176	 12180	 O	 -1	 -1	 0
on	data/source_txt/train/t4_psychology_2_303.txt	 12181	 12183	 O	 -1	 -1	 0
superego	data/source_txt/train/t4_psychology_2_303.txt	 12184	 12192	 O	 -1	 -1	 0
control	data/source_txt/train/t4_psychology_2_303.txt	 12193	 12200	 O	 -1	 -1	 0
tend	data/source_txt/train/t4_psychology_2_303.txt	 12201	 12205	 O	 -1	 -1	 0
to	data/source_txt/train/t4_psychology_2_303.txt	 12206	 12208	 O	 -1	 -1	 0
have	data/source_txt/train/t4_psychology_2_303.txt	 12209	 12213	 O	 -1	 -1	 0
high	data/source_txt/train/t4_psychology_2_303.txt	 12214	 12218	 O	 -1	 -1	 0
impulse	data/source_txt/train/t4_psychology_2_303.txt	 12219	 12226	 O	 -1	 -1	 0
control	data/source_txt/train/t4_psychology_2_303.txt	 12227	 12234	 O	 -1	 -1	 0
—	data/source_txt/train/t4_psychology_2_303.txt	 12234	 12235	 O	 -1	 -1	 0
they	data/source_txt/train/t4_psychology_2_303.txt	 12235	 12239	 O	 -1	 -1	 0
are	data/source_txt/train/t4_psychology_2_303.txt	 12240	 12243	 O	 -1	 -1	 0
more	data/source_txt/train/t4_psychology_2_303.txt	 12244	 12248	 O	 -1	 -1	 0
altruistic	data/source_txt/train/t4_psychology_2_303.txt	 12249	 12259	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12259	 12260	 O	 -1	 -1	 0
empathetic	data/source_txt/train/t4_psychology_2_303.txt	 12261	 12271	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12271	 12272	 O	 -1	 -1	 0
cooperative	data/source_txt/train/t4_psychology_2_303.txt	 12273	 12284	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12284	 12285	 O	 -1	 -1	 0
and	data/source_txt/train/t4_psychology_2_303.txt	 12286	 12289	 O	 -1	 -1	 0
conventional	data/source_txt/train/t4_psychology_2_303.txt	 12290	 12302	 O	 -1	 -1	 0
(	data/source_txt/train/t4_psychology_2_303.txt	 12303	 12304	 O	 -1	 -1	 0
Eysenck	data/source_txt/train/t4_psychology_2_303.txt	 12304	 12311	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12311	 12312	 O	 -1	 -1	 0
Eysenck	data/source_txt/train/t4_psychology_2_303.txt	 12313	 12320	 O	 -1	 -1	 0
&	data/source_txt/train/t4_psychology_2_303.txt	 12321	 12322	 O	 -1	 -1	 0
Barrett	data/source_txt/train/t4_psychology_2_303.txt	 12323	 12330	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12330	 12331	 O	 -1	 -1	 0
1985).While	data/source_txt/train/t4_psychology_2_303.txt	 12332	 12343	 O	 -1	 -1	 0
Cattell	data/source_txt/train/t4_psychology_2_303.txt	 12344	 12351	 O	 -1	 -1	 0
’s	data/source_txt/train/t4_psychology_2_303.txt	 12351	 12353	 O	 -1	 -1	 0
16	data/source_txt/train/t4_psychology_2_303.txt	 12354	 12356	 O	 -1	 -1	 0
factors	data/source_txt/train/t4_psychology_2_303.txt	 12357	 12364	 O	 -1	 -1	 0
may	data/source_txt/train/t4_psychology_2_303.txt	 12365	 12368	 O	 -1	 -1	 0
be	data/source_txt/train/t4_psychology_2_303.txt	 12369	 12371	 O	 -1	 -1	 0
too	data/source_txt/train/t4_psychology_2_303.txt	 12372	 12375	 O	 -1	 -1	 0
broad	data/source_txt/train/t4_psychology_2_303.txt	 12376	 12381	 O	 -1	 -1	 0
,	data/source_txt/train/t4_psychology_2_303.txt	 12381	 12382	 O	 -1	 -1	 0
the	data/source_txt/train/t4_psychology_2_303.txt	 12383	 12386	 O	 -1	 -1	 0
Eysenck	data/source_txt/train/t4_psychology_2_303.txt	 12387	 12394	 O	 -1	 -1	 0
’s	data/source_txt/train/t4_psychology_2_303.txt	 12394	 12396	 O	 -1	 -1	 0
two	data/source_txt/train/t4_psychology_2_303.txt	 12397	 12400	 O	 -1	 -1	 0
-	data/source_txt/train/t4_psychology_2_303.txt	 12400	 12401	 O	 -1	 -1	 0
factor	data/source_txt/train/t4_psychology_2_303.txt	 12401	 12407	 O	 -1	 -1	 0
system	data/source_txt/train/t4_psychology_2_303.txt	 12408	 12414	 O	 -1	 -1	 0
has	data/source_txt/train/t4_psychology_2_303.txt	 12415	 12418	 O	 -1	 -1	 0
been	data/source_txt/train/t4_psychology_2_303.txt	 12419	 12423	 O	 -1	 -1	 0
criticized	data/source_txt/train/t4_psychology_2_303.txt	 12424	 12434	 O	 -1	 -1	 0
for	data/source_txt/train/t4_psychology_2_303.txt	 12435	 12438	 O	 -1	 -1	 0
being	data/source_txt/train/t4_psychology_2_303.txt	 12439	 12444	 O	 -1	 -1	 0
too	data/source_txt/train/t4_psychology_2_303.txt	 12445	 12448	 O	 -1	 -1	 0
narrow	data/source_txt/train/t4_psychology_2_303.txt	 12449	 12455	 O	 -1	 -1	 0
.	data/source_txt/train/t4_psychology_2_303.txt	 12455	 12456	 O	 -1	 -1	 0

Lines 1693-1777. Error in line 1753.

[TOKENIZATION] #4

Update: the examples originally came from old data; they now come from the current repository data.

Filepath

train/t4_psychology_2_202.deft

Content

Behaviorists	 data/source_txt/t4_psychology_rlacroix_202.txt	 32067	 32079	 O	 -1	 -1	 0
such	 data/source_txt/t4_psychology_rlacroix_202.txt	 32080	 32084	 O	 -1	 -1	 0
as	 data/source_txt/t4_psychology_rlacroix_202.txt	 32085	 32087	 O	 -1	 -1	 0
Joseph	 data/source_txt/t4_psychology_rlacroix_202.txt	 32088	 32094	 O	 -1	 -1	 0
Wolpe	 data/source_txt/t4_psychology_rlacroix_202.txt	 32095	 32100	 O	 -1	 -1	 0
also	 data/source_txt/t4_psychology_rlacroix_202.txt	 32101	 32105	 O	 -1	 -1	 0
influenced	 data/source_txt/t4_psychology_rlacroix_202.txt	 32106	 32116	 O	 -1	 -1	 0
Ellis	 data/source_txt/t4_psychology_rlacroix_202.txt	 32117	 32122	 O	 -1	 -1	 0
’s	 data/source_txt/t4_psychology_rlacroix_202.txt	 32122	 32124	 O	 -1	 -1	 0
therapeutic	 data/source_txt/t4_psychology_rlacroix_202.txt	 32125	 32136	 O	 -1	 -1	 0
approach	 data/source_txt/t4_psychology_rlacroix_202.txt	 32137	 32145	 O	 -1	 -1	 0
(	 data/source_txt/t4_psychology_rlacroix_202.txt	 32146	 32147	 O	 -1	 -1	 0
National	 data/source_txt/t4_psychology_rlacroix_202.txt	 32147	 32155	 O	 -1	 -1	 0
Association	 data/source_txt/t4_psychology_rlacroix_202.txt	 32156	 32167	 O	 -1	 -1	 0
of	 data/source_txt/t4_psychology_rlacroix_202.txt	 32168	 32170	 O	 -1	 -1	 0
Cognitive	 data/source_txt/t4_psychology_rlacroix_202.txt	 32171	 32180	 O	 -1	 -1	 0
-	 data/source_txt/t4_psychology_rlacroix_202.txt	 32180	 32181	 O	 -1	 -1	 0
Behavioral	 data/source_txt/t4_psychology_rlacroix_202.txt	 32181	 32191	 O	 -1	 -1	 0
Therapists	 data/source_txt/t4_psychology_rlacroix_202.txt	 32192	 32202	 O	 -1	 -1	 0
,	 data/source_txt/t4_psychology_rlacroix_202.txt	 32202	 32203	 O	 -1	 -1	 0
2009).Cognitive	 data/source_txt/t4_psychology_rlacroix_202.txt	 32204	 32219	 B-Term	 T161	 0	 AKA
-	 data/source_txt/t4_psychology_rlacroix_202.txt	 32219	 32220	 I-Term	 T161	 0	 AKA
behavioral	 data/source_txt/t4_psychology_rlacroix_202.txt	 32220	 32230	 I-Term	 T161	 0	 AKA
therapy	 data/source_txt/t4_psychology_rlacroix_202.txt	 32231	 32238	 I-Term	 T161	 0	 AKA
(	 data/source_txt/t4_psychology_rlacroix_202.txt	 32239	 32240	 O	 -1	 -1	 0
CBT	 data/source_txt/t4_psychology_rlacroix_202.txt	 32240	 32243	 B-Alias-Term	 T160	 T161	 AKA
)	 data/source_txt/t4_psychology_rlacroix_202.txt	 32243	 32244	 O	 -1	 -1	 0
helps	 data/source_txt/t4_psychology_rlacroix_202.txt	 32245	 32250	 B-Definition	 T159	 T161	 Direct-Defines
clients	 data/source_txt/t4_psychology_rlacroix_202.txt	 32251	 32258	 I-Definition	 T159	 T161	 Direct-Defines
examine	 data/source_txt/t4_psychology_rlacroix_202.txt	 32259	 32266	 I-Definition	 T159	 T161	 Direct-Defines
how	 data/source_txt/t4_psychology_rlacroix_202.txt	 32267	 32270	 I-Definition	 T159	 T161	 Direct-Defines
their	 data/source_txt/t4_psychology_rlacroix_202.txt	 32271	 32276	 I-Definition	 T159	 T161	 Direct-Defines
thoughts	 data/source_txt/t4_psychology_rlacroix_202.txt	 32277	 32285	 I-Definition	 T159	 T161	 Direct-Defines
affect	 data/source_txt/t4_psychology_rlacroix_202.txt	 32286	 32292	 I-Definition	 T159	 T161	 Direct-Defines
their	 data/source_txt/t4_psychology_rlacroix_202.txt	 32293	 32298	 I-Definition	 T159	 T161	 Direct-Defines
behavior	 data/source_txt/t4_psychology_rlacroix_202.txt	 32299	 32307	 I-Definition	 T159	 T161	 Direct-Defines
.	 data/source_txt/t4_psychology_rlacroix_202.txt	 32307	 32308	 O	 -1	 -1	 0

Lines 5568-5604. Error in line 5588.

Missing relations

I found 266 examples (context windows) in the train and dev sets that contain tokens with a root_id of "0" and some tag_id, say TXXX, while no token in the same example has root_id TXXX.

For example, here are such T105 tokens:

data/source_txt/t3_physics_2_101.deft
TOKEN ROOT_ID TAG_ID RELATION
3161 -1 -1 0
. -1 -1 0
Another -1 -1 0
is -1 -1 0
what -1 -1 0
Democritus -1 -1 0
in -1 -1 0
particular -1 -1 0
believed -1 -1 0
— -1 -1 0
that -1 -1 0
there 0 T106 0
is 0 T106 0
a 0 T106 0
smallest 0 T106 0
unit 0 T106 0
that 0 T106 0
can 0 T106 0
not 0 T106 0
be 0 T106 0
further 0 T106 0
subdivided 0 T106 0
. -1 -1 0
Democritus -1 -1 0
called -1 -1 0
this T106 T194 Refers-To
the 0 T105 0
atom 0 T105 0

. -1 -1 0
We -1 -1 0
now -1 -1 0
know -1 -1 0
that -1 -1 0
atoms -1 -1 0
themselves -1 -1 0
can -1 -1 0
be -1 -1 0
subdivided -1 -1 0
, -1 -1 0
but -1 -1 0
their -1 -1 0
identity -1 -1 0
is -1 -1 0
destroyed -1 -1 0
in -1 -1 0
the -1 -1 0
process -1 -1 0
, -1 -1 0
so -1 -1 0
the -1 -1 0
Greeks -1 -1 0
were -1 -1 0
correct -1 -1 0
in -1 -1 0
a -1 -1 0
respect -1 -1 0
. -1 -1 0
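
The check described above can be sketched as follows. It assumes each row is given as (token, root_id, tag_id, relation), in the column order of the listing; `unreferenced_tags` is a hypothetical helper, and whether an unreferenced tag is really an annotation error is for the maintainers to confirm.

```python
def unreferenced_tags(rows):
    """Return tag IDs whose root_id is "0" and that no other token
    in the same example names as its root_id."""
    rows = list(rows)
    # Tags that sit at the root of a relation.
    root_zero = {tag for _, root, tag, _ in rows if root == "0" and tag != "-1"}
    # Tags that some other token points to through its root_id.
    referenced = {root for _, root, _, _ in rows if root not in ("0", "-1")}
    return sorted(root_zero - referenced)

example = [
    ("there", "0", "T106", "0"),
    ("this", "T106", "T194", "Refers-To"),
    ("the", "0", "T105", "0"),
    ("atom", "0", "T105", "0"),
    (".", "-1", "-1", "0"),
]
print(unreferenced_tags(example))
```

For the rows shown, T106 is referenced by "this" while T105 is not, so only T105 is reported.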

Duplicate information

There appears to be some duplicate information, at least in data/deft_files, in three files which have "jlee" in their names instead of the 0, 1, 2 used in the rest. Are these files meant to be present, or have they slipped in by mistake? Could you clarify this?

Also, I noted that the same sentences can appear multiple times (even in the same group of three) within an individual file, which I assume has arisen from sampling the different "bold" terms and producing independent instances. Were these annotated in BRAT in separate documents or within the same one? I'm just looking at the potential reuse of annotation IDs (Txx etc.) which may occur.

Thanks

Tony

A few bad tags in deft_files

Hi

It appears there are a tiny number of tags in the files missing the appropriate BIO prefix:

Definition | 15
Term | 10
Referential-Definition | 2
Alias-Term | 1
Secondary-Definition | 1

Could you confirm whether these should have the missing prefix or whether they signify something else? Thanks.
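
Counts like the ones above can be reproduced with a short script. This sketch assumes the tag is the fifth tab-separated column of an eight-column line, as in the .deft excerpts quoted in other reports; `tags_missing_bio_prefix` is a hypothetical name.

```python
from collections import Counter

def tags_missing_bio_prefix(deft_lines):
    """Count tags that are neither "O" nor prefixed with "B-" or "I-"."""
    counts = Counter()
    for line in deft_lines:
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) < 8:
            continue  # blank or malformed line
        tag = cols[4]
        if tag != "O" and not tag.startswith(("B-", "I-")):
            counts[tag] += 1
    return counts
```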

Tony

Double sentences in corpus

I might be missing something, but why do some sentences appear twice in a row in the corpus? E.g., the sentence "There are usually acknowledgment and reference sections as well as an abstract ( a concise summary ) at the beginning of the paper ." appears twice in a row in the file data/deft_files/train/t1_biology_0_0.deft.
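
To find every such case, one could compare each sentence with the one before it. A minimal sketch, assuming the sentences of a file have already been extracted into a list of strings; `consecutive_duplicates` is a hypothetical helper.

```python
def consecutive_duplicates(sentences):
    """Return the indices of sentences identical to their predecessor."""
    return [
        i for i in range(1, len(sentences))
        if sentences[i] == sentences[i - 1]
    ]
```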

[TOKENIZATION] #8

Update: the examples originally came from old data; they now come from the current repository data.

Filepath

train/t4_psychology_0_101.deft

Content

Merkel	 data/source_txt/t4_psychology_jlee_101.txt	 4569	 4575	 O	 -1	 -1	 0
’s	 data/source_txt/t4_psychology_jlee_101.txt	 4575	 4577	 O	 -1	 -1	 0
disks	 data/source_txt/t4_psychology_jlee_101.txt	 4578	 4583	 O	 -1	 -1	 0
respond	 data/source_txt/t4_psychology_jlee_101.txt	 4584	 4591	 O	 -1	 -1	 0
to	 data/source_txt/t4_psychology_jlee_101.txt	 4592	 4594	 O	 -1	 -1	 0
light	 data/source_txt/t4_psychology_jlee_101.txt	 4595	 4600	 O	 -1	 -1	 0
pressure	 data/source_txt/t4_psychology_jlee_101.txt	 4601	 4609	 O	 -1	 -1	 0
,	 data/source_txt/t4_psychology_jlee_101.txt	 4609	 4610	 O	 -1	 -1	 0
while	 data/source_txt/t4_psychology_jlee_101.txt	 4611	 4616	 O	 -1	 -1	 0
Ruffini	 data/source_txt/t4_psychology_jlee_101.txt	 4617	 4624	 O	 -1	 -1	 0
corpuscles	 data/source_txt/t4_psychology_jlee_101.txt	 4625	 4635	 O	 -1	 -1	 0
detect	 data/source_txt/t4_psychology_jlee_101.txt	 4636	 4642	 O	 -1	 -1	 0
stretch	 data/source_txt/t4_psychology_jlee_101.txt	 4643	 4650	 O	 -1	 -1	 0
(	 data/source_txt/t4_psychology_jlee_101.txt	 4651	 4652	 O	 -1	 -1	 0
Abraira	 data/source_txt/t4_psychology_jlee_101.txt	 4652	 4659	 O	 -1	 -1	 0
&	 data/source_txt/t4_psychology_jlee_101.txt	 4660	 4661	 O	 -1	 -1	 0
Ginty	 data/source_txt/t4_psychology_jlee_101.txt	 4662	 4667	 O	 -1	 -1	 0
,	 data/source_txt/t4_psychology_jlee_101.txt	 4667	 4668	 O	 -1	 -1	 0
2013).There	 data/source_txt/t4_psychology_jlee_101.txt	 4669	 4680	 O	 -1	 -1	 0
are	 data/source_txt/t4_psychology_jlee_101.txt	 4681	 4684	 O	 -1	 -1	 0
many	 data/source_txt/t4_psychology_jlee_101.txt	 4685	 4689	 O	 -1	 -1	 0
types	 data/source_txt/t4_psychology_jlee_101.txt	 4690	 4695	 O	 -1	 -1	 0
of	 data/source_txt/t4_psychology_jlee_101.txt	 4696	 4698	 O	 -1	 -1	 0
sensory	 data/source_txt/t4_psychology_jlee_101.txt	 4699	 4706	 O	 -1	 -1	 0
receptors	 data/source_txt/t4_psychology_jlee_101.txt	 4707	 4716	 O	 -1	 -1	 0
located	 data/source_txt/t4_psychology_jlee_101.txt	 4717	 4724	 O	 -1	 -1	 0
in	 data/source_txt/t4_psychology_jlee_101.txt	 4725	 4727	 O	 -1	 -1	 0
the	 data/source_txt/t4_psychology_jlee_101.txt	 4728	 4731	 O	 -1	 -1	 0
skin	 data/source_txt/t4_psychology_jlee_101.txt	 4732	 4736	 O	 -1	 -1	 0
,	 data/source_txt/t4_psychology_jlee_101.txt	 4736	 4737	 O	 -1	 -1	 0
each	 data/source_txt/t4_psychology_jlee_101.txt	 4738	 4742	 O	 -1	 -1	 0
attuned	 data/source_txt/t4_psychology_jlee_101.txt	 4743	 4750	 O	 -1	 -1	 0
to	 data/source_txt/t4_psychology_jlee_101.txt	 4751	 4753	 O	 -1	 -1	 0
specific	 data/source_txt/t4_psychology_jlee_101.txt	 4754	 4762	 O	 -1	 -1	 0
touch	 data/source_txt/t4_psychology_jlee_101.txt	 4763	 4768	 O	 -1	 -1	 0
-	 data/source_txt/t4_psychology_jlee_101.txt	 4768	 4769	 O	 -1	 -1	 0
related	 data/source_txt/t4_psychology_jlee_101.txt	 4769	 4776	 O	 -1	 -1	 0
stimuli	 data/source_txt/t4_psychology_jlee_101.txt	 4777	 4784	 O	 -1	 -1	 0
.	 data/source_txt/t4_psychology_jlee_101.txt	 4784	 4785	 O	 -1	 -1	 0

Lines 797-835. Error in line 815.
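
One rough way to look for tokens like `2013).There` is to flag any multi-character token that contains a bracket or a double quote. This is only a heuristic sketch: the character set and the function name `find_suspect_tokens` are assumptions, and it will also flag legitimate tokens that happen to contain those characters.

```python
# Assumption: brackets and double quotes should normally be tokens of their own.
SUSPECT_CHARS = set('()[]{}“”"')


def find_suspect_tokens(lines):
    """Return (line_number, token) pairs for tokens that look under-split."""
    hits = []
    for number, line in enumerate(lines, start=1):
        token = line.split("\t")[0].strip()  # the token is the first column
        if len(token) > 1 and any(ch in SUSPECT_CHARS for ch in token):
            hits.append((number, token))
    return hits


sample = [
    ",\t data/source_txt/t4_psychology_jlee_101.txt\t 4667\t 4668\t O\t -1\t -1\t 0",
    "2013).There\t data/source_txt/t4_psychology_jlee_101.txt\t 4669\t 4680\t O\t -1\t -1\t 0",
    "are\t data/source_txt/t4_psychology_jlee_101.txt\t 4681\t 4684\t O\t -1\t -1\t 0",
]
print(find_suspect_tokens(sample))
```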

Deft and source mismatch

Data mismatch between deft files and corresponding source files: source doesn't represent deft.

For example, in data/deft_files/dev/t1_biology_0_0.deft

2	data/source_txt/dev/t1_biology_0_0.txt	 0	 1	 O	 -1	 -1	 0
.	data/source_txt/dev/t1_biology_0_0.txt	 1	 2	 O	 -1	 -1	 0

It	data/source_txt/dev/t1_biology_0_0.txt	 3	 5	 O	 -1	 -1	 0
becomes	data/source_txt/dev/t1_biology_0_0.txt	 6	 13	 O	 -1	 -1	 0

and in data/source_txt/dev/t1_biology_0_0.txt

5. Science includes such diverse fields as astronomy, biology, computer sciences, geology, logic, physics, chemistry
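
A small sketch of how this kind of mismatch could be detected automatically, assuming the third and fourth columns are character offsets into the source file with an exclusive end (which is consistent with the rows above, but not confirmed). The function name `find_offset_mismatches` is hypothetical.

```python
def find_offset_mismatches(deft_lines, source_text):
    """Return (line_number, deft_token, source_slice) where the two differ."""
    mismatches = []
    for number, line in enumerate(deft_lines, start=1):
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 4:
            continue  # blank sentence separators
        token, start, end = fields[0], int(fields[2]), int(fields[3])
        source_slice = source_text[start:end]  # assumption: end is exclusive
        if source_slice != token:
            mismatches.append((number, token, source_slice))
    return mismatches


# The two deft rows and the start of the source line quoted in this issue.
source_text = "5. Science includes such diverse fields"
deft_lines = [
    "2\tdata/source_txt/dev/t1_biology_0_0.txt\t 0\t 1\t O\t -1\t -1\t 0",
    ".\tdata/source_txt/dev/t1_biology_0_0.txt\t 1\t 2\t O\t -1\t -1\t 0",
]
print(find_offset_mismatches(deft_lines, source_text))
```

Here the first row is reported because the source starts with "5" where the deft file has "2"; the second row matches.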

Also for dev dataset - in the deft folder there are files for which there are no corresponding files in the source folder and vice versa:
t4_psychology_2_202.deft, t5_economic_2_202.deft, t5_economic_2_303.deft, t7_government_2_101.deft, t7_government_2_202.deft

t4_psychology_1_202.txt, t5_economic_1_202.txt, t5_economic_1_303.txt, t7_government_1_101.txt
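
The unmatched files can be listed by comparing file names without their extensions. This is a sketch under the assumption that a .deft file and its source .txt file share the same stem; `unmatched_files` is a hypothetical helper.

```python
from pathlib import Path


def unmatched_files(deft_names, source_names):
    """Return (stems only in deft, stems only in source), each sorted."""
    deft_stems = {Path(name).stem for name in deft_names}
    source_stems = {Path(name).stem for name in source_names}
    return sorted(deft_stems - source_stems), sorted(source_stems - deft_stems)


# File names taken from the lists above, plus one pair that does match.
deft_names = ["t1_biology_0_0.deft", "t4_psychology_2_202.deft"]
source_names = ["t1_biology_0_0.txt", "t4_psychology_1_202.txt"]
print(unmatched_files(deft_names, source_names))
```

With real directories, the name lists could come from something like `[p.name for p in Path(deft_dir).glob("*.deft")]`, where `deft_dir` is whatever path the dev split lives under.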

tokenization errors cause label errors

There are some tokenization errors in your data, and these tokenization errors cause label errors. For example:
requickened”—assigned data/source_txt/train/t2_history_2_0.txt 4149 4170 Term T19 0 Direct-Defines

money”—an data/source_txt/train/t2_history_1_101.txt 766 775 Term T3 0 Direct-Defines

law”—is data/source_txt/train/t6_sociology_2_0.txt 18785 18792 Qualifier T210 T211 Supplements

Here, "Term" should be "I-Term" or "B-Term", and "Qualifier" should be "B-Qualifier" or "I-Qualifier". I found 49 such errors in train. Please fix this bug, because it affects task 2.

Add readme in /evaluation/program

It would be nice to add a readme file in /evaluation/program. There is a readme in evaluation/old/README.md but since it is in an old folder it is unclear whether it is still up-to-date.

Understanding the output of task1_converter

The output of the task1_converter program doesn't seem to be very clean: I see a lot of sentences like " . 178" "0". Is this expected, and are we supposed to clean such sentences up, or am I using the program incorrectly? To run the program I use: python task1_converter.py ./data/deft_files/train ./output

Labeled data is not the same size as the unlabeled data

Hello,

I was checking the released labeled data.
It looks like, for task 2, the unlabeled data does not match the size of the labeled data.

For instance, for the file task_2_t1_biology_0_0.deft, the labeled data has 519 lines while the unlabeled one has 475 lines.

Is there a reason for this?

Thanks

[Bug] in evaluation

I tested the evaluation scripts with the provided code, config file, and test files.
However, I found that the reported performance is quite different from a manual calculation.

I suspect that the parameter name should be labels rather than target_names in semeval2020_0601_eval.py and semeval2020_0602_eval.py.
Alternatively, it would be helpful if the authors could specify the version of scikit-learn they used.

The script output is shown below; my scikit-learn version is 0.20.3.

              precision    recall  f1-score   support

      HasDef       0.00      0.00      0.00         2
       NoDef       0.60      1.00      0.75         3

   micro avg       0.60      0.60      0.60         5
   macro avg       0.30      0.50      0.37         5
weighted avg       0.36      0.60      0.45         5


              precision    recall  f1-score   support

      B-Term       0.50      0.50      0.50         2
      I-Term       1.00      1.00      1.00         2
B-Definition       0.88      0.78      0.82         9
I-Definition       0.67      0.80      0.73         5

   micro avg       0.78      0.78      0.78        18
   macro avg       0.76      0.77      0.76        18
weighted avg       0.79      0.78      0.78        18


{'Direct-Defines': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'macro': {'p': 1.0, 'f': 1.0}}
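
To illustrate why the choice between labels and target_names can matter: in scikit-learn's classification_report, target_names only supplies display names that are matched to the labels by position, and when labels is not given the labels are taken in sorted order. The sketch below mimics that pairing in plain Python with per-label support counts; it is not the evaluation script's code, `support_rows` is a hypothetical helper, and whether this is the actual cause of the discrepancy is not confirmed.

```python
from collections import Counter


def support_rows(y_true, display_names, labels=None):
    """Pair display names with per-label support counts, by position only.

    When `labels` is None, the labels are taken in sorted order, so the
    display names must be given in that same order to line up correctly.
    """
    support = Counter(y_true)
    if labels is None:
        labels = sorted(support)
    return [(name, support[label]) for name, label in zip(display_names, labels)]


y_true = ["B-Term", "I-Term", "B-Definition", "B-Definition",
          "B-Definition", "I-Definition"]
names = ["B-Term", "I-Term", "B-Definition", "I-Definition"]

# Display names only: the 'B-Term' row silently shows B-Definition's count.
print(support_rows(y_true, names))
# Explicit label order: every row shows its own count.
print(support_rows(y_true, names, labels=names))
```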

Relation annotations with missing head entities

There are several annotated relations that link to a missing head entity ID. I have fixed some in the dev set (#19), but there are a lot more examples in the training set.

This is an example from "data/deft_files/train/t1_biology_2_0.deft" (missing T220):

1802 99  data/source_txt/train/t1_biology_2_0.txt   12106   12108   O   -1  -1  0
1803 . data/source_txt/train/t1_biology_2_0.txt   12108   12109   O   -1  -1  0
1804 litmus  data/source_txt/train/t1_biology_2_0.txt   12119   12125   B-Alias-Term  T219  T220  AKA
1805 or  data/source_txt/train/t1_biology_2_0.txt   12126   12128   O   -1  -1  0                               
1806 pH  data/source_txt/train/t1_biology_2_0.txt   12129   12131   O   -1  -1  0
1807 paper data/source_txt/train/t1_biology_2_0.txt   12132   12137  B-Alias-Term-frag  T219-frag   T219  fragment
1808 , data/source_txt/train/t1_biology_2_0.txt   12137   12138   O   -1  -1  0
1809 filter  data/source_txt/train/t1_biology_2_0.txt   12139   12145   B-Definition  T221  T220  Direct-Defines
1810 paper data/source_txt/train/t1_biology_2_0.txt   12146   12151   I-Definition  T221  T220  Direct-Defines
1811 that  data/source_txt/train/t1_biology_2_0.txt   12152   12156   I-Definition  T221  T220  Direct-Defines
1812 has data/source_txt/train/t1_biology_2_0.txt   12157   12160   I-Definition  T221  T220  Direct-Defines
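
A sketch of how the remaining cases could be found, assuming tab-separated rows where the sixth column is the entity's own ID and the seventh is the head ID, with "-1" and "0" meaning "no head", and assuming IDs are unique within a file. The helper `find_missing_heads` and the shortened sample rows are hypothetical.

```python
def find_missing_heads(lines):
    """Map each head ID that is never defined to the first line using it."""
    defined = set()
    referenced = {}
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 8:
            continue  # blank sentence separators and short rows
        tag_id, root_id = fields[5], fields[6]
        if tag_id != "-1":
            defined.add(tag_id)
        if root_id not in ("-1", "0"):  # assumption: these mean "no head"
            referenced.setdefault(root_id, number)
    return {root: number for root, number in referenced.items()
            if root not in defined}


# Shortened rows modelled on the excerpt above; T220 is never defined.
sample = [
    "litmus\t f.txt\t 12119\t 12125\t B-Alias-Term\t T219\t T220\t AKA",
    "or\t f.txt\t 12126\t 12128\t O\t -1\t -1\t 0",
    "filter\t f.txt\t 12139\t 12145\t B-Definition\t T221\t T220\t Direct-Defines",
]
print(find_missing_heads(sample))
```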

[TOKENIZATION] #11

Update: the examples were originally taken from old data; they now come from the current repository data.

Filepath

train/t6_sociology_1_101.deft

Content

Madame	 data/source_txt/t6_sociology_mkaplan_101.txt	 15507	 15513	 O	 -1	 -1	 0
Jeanne	 data/source_txt/t6_sociology_mkaplan_101.txt	 15514	 15520	 O	 -1	 -1	 0
Calment	 data/source_txt/t6_sociology_mkaplan_101.txt	 15521	 15528	 O	 -1	 -1	 0
of	 data/source_txt/t6_sociology_mkaplan_101.txt	 15529	 15531	 O	 -1	 -1	 0
France	 data/source_txt/t6_sociology_mkaplan_101.txt	 15532	 15538	 O	 -1	 -1	 0
was	 data/source_txt/t6_sociology_mkaplan_101.txt	 15539	 15542	 O	 -1	 -1	 0
the	 data/source_txt/t6_sociology_mkaplan_101.txt	 15543	 15546	 O	 -1	 -1	 0
world	 data/source_txt/t6_sociology_mkaplan_101.txt	 15547	 15552	 O	 -1	 -1	 0
's	 data/source_txt/t6_sociology_mkaplan_101.txt	 15552	 15554	 O	 -1	 -1	 0
oldest	 data/source_txt/t6_sociology_mkaplan_101.txt	 15555	 15561	 O	 -1	 -1	 0
living	 data/source_txt/t6_sociology_mkaplan_101.txt	 15563	 15569	 O	 -1	 -1	 0
person	 data/source_txt/t6_sociology_mkaplan_101.txt	 15570	 15576	 O	 -1	 -1	 0
until	 data/source_txt/t6_sociology_mkaplan_101.txt	 15577	 15582	 O	 -1	 -1	 0
she	 data/source_txt/t6_sociology_mkaplan_101.txt	 15583	 15586	 O	 -1	 -1	 0
died	 data/source_txt/t6_sociology_mkaplan_101.txt	 15587	 15591	 O	 -1	 -1	 0
at	 data/source_txt/t6_sociology_mkaplan_101.txt	 15592	 15594	 O	 -1	 -1	 0
122	 data/source_txt/t6_sociology_mkaplan_101.txt	 15595	 15598	 O	 -1	 -1	 0
years	 data/source_txt/t6_sociology_mkaplan_101.txt	 15599	 15604	 O	 -1	 -1	 0
old	 data/source_txt/t6_sociology_mkaplan_101.txt	 15605	 15608	 O	 -1	 -1	 0
;	 data/source_txt/t6_sociology_mkaplan_101.txt	 15608	 15609	 O	 -1	 -1	 0
there	 data/source_txt/t6_sociology_mkaplan_101.txt	 15610	 15615	 O	 -1	 -1	 0
are	 data/source_txt/t6_sociology_mkaplan_101.txt	 15616	 15619	 O	 -1	 -1	 0
currently	 data/source_txt/t6_sociology_mkaplan_101.txt	 15620	 15629	 O	 -1	 -1	 0
six	 data/source_txt/t6_sociology_mkaplan_101.txt	 15630	 15633	 O	 -1	 -1	 0
women	 data/source_txt/t6_sociology_mkaplan_101.txt	 15634	 15639	 O	 -1	 -1	 0
in	 data/source_txt/t6_sociology_mkaplan_101.txt	 15640	 15642	 O	 -1	 -1	 0
the	 data/source_txt/t6_sociology_mkaplan_101.txt	 15643	 15646	 O	 -1	 -1	 0
world	 data/source_txt/t6_sociology_mkaplan_101.txt	 15647	 15652	 O	 -1	 -1	 0
whose	 data/source_txt/t6_sociology_mkaplan_101.txt	 15653	 15658	 O	 -1	 -1	 0
ages	 data/source_txt/t6_sociology_mkaplan_101.txt	 15659	 15663	 O	 -1	 -1	 0
are	 data/source_txt/t6_sociology_mkaplan_101.txt	 15664	 15667	 O	 -1	 -1	 0
well	 data/source_txt/t6_sociology_mkaplan_101.txt	 15668	 15672	 O	 -1	 -1	 0
documented	 data/source_txt/t6_sociology_mkaplan_101.txt	 15673	 15683	 O	 -1	 -1	 0
as	 data/source_txt/t6_sociology_mkaplan_101.txt	 15684	 15686	 O	 -1	 -1	 0
115	 data/source_txt/t6_sociology_mkaplan_101.txt	 15687	 15690	 O	 -1	 -1	 0
years	 data/source_txt/t6_sociology_mkaplan_101.txt	 15691	 15696	 O	 -1	 -1	 0
or	 data/source_txt/t6_sociology_mkaplan_101.txt	 15697	 15699	 O	 -1	 -1	 0
older	 data/source_txt/t6_sociology_mkaplan_101.txt	 15700	 15705	 O	 -1	 -1	 0
(	 data/source_txt/t6_sociology_mkaplan_101.txt	 15706	 15707	 O	 -1	 -1	 0
Diebel	 data/source_txt/t6_sociology_mkaplan_101.txt	 15707	 15713	 O	 -1	 -1	 0
2014)	 data/source_txt/t6_sociology_mkaplan_101.txt	 15714	 15719	 O	 -1	 -1	 0
.	 data/source_txt/t6_sociology_mkaplan_101.txt	 15719	 15720	 O	 -1	 -1	 0

Lines 2489-2530. Error in line 2529.

Data Mismatch and Tokenization error

I found some data mismatches between the deft and source files, in deft_files/train/t7_government_0_101.deft and source_txt/train/t7_government_0_101.txt.

The text after L110 and this line does not exist in source text file.

This text in source file is not present in deft file

There are also tokenization errors at lines 67, 107, and 263.

This is just one example; there may be more cases like this. Please try to resolve this ASAP.
@sashaspala @Franck-Dernoncourt

Thanks
