Comments (2)
I have removed the extraneous files you mentioned.
Sentences that are duplicated within the same context window are due to overlapping label spans - you can read more about that in the FAQs here and in a previous issue here.
Sentences that are repeats from previous context windows are typically a result of bold words being found in neighboring sentences. So, if we have 4 sentences, and bold words appear in sentence 2 and sentence 3, we will have context windows structured like this:
SENTENCE 1
SENTENCE 2
SENTENCE 3

SENTENCE 2
SENTENCE 3
SENTENCE 4
This is to make sure that we provide context windows for all sentences where possible, increasing the chances that the annotators are able to find any relevant cross-sentence relationships.
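To make the windowing concrete, here is a minimal sketch that reproduces the 4-sentence example. It assumes each window is the sentence containing a bold word plus one neighbor on each side; that size is inferred from the example, and `build_context_windows` is a hypothetical helper, not code from the corpus tooling.

```python
from typing import List, Set


def build_context_windows(sentences: List[str], bold_indices: Set[int]) -> List[List[str]]:
    """Return one window per sentence that contains a bold word.

    Assumption: a window is (previous sentence, target sentence, next sentence),
    clipped at the start and end of the document.
    """
    windows = []
    for i in sorted(bold_indices):
        start = max(0, i - 1)
        end = min(len(sentences), i + 2)
        windows.append(sentences[start:end])
    return windows


sentences = ["SENTENCE 1", "SENTENCE 2", "SENTENCE 3", "SENTENCE 4"]
# Bold words appear in sentence 2 and sentence 3 (zero-based indices 1 and 2).
for window in build_context_windows(sentences, {1, 2}):
    print(window)
```

Under that assumption, sentences 2 and 3 each appear in both windows, which is the repetition described above.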
At annotation time, we treat these cases exactly the same as you see in the text files, so annotators would see duplicated sentences and mark them individually. That does mean that if you see the same sentence repeated in two different context windows, their repeated annotations will have new tag IDs. AFAIK there should be no tag ID collisions between context windows in the .deft files.
from deft_corpus.
Thanks for those other pointers. I thought it might be due to the multiple instances per sentence. I'm not sure the last bit you mention regarding reuse of IDs matches what I found in the data after submitting this issue. In data/deft_files/train/t1_biology_2_404.deft, searching for the start offset 26662 would find two entries from the repeated sentence, but both copies contained the T231 span even though the T230 it referenced was in only one copy of the sentence (as you explain, due to the overlapping nature of spans). In the latest files you uploaded, only one copy of the sentence is present; the T228 term is now in that copy, but its matching definition is no longer present because the second copy, which contained it, has gone. In this case it's not so bad, because no annotation references one that is missing, but could that still happen?
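One way to check whether that can still happen is to look for annotations whose referenced tag ID is absent from the same context window. The sketch below is only illustrative: it works on already-parsed records, not on the .deft files directly, and the `Annotation` type, its field names, and the use of `None` for "no reference" are all assumptions.

```python
from typing import Iterable, List, NamedTuple, Optional


class Annotation(NamedTuple):
    tag_id: str             # e.g. "T231"
    root_id: Optional[str]  # tag this annotation refers to, or None (assumed convention)


def find_dangling_references(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Return annotations that refer to a tag ID missing from the same window."""
    annotations = list(annotations)
    known_ids = {a.tag_id for a in annotations}
    return [a for a in annotations if a.root_id is not None and a.root_id not in known_ids]


# A copy of the sentence that holds T231 but not the T230 it refers to.
window = [Annotation("T231", "T230")]
print(find_dangling_references(window))
```

Running this per context window would flag cases like the T231/T230 one described above.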