
Comments (2)

sashaspala commented on June 3, 2024

I have removed the extraneous files you mentioned.

Sentences that are duplicated within the same context window are due to overlapping label spans - you can read more about that in the FAQs here and in a previous issue here.

Sentences that are repeated from previous context windows are typically a result of bold words being found in neighboring sentences. So, if we have 4 sentences, and bold words appear in sentence 2 and sentence 3, we will have a context window structure like this:

SENTENCE 1
SENTENCE 2
SENTENCE 3

SENTENCE 2
SENTENCE 3
SENTENCE 4

This is to make sure that we provide context windows for all sentences where possible, increasing the chances that the annotators are able to find any relevant cross-sentence relationships.
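The windowing described above can be sketched roughly as follows. This is only an illustration, not the corpus's actual preprocessing code: the function name `context_windows`, the `has_bold` flags, and the assumption of exactly one sentence of context on each side are all hypothetical, chosen to reproduce the four-sentence example.

```python
def context_windows(sentences, has_bold, before=1, after=1):
    """Build one window per sentence that contains a bold word.

    Assumption (not confirmed by the corpus code): each window is the
    flagged sentence plus `before`/`after` neighbors, clipped at the
    document boundaries.
    """
    windows = []
    for i, flagged in enumerate(has_bold):
        if not flagged:
            continue
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        windows.append(sentences[start:end])
    return windows


sentences = ["SENTENCE 1", "SENTENCE 2", "SENTENCE 3", "SENTENCE 4"]
has_bold = [False, True, True, False]  # bold words in sentences 2 and 3

windows = context_windows(sentences, has_bold)
for window in windows:
    print("\n".join(window))
    print()
```

Because sentences 2 and 3 each get their own window, both of them appear twice in the output, which is the repetition seen in the text files.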

At annotation time, we treat these cases exactly the same as you see in the text files, so annotators would see duplicated sentences and mark them individually. That does mean that if you see the same sentence repeated in two different context windows, the annotations on each copy will have new tag IDs. AFAIK there should be no tag ID collisions between context windows in the .deft files.


antonyscerri commented on June 3, 2024

Thanks for those other pointers. I thought it may be due to the multiple instances per sentence. I'm not sure the last bit you mention regarding reuse of IDs matches what I found in the data after submitting this issue. In data/deft_files/train/t1_biology_2_404.deft, looking for the start offset 26662 would find two entries from the repeated sentence, but both copies contained the T231 span even though the T230 it referenced was only in one copy of the sentence (as you explain, due to the overlapping nature of spans). In the latest files you uploaded only one copy of the sentence is present, and now the T228 term is in that one copy but the matching definition for it is no longer present, as the second copy which contained it has gone. In this case it's not so bad, because there is no annotation with a reference to one which is not there, but could that still happen?
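One way to check for that situation is to look for annotations whose referenced ID never appears as a tag ID in the same file. The sketch below assumes the tag ID and referenced ID have already been pulled out of each row of a .deft file as a pair of strings; the function name, the sample rows, and the set of "no reference" placeholder values are all hypothetical and would need adjusting to the real files.

```python
def dangling_references(annotations, no_reference=("", "-1", "0")):
    """Return (tag_id, referenced_id) pairs whose target is missing.

    `annotations` is an iterable of (tag_id, referenced_id) string pairs.
    Values listed in `no_reference` are treated as "refers to nothing";
    which placeholders the real files use is an assumption here.
    """
    annotations = list(annotations)
    present = {tag_id for tag_id, _ in annotations}
    missing = {
        (tag_id, ref)
        for tag_id, ref in annotations
        if ref not in no_reference and ref not in present
    }
    return sorted(missing)


# Made-up rows for illustration only, not taken from the corpus.
sample = [
    ("T1", "-1"),  # refers to nothing
    ("T2", "T1"),  # target T1 is present
    ("T3", "T9"),  # target T9 is absent, so this one is dangling
]
missing = dangling_references(sample)
print(missing)
```

An empty result for a file would mean every reference in it resolves to an annotation that is still present.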
