Currently, we feed in the data as one long string that ignores sentence and document boundaries. This probably does not have a large effect in large-scale experiments, but it has three problems/downsides.
- Users will start to ask about it, because it is a strange way to handle data. This will hinder wider adoption of the code in applied settings.
- It limits the use of the code for scrambled data, such as copyrighted news data.
- The results with the PELP model will be slightly worse than with models that handle this correctly, and the size of the effect depends on the size of the segments, i.e. our code would suffer more for tweets than for literature.
The only change we need to make is to handle arbitrary segments of text rather than one long string.
The simplest solution is to cap the context window at the segment edges (see the example below), while still treating the whole dataset as one corpus with regard to negative sampling.
Example:
"The brown dog. \n It jumps over the lazy fox."
CBOW (window size = 1, observations):
p(the | brown)
p(brown | the, dog)
p(dog | brown)
p(it | jumps)
p(jumps | it, over)
p(over | jumps, the)
...
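The pair generation above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the function name `cbow_pairs` and the whitespace tokenization are placeholders. The key point is that the context slice is capped at each segment's edges, so no training pair crosses a sentence or document boundary, while negative sampling (not shown) would still draw from the whole corpus.

```python
def cbow_pairs(segments, window=1):
    # Illustrative sketch: generate CBOW (context, target) pairs
    # per segment, capping the window at segment edges so that
    # no pair spans a sentence/document boundary.
    for segment in segments:
        tokens = segment.split()  # placeholder tokenization
        for i, target in enumerate(tokens):
            lo = max(0, i - window)  # cap at the left edge
            # slicing past the end caps at the right edge automatically
            context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            yield context, target

segments = ["the brown dog", "it jumps over the lazy fox"]
pairs = list(cbow_pairs(segments, window=1))
# first pair: (["brown"], "the")
```

Running this on the example above reproduces the observations listed: the first segment yields three pairs and the second six, with no context word taken from the other segment.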