koomri / text-segmentation


Implementation of the paper: Text Segmentation as a Supervised Learning Task

Python 99.89% Shell 0.11%

dataset deep-learning machine-learning neural-network nlp text-segmentation

text-segmentation's Issues

Why is softmax not used in training?

In the train function, softmax activation is not applied but it is applied in the validate function. Why is that so? Thanks.
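One plausible explanation, assuming the training loop feeds raw scores into a cross-entropy loss (PyTorch's `nn.CrossEntropyLoss` combines log-softmax and negative log-likelihood internally), is that the softmax is already folded into the loss, while validation applies it explicitly to obtain probabilities for thresholding. The pure-Python sketch below only illustrates that equivalence; the function names are hypothetical and it is not code from this repository.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed directly on raw scores via log-sum-exp.

    This mirrors what a loss that accepts logits does internally,
    so no separate softmax layer is needed before it.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


logits = [2.0, -1.0]  # hypothetical scores for "no boundary" / "boundary"
target = 0

loss_from_logits = cross_entropy_from_logits(logits, target)
loss_from_probs = -math.log(softmax(logits)[target])

# Both routes give the same loss, up to floating-point error.
print(loss_from_logits, loss_from_probs)
```

If that assumption holds, adding a softmax before the loss during training would apply it twice, whereas at validation time the explicit softmax is only there to produce probabilities.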

Redundant padding method in models?

Why are both pad(self, s, max_length) and pad_document(self, d, max_document_length) needed in from_presentation.py, max_sentence_embedding.py and single_lstm.py? It seems to me that the two methods perform the same task and differ only in their variable names.

exceptions file

I want to know where the source of the exceptions file is.

Questions

Hey,

I was curious, how effective is this tool at segmenting text? Can it also tokenise words?

Can we download a pre-trained model or do we have to train it ourselves? If so, why is that, just out of curiosity?

Is this one of the best such tools available at the moment, or is there a standard tool or machine learning library that offers a method like this?

Thanks very much.

Error in loading the pretrained model

While trying to load the model given here, I'm facing the following problem:

.....
.....
  File "/home/sid/text-segmentation/evaluate.py", line 13, in load_model
    model = torch.load(f)
  File "/home/sid/miniconda2/envs/textseg/lib/python2.7/site-packages/torch/serialization.py", line 261, in load
    return _load(f, map_location, pickle_module)
  File "/home/sid/miniconda2/envs/textseg/lib/python2.7/site-packages/torch/serialization.py", line 399, in _load
    magic_number = pickle_module.load(f)
cPickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

Which PyTorch version to use on Windows?

I don't have a Linux OS, and I'm stuck setting up the environment.
I saw that there is no Windows build of torch=0.3.0. If I install torch=0.4.1 (cu80/torch-0.4.1-cp35-cp35m-win_amd64.whl or cpu/torch-0.4.1-cp35-cp35m-win_amd64.whl) or the latest version instead,
will it work for me?

segeval version?

This might not be a big deal.
I trained the model on my own with a few small changes here and there, mainly to the wiki data structure and the Python version.
It has gone smoothly so far.
I use Python 3.7 with segeval (2.0.11),
and the change is:
seg.pk --> seg.window.pk.pk
seg.window_diff --> seg.window.windowdiff.window_diff

Thanks for open-sourcing this.
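For anyone who needs the same code to run against both layouts, a small lookup helper can try the attribute paths quoted above in order. This is a minimal sketch: `resolve_attr` is a hypothetical helper, the two paths are taken from this issue rather than checked against segeval itself, and the demo uses stand-in objects so it runs without segeval installed.

```python
from types import SimpleNamespace


def resolve_attr(root, *dotted_paths):
    """Return the first attribute found among several dotted paths under root."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this layout is not present, try the next path
        return obj
    raise AttributeError("none of %r found" % (dotted_paths,))


# Stand-ins for the two layouts described in the issue (not the real library).
new_layout = SimpleNamespace(
    window=SimpleNamespace(pk=SimpleNamespace(pk=lambda: "new-style pk"))
)
old_layout = SimpleNamespace(pk=lambda: "old-style pk")

print(resolve_attr(new_layout, "window.pk.pk", "pk")())
print(resolve_attr(old_layout, "window.pk.pk", "pk")())
```

With the real package, the call would look like `pk = resolve_attr(seg, "window.pk.pk", "pk")` after `import segeval as seg`, assuming the relevant submodules are reachable as attributes once the package is imported.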

Pretrained Model

Is there a pretrained model we could use directly?

Error loading a dataset for training

Error in prepare_tensor in evaluate.py

ValueError                                Traceback (most recent call last)
<ipython-input-13-78897f9821a1> in <module>()
----> 1 cutoffs = evaluate.predict_cutoffs(sentences, model, word2vec)

/home/sid/text-segmentation/evaluate.pyc in predict_cutoffs(sentences, model, word2vec)
     40 def predict_cutoffs(sentences, model, word2vec):
     41     word2vec_sentences = text_to_word2vec(sentences, word2vec)
---> 42     tensored_data = prepare_tensor(word2vec_sentences)
     43     batched_tensored_data = []
     44     batched_tensored_data.append(tensored_data)

/home/sid/text-segmentation/evaluate.pyc in prepare_tensor(sentences)
     23     tensored_data = []
     24     for sentence in sentences:
---> 25         tensored_data.append(utils.maybe_cuda(torch.FloatTensor(np.concatenate(sentence))))
     26 
     27     return tensored_data

ValueError: need at least one array to concatenate

Link to a gist containing the error is here
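`np.concatenate` raises exactly this error when it is given an empty sequence, so one plausible cause (an assumption, not confirmed in this thread) is an input sentence for which no word vectors were produced, for example an empty string. A minimal guard, sketched in plain Python with a hypothetical helper name, filters such sentences out before tensors are built and remembers which indices were kept:

```python
def drop_empty_sentences(word2vec_sentences):
    """Split out sentences that have no word vectors.

    word2vec_sentences: a list with one entry per sentence, each entry being
    a sequence of word vectors (assumed shape of the text_to_word2vec output).
    Returns (kept_sentences, kept_indices) so that predictions can be mapped
    back to positions in the original input.
    """
    kept, kept_indices = [], []
    for index, vectors in enumerate(word2vec_sentences):
        if len(vectors) > 0:
            kept.append(vectors)
            kept_indices.append(index)
    return kept, kept_indices


# Toy data: the middle "sentence" produced no vectors at all.
toy = [[[0.1, 0.2]], [], [[0.3, 0.4], [0.5, 0.6]]]
kept, kept_indices = drop_empty_sentences(toy)
print(kept_indices)
```

Calling such a guard between `text_to_word2vec` and `prepare_tensor` would avoid the exception, at the cost of having to re-align the predicted cutoffs with the original sentence indices.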

cities and elements dataset

Can you share the processed cities and elements dataset?

Question: how long does one epoch take?

Can't reproduce Pk on Choi's dataset

Hi,

I can't reproduce the reported Pk of 26.26 on Choi's dataset with the pre-trained model.
When I run the default evaluation script

python test_accuracy.py --cuda --model model_gpu.t7

I get a Pk of only 0.3667. What could be wrong here?

Running with threshold: 0.4
Loading word2vec ellapsed: 55.1196949482 seconds
running on Choi
...
2018-10-29 13:41:24,821 - INFO - Finished testing.
2018-10-29 13:41:24,821 - INFO - Average loss: 0.0
2018-10-29 13:41:24,821 - INFO - Average accuracy: 0.8987517337031901
2018-10-29 13:41:24,821 - INFO - Pk: 0.3667.
2018-10-29 13:41:24,821 - INFO - F1: 0.3511.
Seconds to execute to whole flow: 93.9438998699
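When comparing these numbers it helps to keep the scale in mind: the log reports Pk as a fraction between 0 and 1, so 0.3667 would correspond to 36.67 if the reported 26.26 is a percentage (an assumption), and lower Pk is better. For reference, the sketch below is a plain-Python version of the Pk metric over segment sizes. It is not the evaluation code used by this repository, and the default window of half the mean reference segment length is one common convention that may differ from the script's.

```python
def segment_ids(sizes):
    """Expand segment sizes, e.g. (3, 3), into one segment id per sentence."""
    ids = []
    for seg_index, size in enumerate(sizes):
        ids.extend([seg_index] * size)
    return ids


def pk(ref_sizes, hyp_sizes, k=None):
    """Fraction of sliding windows on which reference and hypothesis disagree."""
    ref = segment_ids(ref_sizes)
    hyp = segment_ids(hyp_sizes)
    if len(ref) != len(hyp):
        raise ValueError("both segmentations must cover the same sentences")
    n = len(ref)
    if k is None:
        # Convention assumed here: half the mean reference segment length.
        k = max(1, int(round(n / (2.0 * len(ref_sizes)))))
    if k >= n:
        raise ValueError("window size k must be smaller than the document")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / float(n - k)


# Two reference segments of three sentences each, versus a hypothesis
# that predicts no boundary at all.
print(pk((3, 3), (6,), k=2))
```

Because Pk depends on the window size and on how boundaries are thresholded, differences in either (for example the `threshold: 0.4` shown in the log) can shift the score noticeably.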

Window size parameter

text-segmentation/choiloader.py

Line 26 in 874d6ef

window_size = 1

Does this mean we take every sentence separately? If so, shouldn't we add some context sentences, for example window = 3?
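To make the question concrete, here is a sketch of what a sentence window could look like: with a window of 1 each sentence stands alone, and with a window of 3 each sentence is grouped with one neighbour on each side. The helper `sentence_windows` and its centring rule are assumptions for illustration; how choiloader.py actually uses `window_size` is not shown in this issue.

```python
import math


def sentence_windows(sentences, window_size=1):
    """Group each sentence with its neighbours, clipped at document edges."""
    before = int(math.ceil((window_size - 1) / 2.0))  # sentences taken before
    after = window_size - before - 1                  # sentences taken after
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        windows.append(sentences[start:end])
    return windows


doc = ["a", "b", "c", "d"]
print(sentence_windows(doc, window_size=1))
print(sentence_windows(doc, window_size=3))
```

Under this scheme a larger window gives each position more context but also makes neighbouring inputs overlap, which is a trade-off worth measuring rather than assuming.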
