notai-tech / deepsegment

302 stars · 14 watchers · 57 forks · 83 KB

A sentence segmenter that actually works!

Home Page: http://bpraneeth.com/projects

License: GNU General Public License v3.0

Python 100.00%
Topics: nlp, segmentation, text, text-segmentation, deep-learning, sentence-segmenter, punctuation

deepsegment's People

Contributors: bedapudi6788, darksunium, lovit, nikosfl


deepsegment's Issues

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 5665: character maps to <undefined>

I get this error when I try to initialize a segmenter, as shown below, using the pre-trained model suggested at https://github.com/bedapudi6788/DeepSegment-Models:

import deepsegment

segmenter = deepsegment.DeepSegment(
    config_path='content_creation/ai_pretrained_models/deepsegment_eng_v1/deepsegment_eng_v1/config.json'
)

Error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 5665: character maps to <undefined>

Is there an issue with the pre-trained model, or have I set up something wrong?
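One hedged way to narrow this down, assuming the model files themselves are intact: 'charmap' is the Windows locale default codec, so the error usually points at the encoding used to read the file rather than at the file itself.

path = 'content_creation/ai_pretrained_models/deepsegment_eng_v1/deepsegment_eng_v1/config.json'
open(path).read()                    # may raise the same UnicodeDecodeError on Windows
open(path, encoding='utf-8').read()  # should succeed if config.json is valid UTF-8

If the second read succeeds where the first fails, the model files are fine and the problem is the default encoding used when the file is opened.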

"WARNING:root:Consider using segment_long" when actually using segment_long

Bug description
I am using segment_long to segment a relatively long paragraph, and despite using this specific function I am getting a repeated warning that reads:

WARNING:root:Consider using segment_long for longer sentences.

Snippet which gave this bug

from deepsegment import DeepSegment
segmenter = DeepSegment('en')
...
resegmented = segmenter.segment_long(a_mod)

Specify versions of the following libraries

  1. deepsegment==2.3.1
  2. tensorflow==2.2.0
  3. keras==2.3.1

Expected behavior
No warnings at all, given I'm using the suggested function.


Notes
This warning message seems to bypass warnings.filterwarnings('ignore').
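A note on that last point: the WARNING:root: prefix means the message comes from the logging module, not the warnings module, which is why warnings.filterwarnings has no effect on it. A minimal way to silence it, assuming you are willing to suppress all root-logger warnings:

import logging

# The message is emitted on the root logger, so it is controlled there;
# warnings.filterwarnings() only affects the warnings module.
logging.getLogger().setLevel(logging.ERROR)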

module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'

Versions:

  1. deepsegment 2.3.1
  2. tensorflow 2.4.0
  3. keras 2.3.1

Problem:
When I try to run the basic sentence-segmentation code

from deepsegment import DeepSegment
d = 'this is a sentence this is another sentence'
segmenter = DeepSegment('en')
d_seg = segmenter.segment(d)

I get the following error:
AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'
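For what it's worth, _TensorLike is a private TensorFlow symbol that standalone Keras 2.3.x relies on and that no longer exists in TF 2.4, so this looks like a version mismatch rather than a deepsegment bug. A hedged guard (the 2.4 cutoff is an assumption based on the versions reported in these issues):

import tensorflow as tf

# Fail early with a readable message instead of the opaque AttributeError above.
major, minor = (int(p) for p in tf.__version__.split('.')[:2])
if (major, minor) >= (2, 4):
    raise RuntimeError(
        f"tensorflow {tf.__version__} is known to break standalone keras 2.3.1; "
        "try an older tensorflow (e.g. 2.2.x or 1.15.x) with deepsegment"
    )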

Deepsegment does not segment on custom model

Describe the bug and error messages (if any)
I trained Deepsegment on 1 GB of custom data in Swedish. Training completed successfully, but when I run inference the model does not segment the text.

The code snippet which gave this error*

for line in lines[:3]:
    print(line)
print('Tot: {}'.format(len(lines)))
--------------------------------
Enligt ett pressmeddelande från Anza är Hamilton Acorn Englands ledande producent av professionella måleriverktyg.
Omsättningen är cirka 150 miljoner kronor och företaget har 136 anställda.
Det var ett paket med flera kilo hasch som hittades av tullen på Landvetters flygplats utanför Göteborg.
Tot: 10126568
--------------------------------
x, y = generate_data(lines[10000:], max_sents_per_example=6, n_examples=10000)
vx, vy = generate_data(lines[:10000], max_sents_per_example=6, n_examples=1000)
--------------------------------
100% (10000 of 10000) |##################| Elapsed Time: 0:00:01 Time:  0:00:01
100% (1000 of 1000) |####################| Elapsed Time: 0:00:00 Time:  0:00:00
--------------------------------
train(x, y, vx, vy, epochs=2, batch_size=64, save_folder='./', glove_path='cc.sv.100.vec')
--------------------------------
Epoch 1/2
157/157 [==============================] - 168s 1s/step - loss: 3.6761
 - f1: 80.93
             precision    recall  f1-score   support

       sent       0.97      0.69      0.81      3425

avg / total       0.97      0.69      0.81      3425


Epoch 00001: f1 improved from -inf to 0.80926, saving model to ./checkpoint
Epoch 2/2
157/157 [==============================] - 166s 1s/step - loss: 3.5520
 - f1: 84.49
             precision    recall  f1-score   support

       sent       0.97      0.75      0.84      3425

avg / total       0.97      0.75      0.84      3425


Epoch 00002: f1 improved from 0.80926 to 0.84494, saving model to ./checkpoint
--------------------------------
from deepsegment import DeepSegment
segmenter = DeepSegment(lang_code=None, checkpoint_path='checkpoint', params_path='params', utils_path='utils', tf_serving=False, checkpoint_name=None)
segmenter.segment('under natten har det varit inbrott i ett kontor vid bredåkra kyrka en person gripen misstänkt för inbrottet polisen skriver på sin facebooksida att en av deras hundförare lyckades spåra upp gärningsmannen och det tillgripna godset personen som är i trettiofemårsåldern greps och sitter nu anhållen ingrid elfstråhle p fyra blekinge')
--------------------------------
['under natten har det varit inbrott i ett kontor vid bredåkra kyrka en person gripen misstänkt för inbrottet polisen skriver på sin facebooksida att en av deras hundförare lyckades spåra upp gärningsmannen och det tillgripna godset personen som är i trettiofemårsåldern greps och sitter nu anhållen ingrid elfstråhle p fyra blekinge']

cc.sv.100.vec is the Swedish Facebook fastText 300-d vector file reduced to 100 dimensions.
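One hedged thing to check, an observation rather than a confirmed cause: the training lines shown above are cased and punctuated, while the inference input is fully lowercased. If casing is a boundary cue the model learned, that mismatch alone could explain the silence:

# Hedged diagnostic: feed a cased variant of the same input. If this segments while
# the lowercased original does not, the issue is a train/inference distribution
# mismatch rather than a broken checkpoint.
cased = ('Under natten har det varit inbrott i ett kontor vid Bredåkra kyrka '
         'En person gripen misstänkt för inbrottet')
print(segmenter.segment(cased))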

Specify versions of the following libraries

  1. deepsegment / latest
  2. tensorflow / 1.15.2
  3. keras / 2.3.1

Expected behavior
I expected Deepsegment to segment the text.

Screenshots
Nope

Tensors in list passed to 'values' of 'ConcatV2' Op have types [bool, float32] that don't all match.

I am trying to run the sample code on Google Colab, and I get this error:

Tensors in list passed to 'values' of 'ConcatV2' Op have types [bool, float32] that don't all match.

Here is the shared Colab notebook:

https://colab.research.google.com/drive/16dVsf_4J_HCAuBn_aNZXQ6qFd44h2gEC

Here is the code I tried. The error happens at segmenter = DeepSegment('en'):

from deepsegment import DeepSegment
segmenter = DeepSegment('en')
segmenter.segment('I am Batman i live in gotham')
['I am Batman', 'i live in gotham']
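A hedged guess at the cause: the bool/float32 ConcatV2 mismatch is the kind of failure a Keras-2.3-era CRF layer produces on the TF 2.x that Colab preinstalls. At the time of this issue, pinning the old runtime in the notebook was the usual sidestep (an assumption, not project guidance, and only valid while Colab still offered the 1.x runtime):

# Colab magic, valid only at the top of a notebook cell:
%tensorflow_version 1.x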

setup for train with custom dataset

I want to reproduce your work with my own dataset.

Could you explain the dependencies and requirements? setup.py requires only 'seqtag', and seqtag requires 'numpy', but I need more detailed dependency and configuration information: TensorFlow and the other modules (see the sketch below).

thanks
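Not an official answer, but a hedged environment sketch pieced together from version reports elsewhere on this page:

# Assumed working combination (taken from other issues here, not from the maintainers):
#   deepsegment==2.3.1, tensorflow==1.14.0 (or 1.15.x), keras==2.3.1
# seqtag and numpy are pulled in by deepsegment's setup.py.
from deepsegment import DeepSegment, train, generate_data  # sanity import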

Deepcorrect not working on model trained using Deepsegment.

Hello,

Apologies if the headline isn't to the point.
I used Deepsegment (https://colab.research.google.com/drive/1CjYbdbDHX1UmIyvn7nDW2ClQPnnNeA_m#scrollTo=K9oMoDwwXgQl) to train a language model on my custom data. However, when I use the trained model (HDF5 format) and params (JSON format) and run the code below:

### My pipeline: Czech data -> DeepSegment -> trained model -> DeepCorrect -> punctuated and segmented sentences

from deepcorrect import DeepCorrect                                                                                                                                                                         
DeepCorrect('/home/sagar/.DeepSegment_cs/params', '/home/sagar/.DeepSegment_cs/checkpoint')

I get this error:

UnpicklingError: invalid load key, '{'

As far as I understand, DeepCorrect expects the params file to be a pickle file and not a plain-text JSON file.
Is there anything wrong with my approach?

Thank You
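A quick hedged check of that hypothesis (the path is the reporter's own): pickle's "invalid load key, '{'" means the first byte of the file is '{', i.e. a JSON object, which is consistent with DeepCorrect expecting a pickled params file:

# If the first byte is b'{', the params file is JSON, matching the UnpicklingError.
with open('/home/sagar/.DeepSegment_cs/params', 'rb') as f:
    print('JSON-like' if f.read(1) == b'{' else 'binary (possibly pickle)')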

Getting an error message: AttributeError: '_thread._local' object has no attribute 'value'

Hi

I am using keras version 2.3.1 and tensorflow version 2.2.0

I am using DeepSegment:
stam = 'I am Batman i live in gotham'
testText = segmenter.segment(stam)

and I am getting the following error:

File "C:\PythonTestProjects\FlaskApp\testLanguageExtraction.py", line 223, in testConceptDataExtraction
testText = segmenter.segment(stam)
File "C:\PythonTestProjects\FlaskApp\env\lib\site-packages\deepsegment\deepsegment.py", line 215, in segment
all_tags = DeepSegment.seqtag_model.predict(encoded_sents, batch_size=batch_size)
File "C:\PythonTestProjects\FlaskApp\env\lib\site-packages\keras\engine\training.py", line 1452, in predict
if self._uses_dynamic_learning_phase():
File "C:\PythonTestProjects\FlaskApp\env\lib\site-packages\keras\engine\training.py", line 382, in _uses_dynamic_learning_phase
not isinstance(K.learning_phase(), int))
File "C:\PythonTestProjects\FlaskApp\env\lib\site-packages\keras\backend\tensorflow_backend.py", line 73, in symbolic_fn_wrapper
if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'
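A hedged workaround that matches how this failure usually arises (assumption: Keras 2.3.x keeps its symbolic scope in thread-locals, so predicting from a thread other than the one that built the model fails): funnel construction and every predict call through a single worker thread.

from concurrent.futures import ThreadPoolExecutor

from deepsegment import DeepSegment

# One dedicated thread owns the model; all calls are routed through it so Keras's
# thread-local state is always the one created at construction time.
_worker = ThreadPoolExecutor(max_workers=1)
_segmenter = _worker.submit(DeepSegment, 'en').result()

def segment(text):
    return _worker.submit(_segmenter.segment, text).result()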

Question about abbreviations and collocations

How should I handle abbreviations and collocations the way NLTK does?

When I test the sentence below, deepsegment treats the period in "i. e." as a sentence boundary.

To it Thornton adds to knock into a cocked hat, despite its English sound, and to have an ax to grind. To go for, both in the sense of belligerency and in that of partisanship, is also American, and so is to go through (i. e., to plunder). Of adjectives the list is scarcely less long.
Result:

{
  "nltk_punkt": [
    "To it Thornton adds to knock into a cocked hat, despite its English sound, and to have an ax to grind.",
    "To go for, both in the sense of belligerency and in that of partisanship, is also American, and so is to go through (i. e., to plunder).",
    "Of adjectives the list is scarcely less long."
  ],
  "spacy en_core_web_sm": [
    "To it Thornton adds to knock into a cocked hat, despite its English sound, and to have an ax to grind.",
    "To go for, both in the sense of belligerency and in that of partisanship, is also American, and so is to go through (i. e., to plunder).",
    "Of adjectives the list is scarcely less long."
  ],
  "DeepSegment": [
    "To it Thornton adds to knock into a cocked hat, despite its English sound, and to have an ax to grind.",
    "To go for, both in the sense of belligerency and in that of partisanship, is also American, and so is to go through (i.",
    "e., to plunder).",
    "Of adjectives the list is scarcely less long."
  ]
}
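As far as the docs show, deepsegment exposes no abbreviation list the way NLTK's punkt does, so a hedged preprocessing sketch is probably the pragmatic route (the regex is an illustration, not exhaustive):

import re

# Collapse spaced abbreviations such as "i. e." / "e. g." before segmenting so the
# internal period stops looking like a sentence end.
def normalize_abbreviations(text):
    return re.sub(r'\b([ie])\.\s+([eg])\.', r'\1.\2.', text)

print(normalize_abbreviations('and so is to go through (i. e., to plunder).'))
# -> 'and so is to go through (i.e., to plunder).'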

One sentence inside another

Hello,

I'm going to expand a custom training set with new examples. These two sentences are in the set:

  1. Good morning I need help to fix this issue
  2. I need help to fix this issue

In this case, I need DeepSegment to keep the boundaries of the longer one (1). Considering that both examples are in the training set, I wonder if the final result would be

['Good morning', 'I need help to fix this issue']

I would like to avoid this while keeping both examples in the training set. Would this be possible after training the model?

Thanks

Add the ability to change the dimension

I only have 300-d GloVe vectors.
Please change the code (at the two locations linked in the original issue) to:

def train(x, y, vx, vy, epochs, batch_size, save_folder, glove_path, dim_glove=100):

and

model = seqtag_keras.Sequence(embeddings=embeddings, word_embedding_dim=dim_glove, word_lstm_size=dim_glove)

Thanks
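With the proposed default, existing calls would stay unchanged and 300-d vectors could pass straight through, e.g. (the file name is a placeholder):

train(x, y, vx, vy, epochs=2, batch_size=64, save_folder='./',
      glove_path='glove.300d.vec', dim_glove=300)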

Deep Segment model available for download is read only

Hi,

The DeepSegment model available for download is read-only; while trying to load it, I get this error:

PermissionError: [Errno 13] Permission denied: 'models/deepsegment_eng'

I tried removing the read-only flag in the folder's security tab, and a couple of other options, but nothing works. Could you please look into it?

Thanks
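One hedged thing to rule out first (an assumption, not a diagnosis): on several platforms, PermissionError: [Errno 13] is also what you get when a directory path is opened where a file is expected, which would make this a path problem rather than a true permissions one:

import os

path = 'models/deepsegment_eng'
# If this prints True plus a file listing, the loader may need a file inside the
# folder (e.g. the checkpoint itself) rather than the folder path.
print(os.path.isdir(path), os.listdir(path) if os.path.isdir(path) else '-')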

deadline exceeded

grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1585986715.466000000","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2261,"referenced_errors":[{"created":"@1585986715.466000000","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":224,"referenced_errors":[{"created":"@1585986715.466000000","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}]}]}"
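Hedged reading: DEADLINE_EXCEEDED means the gRPC call timed out before TF Serving answered, so the first check is whether the serving endpoint is reachable at all (host and port are assumptions; 8500 is TF Serving's default gRPC port):

import grpc

channel = grpc.insecure_channel('localhost:8500')
# Raises grpc.FutureTimeoutError if the server never becomes reachable.
grpc.channel_ready_future(channel).result(timeout=5)
print('serving endpoint reachable')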

Train with a different language

Hello,
I have followed the guide to train a custom DeepSegment that you published on colab.research.google.com
I want to try it on Spanish, so I used a custom corpus of sentences in Spanish and a custom vector file from this corpus.
After training, the checkpoint is saved and then I use DeepSegment as follows:
segmenter = DeepSegment(checkpoint_name='checkpoint')
I didn't specify "es", since Spanish is not supported, so I think "english" is loaded into the object by default.
Is this the right way to train a different language?
Thanks!

Training other language

I have a question: is it possible to train this for other languages as well? Can you explain how to do this?

Error with Keras 2.4.2 and 2.4.3

Hello!

I got an "automatic" update of Keras from 2.3.1 to 2.4.3 (I'm using pipenv, and Keras was set to "*" in the Pipfile).
My code won't run anymore with Keras 2.4.3; I'm getting the error below at runtime.
No issue with 2.3.1 though.

(I don't mind using Keras 2.3.1, so this isn't really a blocking issue; I just thought you'd want to know.)

Traceback (most recent call last):
  File "xxx/main.py", line 23, in <module>
    segmenter = DeepSegment('fr')
  ...
    m = K.slice(states[3], [0, t], [-1, 2])
AttributeError: module 'keras.backend' has no attribute 'slice'
I also have an error with Keras 2.4.2:
Exception occured: in user code:

  /xxx/lib/python3.8/site-packages/tensorflow/python/keras/utils/tf_utils.py:140 get_reachable_from_inputs
    raise TypeError('Expected Operation, Variable, or Tensor, got ' + str(x))
TypeError: Expected Operation, Variable, or Tensor, got 0

Installation failed with pip3

I get this error when installing via pip in a Python 3 environment:

Using cached https://files.pythonhosted.org/packages/93/4b/979db9e44be09f71e85c9c8cfc42f258adfb7d93ce01deed2788b2948919/logging-0.4.9.6.tar.gz
Complete output from command python setup.py egg_info:
running egg_info
creating pip-egg-info/logging.egg-info
writing pip-egg-info/logging.egg-info/PKG-INFO
writing dependency_links to pip-egg-info/logging.egg-info/dependency_links.txt
writing top-level names to pip-egg-info/logging.egg-info/top_level.txt
writing manifest file 'pip-egg-info/logging.egg-info/SOURCES.txt'
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-sb7rjvuw/logging/setup.py", line 13, in
packages = ["logging"],
File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/egg_info.py", line 278, in run
self.find_sources()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/egg_info.py", line 293, in find_sources
mm.run()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/egg_info.py", line 524, in run
self.add_defaults()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/egg_info.py", line 560, in add_defaults
sdist.add_defaults(self)
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/py36compat.py", line 34, in add_defaults
self._add_defaults_python()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/sdist.py", line 127, in _add_defaults_python
build_py = self.get_finalized_command('build_py')
File "/usr/lib/python3.5/distutils/cmd.py", line 298, in get_finalized_command
cmd_obj = self.distribution.get_command_obj(command, create)
File "/usr/lib/python3.5/distutils/dist.py", line 846, in get_command_obj
klass = self.get_command_class(command)
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/dist.py", line 635, in get_command_class
self.cmdclass[command] = cmdclass = ep.load()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/pkg_resources/init.py", line 2229, in load
return self.resolve()
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/pkg_resources/init.py", line 2235, in resolve
module = import(self.module_name, fromlist=['name'], level=0)
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/command/build_py.py", line 15, in
from setuptools.lib2to3_ex import Mixin2to3
File "/home/frejus/.virtualenvs/labforsims_neonat/local/lib/python3.5/site-packages/setuptools/lib2to3_ex.py", line 12, in
from lib2to3.refactor import RefactoringTool, get_fixers_from_package
File "/usr/lib/python3.5/lib2to3/refactor.py", line 19, in
import logging
File "/tmp/pip-install-sb7rjvuw/logging/logging/init.py", line 618
raise NotImplementedError, 'emit must be implemented '
^
SyntaxError: invalid syntax

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-sb7rjvuw/logging/

Support for Turkish

The work done is very valuable, thank you very much. Do you plan to support Turkish?
Or is it possible for me to train for Turkish with custom data?

load checkpoint locally

I'm trying to load the checkpoint locally using this:
seg = DeepSegment(lang_code=None, checkpoint_path="path/to/checkpoint")

but it does not work; it always downloads the checkpoint anyway.

What is the best way to do that?
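A hedged workaround drawn from elsewhere on this page (the Swedish custom-model issue constructs the object this way): supplying lang_code=None together with all three local artifact paths, not just the checkpoint, appears to be what keeps DeepSegment from downloading.

from deepsegment import DeepSegment

# Mirrors the constructor call shown in the custom-model issue above; the paths
# are the caller's local files.
seg = DeepSegment(lang_code=None, checkpoint_path='path/to/checkpoint',
                  params_path='path/to/params', utils_path='path/to/utils')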

ModuleNotFoundError

Describe the bug and error messages (if any)
Can't import the module

The code snippet which gave this error*


Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from deepsegment import DeepSegment
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'deepsegment'

Specify versions of the following libraries

  1. deepsegment - possibly 1.0.1 (this is the message I get when installing with pip install --upgrade deepsegment: Successfully installed deepsegment-1.0.1 logging-0.4.9.6 numpy-1.16.6 seqtag-1.0.3)
  2. tensorflow/ tensorflow-gpu - 1.14.0
  3. keras - 2.3.1

Expected behavior
I expected the module to import so that I could follow the examples in the README.

Thank you!
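A hedged first check for this kind of ModuleNotFoundError (an assumption: pip and the Anaconda interpreter shown above may be two different Pythons):

import sys

# If this path is not the interpreter pip installed into, the import fails even
# though "pip install" reported success. "python -m pip install deepsegment"
# pins the install to this exact interpreter.
print(sys.executable)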

Issue with other statement

I am testing my use case, where I have lots of raw data with or without punctuation. A couple of examples are below: the first has no punctuation, and the second has sentences separated by a comma, with spelling mistakes.

When I run these statements through the example code, I get no split at all.

It is likely your code does not expect raw statements like these. I have no control over the incoming raw data, and I receive thousands of statements of this type, so manually fixing each one is impossible. Is there anything I can do to make this work?

DRIVE WITH EXCESS BLOOD ALCOHOL SPEED-EXCEED BY 15 KM/HR OR LESS FAIL TO SIGNAL DRIVE UNDER DISQUALIFICATION

Breach re 17/12/06 DRIVE WHILST AUTHORISATION SUSPENDED (2 CHARGES), EX PRESC CONC 3HRS-BREATH-DRIVER VECHICLE (3 CHARGES), DRIVE WHILST DISQUALIFIED
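A hedged experiment, not a fix: the pre-trained English model presumably saw ordinarily-cased prose, so all-caps legal text is far outside its training distribution. Comparing the raw and lowercased forms at least shows whether casing is what throws it off:

from deepsegment import DeepSegment

segmenter = DeepSegment('en')
text = ('DRIVE WITH EXCESS BLOOD ALCOHOL SPEED-EXCEED BY 15 KM/HR OR LESS '
        'FAIL TO SIGNAL DRIVE UNDER DISQUALIFICATION')
print(segmenter.segment(text))          # as received
print(segmenter.segment(text.lower()))  # casing removed; purely diagnostic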


tensorflow serving

Hey!
I want to use this with TensorFlow Serving but I'm having trouble understanding how to generate an HDF5 file for it. Any ideas how I can do that?

Thanks
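For what it's worth, TF Serving consumes a SavedModel directory rather than an HDF5 file. A hedged export sketch for the TF 1.x stack used around this repo (the seqtag_model attribute name is taken from tracebacks elsewhere on this page; the export path is arbitrary):

import tensorflow as tf
from keras import backend as K

from deepsegment import DeepSegment

segmenter = DeepSegment('en')
model = DeepSegment.seqtag_model  # underlying Keras model, per this repo's tracebacks

# TF 1.x-style SavedModel export; TF Serving expects a numeric version subfolder.
tf.saved_model.simple_save(
    K.get_session(),
    'export/deepsegment/1',
    inputs={t.name: t for t in model.inputs},
    outputs={t.name: t for t in model.outputs},
)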

h5py.h5f.open OSError: Unable to open file (truncated file: eof = 16777216, sblock->base_addr = 0, stored_eof = 80443280)

Describe the bug and error messages (if any)
I've tried to run:

from pathlib import Path
from deepsegment import DeepSegment

segmenter = DeepSegment('en')
text = Path('data/bandt.txt').read_text()
tokens = segmenter.segment(text)

In both Python 3.7 and Python 3.6, I'm getting this same error:

Traceback (most recent call last):
  File "ds.py", line 4, in <module>
    segmenter = DeepSegment('en')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/deepsegment/deepsegment.py", line 140, in __init__
    DeepSegment.seqtag_model.load_weights(checkpoint_path)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/engine/saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/engine/network.py", line 1221, in load_weights
    with h5py.File(filepath, mode='r') as f:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 16777216, sblock->base_addr = 0, stored_eof = 80443280)

The code snippet which gave this error*

Specify versions of the following libraries

  1. deepsegment: 2.3.0
  2. tensorflow: 1.13.1 / tensorflow-cpu: 1.15.0 / tensorflow-gpu (not installed)
  3. keras: 2.3.1

Expected behavior
For the sentences in text to be tokenized
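Hedged reading of the error itself: eof = 16777216 (exactly 16 MB) against stored_eof = 80443280 suggests the weights file on disk is a truncated download, not a code problem. Deleting the cached model folder forces a clean re-download (the folder name is an assumption patterned on the ~/.DeepSegment_cs path mentioned in another issue here):

import os, shutil

cache = os.path.expanduser('~/.DeepSegment_en')  # assumed cache location
if os.path.isdir(cache):
    shutil.rmtree(cache)  # the next DeepSegment('en') should re-download the weights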

segment_long, n_window optimal length

Describe the bug and error messages (if any)
This is not a bug but a request for clarification on the optimal n_window for a given string length. Is there a factor that should be considered when setting n_window? I am feeding in phrases of various lengths to be broken up, so I could see setting this dynamically.

Thank you
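Not project guidance, just a hedged heuristic: since segment_long works over a window of the input, scaling n_window with the input length keeps each window near the sizes the model handles well. A sketch (the 10-words-per-window factor is an arbitrary starting point):

def segment_long_dynamic(segmenter, text, words_per_window=10):
    # Rough proxy: one window slot per ~10 whitespace-separated words.
    n = max(1, len(text.split()) // words_per_window)
    return segmenter.segment_long(text, n_window=n)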

Training loss: nan

Describe the bug and error messages (if any)
Training loss is nan after training for half an epoch. Is there a problem with my params?

Also, a batch_size of 32 is as high as I can go; anything above that OOMs.

These params consume all 32 GB of RAM and some swap. This may be related to the warning in the log: "UserWarning: Converting sparse IndexedSlices to a dense Tensor with 120289200 elements. This may consume a large amount of memory."

Hardware
Intel i7 32GB RAM + 2080Ti 11GB

The code snippet which gave this error*
Training code:

from deepsegment import train, generate_data

import unicodedata
import re
from tqdm import tqdm

lines = ...Reading all lines from 1GB text file...

print('ok')

x, y = generate_data(lines[10000:], max_sents_per_example=6, n_examples=1000000)
vx, vy = generate_data(lines[:10000], max_sents_per_example=6, n_examples=100000)

train(x, y, vx, vy, epochs=15, batch_size=32, save_folder='./', glove_path='cc.sv.100.vec')

Log:

Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0611 11:19:13.705538 139626754037568 deepsegment.py:22] Tensorflow serving is not installed. Cannot be used with tesnorflow serving docker images.
W0611 11:19:13.705653 139626754037568 deepsegment.py:23] Run pip install tensorflow-serving-api==1.12.0 if you want to use with tf serving.
W0611 11:19:13.706187 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/deepsegment/train.py:9: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0611 11:19:13.706330 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/deepsegment/train.py:11: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-06-11 11:19:13.719930: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-11 11:19:13.726062: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-06-11 11:19:13.850172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:13.851079: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4d4fe50 executing computations on platform CUDA. Devices:
2020-06-11 11:19:13.851095: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 11:19:13.863673: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4008000000 Hz
2020-06-11 11:19:13.864927: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a1f2d0 executing computations on platform Host. Devices:
2020-06-11 11:19:13.864978: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-06-11 11:19:13.865398: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:13.866497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
2020-06-11 11:19:13.869341: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:19:13.906978: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-11 11:19:13.925570: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-11 11:19:13.931635: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-11 11:19:13.972803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-06-11 11:19:13.999081: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-06-11 11:19:14.073585: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:19:14.073906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.075861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.077537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-06-11 11:19:14.078231: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:19:14.081715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 11:19:14.081795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-06-11 11:19:14.081834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-06-11 11:19:14.082829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.084705: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.086398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10309 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10126601/10126601 [02:57<00:00, 57064.71it/s]
ok

100% (1000000 of 1000000) |############################################################################################################################################################################| Elapsed Time: 0:01:23 Time:  0:01:23

100% (100000 of 100000) |##############################################################################################################################################################################| Elapsed Time: 0:00:06 Time:  0:00:06
2020-06-11 11:29:57.768082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.768525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
2020-06-11 11:29:57.768556: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:29:57.768567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-11 11:29:57.768577: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-11 11:29:57.768587: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-11 11:29:57.768596: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-06-11 11:29:57.768605: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-06-11 11:29:57.768615: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:29:57.768665: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769086: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-06-11 11:29:57.769490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 11:29:57.769497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-06-11 11:29:57.769502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-06-11 11:29:57.769566: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.770382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10309 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
W0611 11:30:00.458791 139626754037568 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3794: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:90: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 120289200 elements. This may consume a large amount of memory.
  num_elements)
W0611 11:30:08.525202 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/15
2020-06-11 11:30:10.113686: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

    1/31250 [..............................] - ETA: 19:39:54 - loss: 3.1761
    2/31250 [..............................] - ETA: 11:03:17 - loss: 3.0586
    3/31250 [..............................] - ETA: 8:08:36 - loss: 3.3094
    4/31250 [..............................] - ETA: 6:34:41 - loss: 3.4420
    5/31250 [..............................] - ETA: 5:38:11 - loss: 3.3396
    6/31250 [..............................] - ETA: 5:00:34 - loss: 3.1736
    7/31250 [..............................] - ETA: 4:38:08 - loss: 3.1815
    8/31250 [..............................] - ETA: 4:19:09 - loss: 3.2167
    9/31250 [..............................] - ETA: 4:06:03 - loss: 3.2543
   <Snipped>
15548/31250 [=============>................] - ETA: 1:08:47 - loss: 3.2829
15549/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15550/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15551/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15552/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2827
15553/31250 [=============>................] - ETA: 1:08:45 - loss: 3.2828
15554/31250 [=============>................] - ETA: 1:08:45 - loss: nan
15555/31250 [=============>................] - ETA: 1:08:45 - loss: nan
15556/31250 [=============>................] - ETA: 1:08:45 - loss: nan
15557/31250 [=============>................] - ETA: 1:08:44 - loss: nan
15558/31250 [=============>................] - ETA: 1:08:44 - loss: nan
15559/31250 [=============>................] - ETA: 1:08:44 - loss: nan
<Snipped>
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8239s 264ms/step - loss: nan
2020-06-11 13:47:28.196777: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418


Epoch 00001: f1 improved from -inf to 0.00000, saving model to ./checkpoint
Epoch 2/15

    1/31250 [..............................] - ETA: 2:40:55 - loss: nan
    2/31250 [..............................] - ETA: 2:54:27 - loss: nan
    3/31250 [..............................] - ETA: 2:37:31 - loss: nan
    4/31250 [..............................] - ETA: 2:35:34 - loss: nan
    5/31250 [..............................] - ETA: 2:28:59 - loss: nan
    6/31250 [..............................] - ETA: 2:30:40 - loss: nan
    7/31250 [..............................] - ETA: 2:28:10 - loss: nan
    8/31250 [..............................] - ETA: 2:27:05 - loss: nan
    9/31250 [..............................] - ETA: 2:27:35 - loss: nan
   <Snipped>
31244/31250 [============================>.] - ETA: 1s - loss: nan
31245/31250 [============================>.] - ETA: 1s - loss: nan
31246/31250 [============================>.] - ETA: 1s - loss: nan
31247/31250 [============================>.] - ETA: 0s - loss: nan
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418


Epoch 00002: f1 did not improve from 0.00000
Epoch 3/15

    1/31250 [..............................] - ETA: 2:16:22 - loss: nan
    2/31250 [..............................] - ETA: 2:17:49 - loss: nan
    3/31250 [..............................] - ETA: 2:28:30 - loss: nan
    4/31250 [..............................] - ETA: 2:54:15 - loss: nan
    5/31250 [..............................] - ETA: 2:46:57 - loss: nan
    6/31250 [..............................] - ETA: 2:53:14 - loss: nan
    7/31250 [..............................] - ETA: 2:44:51 - loss: nan
    8/31250 [..............................] - ETA: 2:38:22 - loss: nan
    9/31250 [..............................] - ETA: 2:34:27 - loss: nan
   <Snipped>
31246/31250 [============================>.] - ETA: 1s - loss: nan
31247/31250 [============================>.] - ETA: 0s - loss: nan
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418


Epoch 00003: f1 did not improve from 0.00000
Epoch 4/15

    1/31250 [..............................] - ETA: 1:57:41 - loss: nan
    2/31250 [..............................] - ETA: 2:05:43 - loss: nan
    3/31250 [..............................] - ETA: 2:09:32 - loss: nan
    4/31250 [..............................] - ETA: 2:10:20 - loss: nan
    <Snipped>
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418


Epoch 00004: f1 did not improve from 0.00000

Exited here....

Specify versions of the following libraries

  1. deepsegment / latest
  2. tensorflow-gpu / 1.14.0
  3. keras / 2.3.1
  4. docker image / tensorflow/tensorflow:1.14.0-gpu-py3-jupyter

Expected behavior
I was hoping to get the model to improve and not have nan loss.

Screenshots
Nope
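One hedged data-side check before touching the optimizer (an assumption: with ten million lines, a handful of empty or pathological examples can be enough to push a CRF loss to nan mid-epoch):

# Drop empty and extreme lines before generate_data; the length threshold is arbitrary.
lines = [l.strip() for l in lines if l.strip() and len(l.split()) <= 100]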

not segmenting properly.

from deepsegment import DeepSegment
segmenter = DeepSegment('en')
segmenter.segment('I am Batman i live in gotham')

It gives me this: ['I am Batman', 'i live in gotham']

but when I give it

from deepsegment import DeepSegment
segmenter = DeepSegment('en')
segmenter.segment('hello what is your name how are you ')

then it gives me
'hello what is your name how are you'

What would you advise?

Facing error when running sample script

I was trying to set up the library to test its accuracy for my use case.
I did the package installs and created a script containing just the content from the README.
When executing it, I am facing the following error:

Traceback (most recent call last):
  File "correct.py", line 4, in <module>
    print(segmenter.segment('I am Batman i live in gotham'))
  File "/home/algante/tmp/deepspeech-venv/lib/python3.6/site-packages/deepsegment/deepsegment.py", line 149, in segment
    all_tags = get_tf_serving_respone(DeepSegment.seqtag_model, encoded_sents)
  File "/home/algante/tmp/deepspeech-venv/lib/python3.6/site-packages/deepsegment/deepsegment.py", line 73, in get_tf_serving_respone
    response = stub.Predict(request, 20)
  File "/home/algante/tmp/deepspeech-venv/lib/python3.6/site-packages/grpc/_channel.py", line 604, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/algante/tmp/deepspeech-venv/lib/python3.6/site-packages/grpc/_channel.py", line 506, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1570094221.600498100","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3876,"referenced_errors":[{"created":"@1570094221.600481900","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":395,"grpc_status":14}]}"

Can someone help with this?
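Hedged reading of the traceback: the call goes through get_tf_serving_respone, so this instance is running in TF-Serving mode against a server that isn't there (status UNAVAILABLE). Constructing the segmenter with tf_serving disabled, as the custom-model issue on this page does, keeps inference in-process:

from deepsegment import DeepSegment

# tf_serving=False mirrors the constructor shown in the custom-model issue above.
segmenter = DeepSegment('en', tf_serving=False)
print(segmenter.segment('I am Batman i live in gotham'))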

Support the new version of Keras 2.4

Will there be an update to the library to support the new version of Keras 2.4?

I have encountered a problem: when I create a microservice based on deepsegment and Flask, I get an error caused by the fact that Keras 2.3.1 does not work with multithreading.

The error that occurs looks like this:

File "/keras/engine/training.py", line 1452, in predict
    if self._uses_dynamic_learning_phase():
  File "/keras/engine/training.py", line 382, in _uses_dynamic_learning_phase
    not isinstance(K.learning_phase(), int))
  File "/keras/backend/tensorflow_backend.py", line 73, in symbolic_fn_wrapper
    if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'

Unable to load "eng_fra_ita" with DeepSegment

Hello, I am trying to load the eng_fra_ita model with the current v2 branch, but it is not made available. Is there a way to load it?

P.S. I also downloaded the zip file from the DeepSegment-Models repo, but there are no "utils" and "params" files in it.

Thank you

