google / sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
License: Apache License 2.0
When running pip install sentencepiece
under Python 3.6.1, I got this error:
building '_sentencepiece' extension
creating build/temp.linux-x86_64-3.6
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/yuchang/miniconda3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.linux-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/usr/local/include
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Load(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3305:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_LoadOrDie(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3347:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
I googled "PyString_AsStringAndSize"; someone said it means the code does not support Python 3, so I changed the Python version to 2.7.12, and then I got:
Collecting sentencepiece
Using cached sentencepiece-0.0.0.tar.gz
Complete output from command python setup.py egg_info:
Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-joX_PJ/sentencepiece/setup.py", line 28, in <module>
cmd('pkg-config sentencepiece --cflags'),
File "/tmp/pip-build-joX_PJ/sentencepiece/setup.py", line 14, in cmd
return os.popen(line).readlines()[0][:-1].split()
IndexError: list index out of range
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-joX_PJ/sentencepiece/
My OS is Ubuntu 16.04; Python 3.6.1 was installed via miniconda3, and Python 2.7.12 is the system install.
Any advice or guidance would be greatly appreciated.
pip install sentencepiece
Collecting sentencepiece
Using cached https://files.pythonhosted.org/packages/ef/ba/17c0c4f8ccc746b2182c7e3c8292be0bdb37fbadeaf467d2f69565160764/sentencepiece-0.0.7.tar.gz
Building wheels for collected packages: sentencepiece
Running setup.py bdist_wheel for sentencepiece ... error
Complete output from command /Users/vostryakov/projects/3env/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/s4/l8hg1ch969d9p96z4bwfllmc0000gp/T/pip-install-gni78bgz/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/s4/l8hg1ch969d9p96z4bwfllmc0000gp/T/pip-wheel-60ftqmg6 --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.11-x86_64-3.6
copying sentencepiece.py -> build/lib.macosx-10.11-x86_64-3.6
running build_ext
building '_sentencepiece' extension
creating build/temp.macosx-10.11-x86_64-3.6
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.11-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/usr/local/include
sentencepiece_wrap.cxx:3123:10: fatal error: 'sentencepiece_trainer.h' file not found
#include <sentencepiece_trainer.h>
^
1 error generated.
error: command 'clang' failed with exit status 1
Failed building wheel for sentencepiece
I get the same error when I try to build from source.
I am using sentencepiece in Python and have an issue with user-defined symbols.
For example, I trained with
"spm.SentencePieceTrainer.Train('--input=Stream.tsv --model_prefix=m.debug --vocab_size=1000 --input_sentence_size=100000000 --hard_vocab_limit=false --user_defined_symbols=,<sep>,,,')"
and use
>>> sp.EncodeAsIds('<s>')
[9, 1]
>>> sp.EncodeAsIds('<pad>')
[9, 5]
>>> sp.EncodeAsIds('a<pad>')
[25, 5]
it always adds "9" (which is "") to the id list, except in the "a<pad>" case. Is this expected? Is there any way to remove the "" id?
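If that extra id is the dummy-whitespace piece that SentencePiece prepends to the input (controlled by the add_dummy_prefix option), one workaround is to filter it out after encoding. A minimal sketch, assuming id 9 is that piece as in the output above; strip_leading_whitespace_piece is a hypothetical helper, not part of the sentencepiece API:

```python
# Assumption: id 9 is the piece for the leading "▁" that
# add_dummy_prefix inserts before every input (taken from the
# EncodeAsIds output shown above).
WHITESPACE_PIECE_ID = 9

def strip_leading_whitespace_piece(ids, ws_id=WHITESPACE_PIECE_ID):
    """Drop a leading dummy-whitespace piece if present."""
    return ids[1:] if ids and ids[0] == ws_id else ids

print(strip_leading_whitespace_piece([9, 1]))   # [1]
print(strip_leading_whitespace_piece([25, 5]))  # unchanged: [25, 5]
```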
I just want a good model for English and don't want to train it myself. Are there any official models available?
First of all, thanks for sharing such a useful tool! I really like this library.
Second, I'm working on a non-translation task where I think I want to work with words rather than word parts. Are there any settings in sentencepiece that favor longer word units?
Hi,
When building the Python module of sentencepiece for Python 3.5.3, I saw the following build error "‘PyString_AsStringAndSize’ was not declared in this scope".
In the same environment, I successfully built the Python module of sentencepiece for Python 2.7.13, thus, I believe that my environment meets build dependency of the Python module of sentencepiece.
Could you give me some advice on how to avoid this problem?
$ pip3 --no-cache-dir install --user sentencepiece
Collecting sentencepiece
Downloading sentencepiece-0.0.0.tar.gz (183kB)
100% |████████████████████████████████| 184kB 6.2MB/s
Installing collected packages: sentencepiece
Running setup.py install for sentencepiece ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-yqcu2vo3/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqqfb0ui-record/install-record.txt --single-version-externally-managed --compile --user --prefix=:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
copying sentencepiece.py -> build/lib.linux-x86_64-3.5
running build_ext
building '_sentencepiece' extension
creating build/temp.linux-x86_64-3.5
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.5-MLq5fN/python3.5-3.5.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -c sentencepiece_wrap.cxx -o build/temp.linux-x86_64-3.5/sentencepiece_wrap.o -std=c++11 -g -O2 -fdebug-prefix-map=/home/tsuchiya/work/pkg-sentencepiece=. -fstack-protector-strong -Wformat -Werror=format-security
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Load(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3305:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_LoadOrDie(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3347:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_SetEncodeExtraOptions(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3389:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_SetDecodeExtraOptions(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3431:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_PieceToId(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3496:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_IdToPiece(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3543:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Encode(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3665:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx:3677:95: error: ‘PyString_FromStringAndSize’ was not declared in this scope
PyList_SetItem(resultobj, i, PyString_FromStringAndSize(result[i].data(), result[i].size()));
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_EncodeAsPieces(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3712:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx:3724:95: error: ‘PyString_FromStringAndSize’ was not declared in this scope
PyList_SetItem(resultobj, i, PyString_FromStringAndSize(result[i].data(), result[i].size()));
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_EncodeAsIds(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3759:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Decode(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3811:54: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(o, &str, &str_size);
^
sentencepiece_wrap.cxx:3826:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_DecodePieces(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3866:54: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(o, &str, &str_size);
^
sentencepiece_wrap.cxx:3881:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_DecodeIds(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3933:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
^
sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor___getitem__(PyObject*, PyObject*)’:
sentencepiece_wrap.cxx:3990:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
PyString_AsStringAndSize(obj1, &str, &str_size);
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-yqcu2vo3/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqqfb0ui-record/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-build-yqcu2vo3/sentencepiece/
Hi,
I have a question.
Say I have a training corpus train.de / train.en (4,500,000 lines each) and do the following:
I concatenate the two files to build the model with
vocab_size=32000
and end up with a vocab file of 32,000 lines; so far so good.
I then encode train.en and train.de with the model.
But if I look at the vocabulary of each tokenized file, I get more than 32,000 distinct tokens in both (39,345 and 38,299).
What's wrong?
My understanding is that when the total line count is below 10M, the full corpus is taken into account, so I should end up with at most 32,000.
thanks.
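For reference, the distinct-piece count of a tokenized file can be double-checked with a few lines of Python. A sketch only; piece_vocabulary and the sample lines are illustrative, not from the report:

```python
from collections import Counter

def piece_vocabulary(lines):
    """Tally distinct whitespace-separated pieces in tokenized output."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Stand-in for lines of a tokenized file such as train.en.tok.
sample = ["▁he llo", "▁wo r ld", "▁he"]
print(len(piece_vocabulary(sample)))  # 5 distinct pieces
```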
I installed protobuf 3.4 after the apt-get version failed.
./autogen.sh
Running aclocal ...
Running autoheader...
Running libtoolize ..
Running automake ...
Running autoconf ...
./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking for ar... ar
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether g++ accepts -g... (cached) yes
checking dependency style of g++... (cached) gcc3
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking whether gcc understands -c and -o together... (cached) yes
checking dependency style of gcc... (cached) gcc3
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for PROTOBUF... yes
checking nfkc-compile option... no
checking gcov option... no
configure: pkgconfig directory is ${libdir}/pkgconfig
checking for unistd.h... (cached) yes
checking for size_t... yes
checking for working strtod... yes
checking for memchr... yes
checking for memset... yes
checking that generated files are newer than configure... done
make
make all-recursive
make[1]: Entering directory '/home/dori/src/sentencepiece'
Making all in src
make[2]: Entering directory '/home/dori/src/sentencepiece/src'
make all-am
make[3]: Entering directory '/home/dori/src/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I.. -std=c++11 -Wall -O3 -I/usr/include/google/protobuf -D_THREAD_SAFE -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
^
sentencepiece_model.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
^
sentencepiece_model.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
^
Makefile:916: recipe for target 'builder.o' failed
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory '/home/dori/src/sentencepiece/src'
Makefile:686: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/dori/src/sentencepiece/src'
Makefile:476: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/dori/src/sentencepiece'
Makefile:385: recipe for target 'all' failed
make: *** [all] Error 2
Hello, when running the following on a file with over 5 million sentences:
spm.SentencePieceTrainer.Train('--input=... --model_prefix=prefix --vocab_size=50000')
I'm getting this error:
trainer_interface.cc(235) LOG(INFO) Done! 5175523 sentences are loaded
unigram_model_trainer.cc(117) LOG(INFO) Using 2000000 sentences for making seed sentencepieces
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Is this a memory issue? How much RAM do I need for this operation?
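One workaround sketch while waiting for an answer (an assumption, not an official recommendation): subsample the corpus before training so spm_train loads fewer sentences. Reservoir sampling keeps memory flat regardless of corpus size; sample_corpus is a hypothetical helper:

```python
import random

def sample_corpus(lines, k, seed=0):
    """Reservoir-sample up to k lines from an iterable of lines,
    using O(k) memory even for very large inputs."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Replace an existing element with decreasing probability
            # so every line has an equal chance of being kept.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = line
    return sample

lines = (f"sentence {i}" for i in range(1_000_000))
print(len(sample_corpus(lines, 100_000)))  # 100000
```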
On my machine, building sentencepiece fails because the path to google/protobuf/descriptor.proto is generated incorrectly.
Executing the following commands
% cd /path/to/sentencepiece
% ./autogen.sh
% ./configure
% make
% make check
results in the following failure:
Making check in src
make[1]: Entering directory '/home/leonard/software/sentencepiece/src'
make check-am
make[2]: Entering directory '/home/leonard/software/sentencepiece/src'
protoc --cpp_out=. .//usr/include/google/protobuf/descriptor.proto
.//usr/include/google/protobuf/descriptor.proto: No such file or directory
make[2]: *** [Makefile:1350: /usr/include/google/protobuf/descriptor.pb.h] Error 1
make[2]: Leaving directory '/home/leonard/software/sentencepiece/src'
make[1]: *** [Makefile:1206: check] Error 2
make[1]: Leaving directory '/home/leonard/software/sentencepiece/src'
make: *** [Makefile:470: check-recursive] Error 1
Note that /usr/include/google/protobuf/descriptor.proto
exists, but that the path .//usr/include/google/protobuf/descriptor.proto
does not.
I'm using the following software versions:
➜ ~ aclocal --version
aclocal (GNU automake) 1.15
➜ ~ autoheader --version
autoheader (GNU Autoconf) 2.69
➜ ~ automake --version
automake (GNU automake) 1.15
➜ ~ autoconf --version
autoconf (GNU Autoconf) 2.69
➜ ~ protoc --version
libprotoc 3.4.0
Google's official Transformer implementation comes with a subtokenizer that appears to work very similarly to SentencePiece: it breaks words into subwords, marks tokens with "_" instead of using whitespace, and trains the vocab in an unsupervised fashion toward a fixed target vocab size (32k in the code). I was wondering whether it is actually SentencePiece and, if not, whether you have some context on the differences.
https://github.com/tensorflow/models/tree/master/official/transformer
https://github.com/tensorflow/models/blob/master/official/transformer/utils/tokenizer.py
Thanks a lot in advance.
Hi,
The Encode() function in the SentencePieceProcessor module doesn't take unicode strings as inputs. Could anyone please suggest an alternative?
P.S. I am using the Python wrapper of sentencepiece.
Thanks in advance.
We are running into issues using sentencepiece and the text encoder with Python 2.7.6 and TensorFlow 1.8:
module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/module.py", line 105, in init
self._spec = as_module_spec(spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/module.py", line 31, in as_module_spec
return native_module.load_module_spec(spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/native_module.py", line 99, in load_module_spec
path = compressed_module_resolver.get_default().get_module_path(path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 385, in get_module_path
return self._get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 467, in _get_module_path
return resolver.get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 385, in get_module_path
return self._get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/compressed_module_resolver.py", line 105, in _get_module_path
self._lock_file_timeout_sec())
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 313, in atomic_download
download_fn(handle, tmp_dir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/compressed_module_resolver.py", line 101, in download
response = url_opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] _ssl.c:510: EOF occurred in violation of protocol>
Is the module not supported in Python 2.7? It seems to work fine in Python 3.
In README.md, you recommend installing libprotobuf-c++
if libprotobuf9v5
cannot be found, but libprotobuf-c++
does not exist in the Ubuntu package repositories.
This is a feature request rather than an issue.
I want to reserve small word IDs for special purposes, so I set the unk ID to 3:
const uint32 ModelInterface::kUnkID = 3;
The code built; however, make check failed.
I have seen the following in the help text of spm_train
about this parameter:
--add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true
Is there an explanation of why the default behavior adds a prefix whitespace? I am just wondering about the intention or advantage of doing this.
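A rough sketch of the idea behind the flag (my own simplified mimic, not SentencePiece's actual normalizer): spaces become the meta symbol "▁", and add_dummy_prefix puts one more "▁" at the start of the text, so a word gets the same "▁"-prefixed form whether it begins the sentence or not:

```python
def normalize(text, add_dummy_prefix=True):
    """Mimic (roughly) SentencePiece's whitespace handling: spaces
    become the meta symbol '▁' (U+2581); with add_dummy_prefix, one
    extra '▁' marks the start of the text."""
    s = text.replace(" ", "\u2581")
    if add_dummy_prefix:
        s = "\u2581" + s
    return s

# "world" is normalized to "▁world" both sentence-initially and
# mid-sentence, so the model can learn one piece instead of two
# context-dependent ones.
print(normalize("world"))        # ▁world
print(normalize("hello world"))  # ▁hello▁world
```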
After I train the model, and load the model...
Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sentencepiece as spm
>>>
>>> sp = spm.SentencePieceProcessor()
>>>
>>> sp.Load('m.model')
terminate called after throwing an instance of 'Darts::Details::Exception'
what(): /sentencepiece/third_party/darts_clone/darts.h:1143: exception: failed to insert key: zero-length key
Aborted (core dumped)
What should I do?
Thank you.
I installed sentencepiece successfully on Ubuntu 14.04, 64-bit.
But when I try to train a model on this simple input.txt file, a core dump happens during "Saving model".
Here is the full log:
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "input.txt"
model_prefix: "m_a"
model_type: UNIGRAM
vocab_size: 8000
character_coverage: 0.9995
input_sentence_size: 10000000
mining_sentence_size: 2000000
training_sentence_size: 10000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true
trainer_interface.cc(109) LOG(INFO) Loading corpus: input.txt
trainer_interface.cc(126) LOG(INFO) Loading: ▁Kết▁quả▁xổ▁số▁điện▁toán▁Vietlott▁ngày▁6/2/2017 size=0
trainer_interface.cc(148) LOG(INFO) Loaded 45 sentences
trainer_interface.cc(166) LOG(INFO) all chars count=14425
trainer_interface.cc(173) LOG(INFO) Done: 99.9584% characters are covered.
trainer_interface.cc(181) LOG(INFO) alphabet size=134
trainer_interface.cc(211) LOG(INFO) Done! 45 sentences are loaded
unigram_model_trainer.cc(121) LOG(INFO) Using 45 sentences for making seed sentencepieces
unigram_model_trainer.cc(149) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(153) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(204) LOG(INFO) Initialized 830 seed sentencepieces
trainer_interface.cc(215) LOG(INFO) Tokenizing input sentences with whitespace: 45
trainer_interface.cc(224) LOG(INFO) Done! 787
unigram_model_trainer.cc(513) LOG(INFO) Using 787 sentences for EM training
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=606 obj=12.9723 num_tokens=1859 num_tokens/piece=3.06766
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=530 obj=12.0251 num_tokens=1862 num_tokens/piece=3.51321
trainer_interface.cc(284) LOG(INFO) Saving model: m_a.model
trainer_interface.cc(275) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())]
Aborted (core dumped)
However, if I set an appropriate value for vocab_size, it works:
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=200
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "input.txt"
model_prefix: "m_a"
model_type: UNIGRAM
vocab_size: 200
character_coverage: 0.9995
input_sentence_size: 10000000
mining_sentence_size: 2000000
training_sentence_size: 10000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true
trainer_interface.cc(109) LOG(INFO) Loading corpus: input.txt
trainer_interface.cc(126) LOG(INFO) Loading: ▁Kết▁quả▁xổ▁số▁điện▁toán▁Vietlott▁ngày▁6/2/2017 size=0
trainer_interface.cc(148) LOG(INFO) Loaded 45 sentences
trainer_interface.cc(166) LOG(INFO) all chars count=14425
trainer_interface.cc(173) LOG(INFO) Done: 99.9584% characters are covered.
trainer_interface.cc(181) LOG(INFO) alphabet size=134
trainer_interface.cc(211) LOG(INFO) Done! 45 sentences are loaded
unigram_model_trainer.cc(121) LOG(INFO) Using 45 sentences for making seed sentencepieces
unigram_model_trainer.cc(149) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(153) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(204) LOG(INFO) Initialized 830 seed sentencepieces
trainer_interface.cc(215) LOG(INFO) Tokenizing input sentences with whitespace: 45
trainer_interface.cc(224) LOG(INFO) Done! 787
unigram_model_trainer.cc(513) LOG(INFO) Using 787 sentences for EM training
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=606 obj=12.9723 num_tokens=1859 num_tokens/piece=3.06766
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=530 obj=12.0251 num_tokens=1862 num_tokens/piece=3.51321
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=397 obj=12.2622 num_tokens=1975 num_tokens/piece=4.97481
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=397 obj=12.1277 num_tokens=1975 num_tokens/piece=4.97481
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=297 obj=12.9592 num_tokens=2182 num_tokens/piece=7.3468
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=297 obj=12.7479 num_tokens=2182 num_tokens/piece=7.3468
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=222 obj=14.1593 num_tokens=2467 num_tokens/piece=11.1126
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=222 obj=13.8631 num_tokens=2467 num_tokens/piece=11.1126
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=220 obj=14.0721 num_tokens=2483 num_tokens/piece=11.2864
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=220 obj=14.0661 num_tokens=2493 num_tokens/piece=11.3318
trainer_interface.cc(284) LOG(INFO) Saving model: m_a.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m_a.vocab
I tried other values of vocab_size with my input data:
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=400
-> OK
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=500
-> OK
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=600
-> Core Dumped
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=547
-> OK
$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=548
-> Core Dumped
How can I choose an appropriate value for vocab_size?
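One crude heuristic to narrow the search before bisecting --vocab_size by hand (an assumption on my part, not the trainer's real rule): the achievable vocabulary is bounded by what the data can support, so the alphabet size plus the number of distinct whitespace tokens gives a rough ceiling to start from. crude_vocab_ceiling is a hypothetical helper:

```python
def crude_vocab_ceiling(lines):
    """Rough upper bound on a feasible vocab_size: distinct characters
    plus distinct whitespace-separated tokens. A heuristic only; the
    trainer's true limit depends on the surviving seed sentencepieces."""
    chars, words = set(), set()
    for line in lines:
        words.update(line.split())
        chars.update(line.replace(" ", ""))
    return len(chars) + len(words)

# Illustrative stand-in for lines of input.txt.
sample = ["xo so dien toan", "ket qua xo so"]
print(crude_vocab_ceiling(sample))  # 18
```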
Output:
$ brew install protobuf autoconf automake libtool
Warning: protobuf 3.3.2 is already installed
Warning: autoconf 2.69 is already installed
Warning: automake 1.15.1 is already installed
Warning: libtool 2.4.6_1 is already installed
$ ./autogen.sh
Running aclocal ...
Running autoheader...
Running libtoolize ..
Running automake ...
Running autoconf ...
$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... x86_64-apple-darwin16.6.0
checking host system type... x86_64-apple-darwin16.6.0
checking how to print strings... printf
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /Library/Developer/CommandLineTools/usr/bin/ld
checking if the linker (/Library/Developer/CommandLineTools/usr/bin/ld) is GNU ld... no
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 196608
checking how to convert x86_64-apple-darwin16.6.0 file names to x86_64-apple-darwin16.6.0 format... func_convert_file_noop
checking how to convert x86_64-apple-darwin16.6.0 file names to toolchain format... func_convert_file_noop
checking for /Library/Developer/CommandLineTools/usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking for ar... ar
checking for archiver @FILE support... no
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... no
checking if : is a manifest tool... no
checking for dsymutil... dsymutil
checking for nmedit... nmedit
checking for lipo... lipo
checking for otool... otool
checking for otool64... no
checking for -single_module linker flag... yes
checking for -exported_symbols_list linker flag... yes
checking for -force_load linker flag... yes
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... yes
checking for gcc option to produce PIC... -fno-common -DPIC
checking if gcc PIC flag -fno-common -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... darwin16.6.0 dyld
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /Library/Developer/CommandLineTools/usr/bin/ld
checking if the linker (/Library/Developer/CommandLineTools/usr/bin/ld) is GNU ld... no
checking whether the g++ linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking for g++ option to produce PIC... -fno-common -DPIC
checking if g++ PIC flag -fno-common -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... darwin16.6.0 dyld
checking how to hardcode library paths into programs... immediate
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether g++ accepts -g... (cached) yes
checking dependency style of g++... (cached) gcc3
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking whether gcc understands -c and -o together... (cached) yes
checking dependency style of gcc... (cached) gcc3
./configure: line 17067: syntax error near unexpected token `PROTOBUF,'
./configure: line 17067: `PKG_CHECK_MODULES(PROTOBUF, protobuf >= 2.4.0)'
Brew protobuf info:
Michaels-MacBook-Pro:sentencepiece petrochuk$ brew info protobuf
protobuf: stable 3.3.2 (bottled), HEAD
Protocol buffers (Google's data interchange format)
https://github.com/google/protobuf/
/usr/local/Cellar/protobuf/3.3.2 (260 files, 16.1MB) *
Poured from bottle on 2017-08-02 at 17:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/protobuf.rb
==> Dependencies
Build: autoconf ✔, automake ✔, libtool ✔
==> Requirements
Optional: python3 ✔
==> Options
--with-python3
Build with python3 support
--with-test
Run build-time check
--without-python
Build without python support
--HEAD
Install HEAD version
==> Caveats
Editor support and examples have been installed to:
/usr/local/opt/protobuf/share/doc/protobuf
Python modules have been installed and Homebrew's site-packages is not
in your Python sys.path, so you will not be able to import the modules
this formula installed. If you plan to develop with these modules,
please run:
mkdir -p /Users/petrochuk/Library/Python/2.7/lib/python/site-packages
echo 'import site; site.addsitedir("/usr/local/lib/python2.7/site-packages")' >> /Users/petrochuk/Library/Python/2.7/lib/python/site-packages/homebrew.pth
Thanks for this great tool. For context, I'm working with the Python wrapper for BPE tokenization, and I would like to write my tokenized input to files line by line.
Using the default normalization settings, it looks like I can't get complete (character-by-character) reversibility for some special tokens. If I turn normalization off by setting --normalization_rule_name=identity, I get all sorts of odd tokenizations.
### Excerpt from a tokenization script that tries to encode and write line by line to a file
input_line = input_line.strip()
# EncodeAsPieces returns bytes; convert to str before joining
tokenized_line = [x.decode('utf-8') for x in spp.EncodeAsPieces(input_line)]
encoded_output_line = ' '.join(tokenized_line) + '\n'
decoded_input_line = spp.DecodePieces([x.encode() for x in encoded_output_line.split()])
if input_line != decoded_input_line:
    print("input_line: ", input_line)
    print("decoded_input_line: ", decoded_input_line)
outfile.write(encoded_output_line)
This yields things like the following:
input_line: Ich erkläre die am Donnerstag, dem 25. September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)
decoded_input_line: Ich erkläre die am Donnerstag, dem 25.September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)
See how the space between "25. September" was removed? Looks like spaces are getting removed in these examples as well (this is happening to many, many sentences):
input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40 % der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.
decoded_input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40% der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.
Here is one where the space is removed between words (rather than just around punctuation):
input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays.
decoded_input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentielquant à moi-même, mon engagement politique et mon pays.
I was hoping to get BPE subword tokenizations from sentencepiece that are completely reversible, so that I can get back the exact original input string. But I'd also like to be able to cache files and write the BPE-encoded inputs to a file. Is this possible, either with a different sentencepiece model or with a different method of writing to the file?
I have a corpus of .xz files from the Common Crawl. I don't have enough disk space to unzip all the files and concatenate them into a single file. Is there any way to resume sentencepiece training on a new corpus of text?
I'd like to loop through each of my files, unzip one to a temp file, feed it to the trainer, delete the temp file, and then move on to the next one.
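As far as I know spm_train cannot resume on a new corpus, but since it caps the number of sentences it loads anyway (--input_sentence_size), one workaround is to stream-decompress each .xz shard with Python's lzma module and reservoir-sample lines into a single training file that fits on disk. A sketch with made-up paths and sizes:

```python
# Workaround sketch, not a sentencepiece feature: stream each .xz shard
# without fully unpacking it, and reservoir-sample up to max_lines lines
# into one on-disk training file for spm_train.
import lzma
import os
import random
import tempfile

def sample_corpus(xz_paths, out_path, max_lines, seed=0):
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for path in xz_paths:
        with lzma.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                seen += 1
                if len(reservoir) < max_lines:
                    reservoir.append(line)
                else:
                    # classic reservoir sampling: keep each line with
                    # probability max_lines / seen
                    j = rng.randrange(seen)
                    if j < max_lines:
                        reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(reservoir)
    return seen

# tiny demo: write one .xz shard, then sample from it
tmp = tempfile.mkdtemp()
xz_path = os.path.join(tmp, "shard0.xz")
with lzma.open(xz_path, "wt", encoding="utf-8") as f:
    for i in range(100):
        f.write(f"sentence {i}\n")
out_path = os.path.join(tmp, "train.txt")
total = sample_corpus([xz_path], out_path, max_lines=10)
print(total, sum(1 for _ in open(out_path, encoding="utf-8")))
```

The sampled file can then be passed to spm_train as a single --input; only one decompressed shard's worth of memory is needed at a time.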
I tried installing with reference to "Build and Install SentencePiece", but make does not pass.
The environment is as follows:
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
$ uname -r
4.4.0-64-generic
The result of make is as follows.
$ sudo ldconfig
$ make
make all-recursive
make[1]: Entering directory '/home/ornew/sentencepiece'
Making all in src
make[2]: Entering directory '/home/ornew/sentencepiece/src'
make all-am
make[3]: Entering directory '/home/ornew/sentencepiece/src'
/bin/bash ../libtool --tag=CXX --mode=link g++ -std=c++11 -Wall -O3 -pthread -o spm_encode spm_encode_main.o libsentencepiece.la -lprotobuf -pthread -lpthread
libtool: link: g++ -std=c++11 -Wall -O3 -pthread -o .libs/spm_encode spm_encode_main.o -pthread ./.libs/libsentencepiece.so -lprotobuf -lpthread -pthread
spm_encode_main.o: In function `std::_Function_handler<void (std::string const&), main::{lambda(std::string const&)#3}>::_M_invoke(std::_Any_data const&, std::string const&)':
spm_encode_main.cc:(.text+0x1df): undefined reference to `google::protobuf::Message::Utf8DebugString() const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::empty_string_'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::ReadString(google::protobuf::io::CodedInputStream*, std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*, std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::DescriptorPool::FindFileByName(std::string const&) const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteString(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::Message::InitializationErrorString() const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::StringTypeHandlerBase::Delete(std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::MessageFactory::InternalRegisterGeneratedFile(char const*, void (*)(std::string const&))'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteStringMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::StringTypeHandlerBase::New()'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteStringWithSizeToArray(std::string const&, unsigned char*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::Message::GetTypeName() const'
collect2: error: ld returned 1 exit status
Makefile:834: recipe for target 'spm_encode' failed
make[3]: *** [spm_encode] Error 1
make[3]: Leaving directory '/home/ornew/sentencepiece/src'
Makefile:678: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/ornew/sentencepiece/src'
Makefile:418: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/ornew/sentencepiece'
Makefile:350: recipe for target 'all' failed
make: *** [all] Error 2
It seems that there are no packages named libprotobuf-c++ or protocolbuffer.
$ sudo apt-get upgrade -y
$ sudo apt-get update
$ sudo apt-get install libprotobuf-c++ protocolbuffer
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'libprotobuf-c-dev' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c0-dev' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c1' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c1-dbg' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c-dev' instead of 'libprotobuf-c0-dev'
E: Unable to locate package protocolbuffer
Although ProtocolBuffer seems to be installed...
$ dpkg -l | grep protobuf
ii libmirprotobuf3:amd64 0.21.0+16.04.20160330-0ubuntu1 amd64 Display server for Ubuntu - RPC definitions
ii libprotobuf-c-dev 1.2.1-1 amd64 Protocol Buffers C static library and headers (protobuf-c)
ii libprotobuf-c1 1.2.1-1 amd64 Protocol Buffers C shared library (protobuf-c)
ii libprotobuf-c1-dbg 1.2.1-1 amd64 Protocol Buffers C shared library debug symbols (protobuf-c)
ii libprotobuf-dev:amd64 2.6.1-1.3 amd64 protocol buffers C++ library (development files)
ii libprotobuf-java 2.6.1-1.3 all Java bindings for protocol buffers
ii libprotobuf-lite9v5:amd64 2.6.1-1.3 amd64 protocol buffers C++ library (lite version)
ii libprotobuf9v5:amd64 2.6.1-1.3 amd64 protocol buffers C++ library
ii protobuf-c-compiler 1.2.1-1 amd64 Protocol Buffers C compiler (protobuf-c)
ii protobuf-compiler 2.6.1-1.3 amd64 compiler for protocol buffer definition files
What should I do?
Thank you.
Hi,
First of all, I have read this closed issue about the model-saving error, but I am pretty sure that my corpus is big enough (it contains 36,890,548 lines, and I would like to generate a 32k vocab).
Can you help me solve this issue? (Ubuntu 16.04)
This is the command I ran:
sudo spm_train --user_defined_symbols=city --input=corpus_max_t1.en --model_prefix=spm.en --vocab_size=32000 --model_type=bpe --input_sentence_size=37000000
This is the end of the output:
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=370 size=31700 all=805279 active=40873 piece=▁Mystery
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=370 min_freq=94
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=370 size=31720 all=805278 active=40262 piece=▁multiplied
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=369 size=31740 all=805577 active=40561 piece=▁Agric
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=369 size=31760 all=805579 active=40563 piece=▁realtor
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31780 all=805662 active=40646 piece=▁eel
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31800 all=805899 active=40883 piece=▁Arturo
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=368 min_freq=93
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31820 all=805940 active=40336 piece=▁Carlisle
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31840 all=806037 active=40433 piece=15.
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31860 all=806749 active=41145 piece=▁101.
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31880 all=806955 active=41351 piece=▁shabby
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=366 size=31900 all=807144 active=41540 piece=▁TMZ
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=366 min_freq=93
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=366 size=31920 all=807426 active=40638 piece=ylamine
trainer_interface.cc(314) LOG(INFO) Saving model: spm.en.model
trainer_interface.cc(278) [dup.insert(piece).second] city is already defined
Aborted (core dumped)
To maximize likelihood, there are cases where subword tokens that are seen after a space (meaning the start of a "word") do not get the special underscore because they also appear in the middle of a character combination somewhere else.
Is it possible to suppress this behavior? Meaning, we don't want to have an isolated ▁ as part of the generated vocab list.
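Until there is trainer support for this, a post-processing sketch that merges an isolated ▁ into the piece that follows it; note this only cleans the segmentation you write out, not the trained vocab, and it is a workaround of my own rather than a sentencepiece option:

```python
# Merge a bare word-boundary marker piece into its successor, so the
# emitted segmentation never contains an isolated "\u2581" token.
def merge_isolated_boundary(pieces, marker="\u2581"):
    out = []
    i = 0
    while i < len(pieces):
        if pieces[i] == marker and i + 1 < len(pieces):
            out.append(marker + pieces[i + 1])
            i += 2
        else:
            # a trailing isolated marker has nothing to attach to
            out.append(pieces[i])
            i += 1
    return out

print(merge_isolated_boundary(["\u2581", "foo", "\u2581bar", "\u2581"]))
# ['▁foo', '▁bar', '▁']
```

Decoding the merged pieces yields the same surface string, since "▁" + "foo" and ["▁", "foo"] concatenate identically.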
Commit 9d82cbd breaks the build of the python wrapper for Python 3. With Python 2, it compiles without problem.
This is the error obtained for python setup.py build:
python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
copying sentencepiece.py -> build/lib.linux-x86_64-3.5
running build_ext
building '_sentencepiece' extension
creating build/temp.linux-x86_64-3.5
Traceback (most recent call last):
File "setup.py", line 47, in <module>
test_suite = 'sentencepiece_test.suite')
File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/noe/env/nmt/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 75, in run
_build_ext.run(self)
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
self.build_extensions()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
self.build_extension(ext)
File "/home/noe/env/nmt/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 532, in build_extension
depends=ext.depends)
File "/usr/lib/python3.5/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python3.5/distutils/unixccompiler.py", line 118, in _compile
extra_postargs)
File "/usr/lib/python3.5/distutils/ccompiler.py", line 909, in spawn
spawn(cmd, dry_run=self.dry_run)
File "/usr/lib/python3.5/distutils/spawn.py", line 36, in spawn
_spawn_posix(cmd, search_path, dry_run=dry_run)
File "/usr/lib/python3.5/distutils/spawn.py", line 89, in _spawn_posix
log.info(' '.join(cmd))
TypeError: sequence item 22: expected str instance, bytes found
Reverting back to commit 6b4daf1, everything compiles fine again.
I need to build sentencepiece with -std=c++11, and I set the environment variables shown below.
export CC=/usr/local/gcc-5.4.0/bin/gcc
export CXX=/usr/local/gcc-5.4.0/bin/g++
export CXXFLAGS="$CXXFLAGS -std=c++11"
I can see from the output that the compiler changed to the above gcc-5.4.0 (my default is gcc-4.8.5):
checking if /usr/local/gcc-5.4.0/bin/gcc supports -fno-rtti -fno-exceptions... no
checking for /usr/local/gcc-5.4.0/bin/gcc option to produce PIC... -fPIC -DPIC
checking if /usr/local/gcc-5.4.0/bin/gcc PIC flag -fPIC -DPIC works... yes
checking if /usr/local/gcc-5.4.0/bin/gcc static flag -static works... no
but it does not build with -std=c++11.
I even added -std=c++11 manually to CXXFLAGS in the configure file and also in the Makefile, but it still does not build with -std=c++11.
What's sentencepiece's story for Windows support? Are there any future plans to support Windows?
Thank you.
Thanks for the awesome tool. I am working on multilingual text data, specifically a mix of Chinese and English, where English words are used between Chinese characters without any space delimiter, like 有hockey羽毛球欖球籃球足球. I don't have a lot of data like this. So I was wondering: will the tool work on such data if I feed in both Chinese-only and English-only text as input? If not, any insights on how this can be handled? Thanks a lot for your help in advance.
I have a 50M-sentence corpus to train on. I'd like to know the difference between the following parameters.
I set them to 50M, 10M, 5M, and 50M respectively (5x the defaults) and got a crash like issue #4:
CHECK(!pieces.empty()) failed on serialize.
The vocab size I set was 32768.
--input_sentence_size (maximum size of sentences the trainer loads) type: int32 default: 10000000
--mining_sentence_size (maximum size of sentences to make seed sentence piece) type: int32 default: 2000000
--seed_sentencepiece_size (the size of seed sentencepieces) type: int32 default: 1000000
--training_sentence_size (maximum size of sentences to train sentence pieces) type: int32 default: 10000000
Recently, a fix for Python 3 compatibility of the python wrapper was committed. However, the version of the python wrapper uploaded to PyPI is from 2017-08-28 and does not contain the aforementioned recent fixes. Could you upload the newest version? (From the PyPI page I guess that @taku910 is the owner of the package.)
I see you have added a TensorFlow op in your recent master.
Does this mean we can use a SentencePiece op in a TensorFlow graph?
I wish to build a libtensorflow_inference.so file with the SentencePiece op integrated, so that I can make inferences with a single TensorFlow graph.
Can you explain how to use the recently merged TensorFlow op in this way?
Thanks in advance,
Hello! I'm having some trouble installing the Python bindings on Mac OS, and just thought I'd mention it here in case anyone had similar trouble. This is within an Anaconda environment, Python 3.
Move into directory and install:
$ cd /path/to/sentencepiece
$ ./autogen.sh
$ ./configure
$ make
$ make install
Success. Try to install Python bindings:
$ python setup.py build
Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Failed to find sentencepiece pkgconfig
Fix the path.
$ export PKG_CONFIG_PATH=`pwd`/..
And try again:
$ python setup.py build
running build
running build_py
running build_ext
building '_sentencepiece' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.7-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/Users/neubig/usr/include
In file included from sentencepiece_wrap.cxx:3124:
/Users/neubig/usr/include/sentencepiece_processor.h:141:8: error: no template named 'unique_ptr' in namespace 'std'
std::unique_ptr<Rep> rep_;
~~~~~^
/Users/neubig/usr/include/sentencepiece_processor.h:181:8: error: no template named 'unique_ptr' in namespace 'std'
std::unique_ptr<std::string> rep_;
~~~~~^
...
The strange thing is that when I run the command on its own by copy-pasting, things seem to work:
$ gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.7-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/Users/neubig/usr/include
I'm not sure what would lead to the difference, but I'm stuck here...
As of now, it is possible to specify user-defined symbols to bypass segmentation of some sequences.
It would be great if we could pass a "pattern" for these symbols, especially when we have plenty of placeholders.
For instance, the pattern could be '(((*)))', where the * matches any kind of string.
Cheers.
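In the meantime, a workaround sketch: protect every match of the placeholder pattern with a single user-defined symbol before encoding, and restore the originals afterwards. The pattern and the <ph> symbol below are illustrative assumptions; <ph> would have to be registered via --user_defined_symbols:

```python
# Mask placeholder patterns before encoding, restore them after decoding.
# PLACEHOLDER and SYMBOL are hypothetical choices for this sketch.
import re

PLACEHOLDER = re.compile(r"\(\(\(.*?\)\)\)")   # matches (((anything)))
SYMBOL = "<ph>"                                # pass via --user_defined_symbols

def protect(text):
    saved = PLACEHOLDER.findall(text)
    return PLACEHOLDER.sub(SYMBOL, text), saved

def restore(text, saved):
    # put the original strings back, one per masked occurrence, in order
    for s in saved:
        text = text.replace(SYMBOL, s, 1)
    return text

masked, saved = protect("send (((name))) to (((addr)))")
print(masked)                  # send <ph> to <ph>
print(restore(masked, saved))  # send (((name))) to (((addr)))
```

Since <ph> is a user-defined symbol, the encoder keeps it as a single piece, and restore() recovers the exact original text after decoding.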
In Chinese, there are some words composed of an English letter and a Chinese character. They should be recognized as one word during tokenization. However, all the mixed-language words are split when I use SentencePiece to tokenize them.
Is this a special feature of this tool that recognizes the language before tokenizing? Can I disable it?
When I run ./configure during installation, I get the following error:
./configure: line 17069: syntax error near unexpected token `PROTOBUF,'
./configure: line 17069: `PKG_CHECK_MODULES(PROTOBUF, protoc >= 2.4.0)'
If I comment out the offending line, configure passes, but I receive the following error during make:
Undefined symbols for architecture x86_64:
"google::protobuf::Message::Utf8DebugString() const", referenced from:
std::__1::__function::__func<main::$_2, std::__1::allocator<main::$_2>, void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in spm_encode_main.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [spm_encode] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
I am running macOS 10.13.1 and libprotoc 3.5.0
The behaviour of the following function calls is strange:
>>> p.DecodeIds(p.EncodeAsIds('test ^^'))
'test \u2047 '
If '^^' is unknown, why is it decoded as \u2047 and not as the unk token?
It's properly encoded into ids:
>>> p.EncodeAsIds('test ^^')
[2528, 6, 0]
When training a joint SPM model on two or more languages, is there a way to alleviate the problem of segmenting a token in language1 into subunits only seen in language2, causing UNKs at test time?
In subword-nmt, there's a vocabulary threshold for this that allows further segmentation of tokens until the subunits have been seen at least that many times in the relevant language.
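For reference, the subword-nmt behaviour could be approximated as a post-processing step, assuming per-language piece frequencies are available; this sketch falls back to characters instead of re-segmenting, so it is only a rough analogue of the real threshold mechanism:

```python
# Keep a piece only if it was seen at least `threshold` times in the
# relevant language; otherwise fall back to single characters, keeping
# the word-boundary marker attached to the first character.
def apply_vocab_threshold(pieces, lang_freq, threshold, marker="\u2581"):
    out = []
    for p in pieces:
        if lang_freq.get(p, 0) >= threshold:
            out.append(p)
        else:
            prefix = marker if p.startswith(marker) else ""
            body = p[len(marker):] if p.startswith(marker) else p
            chars = list(body)
            if chars:
                chars[0] = prefix + chars[0]
                out.extend(chars)
            else:
                out.append(p)  # a bare marker piece passes through
    return out

freq = {"\u2581ein": 500, "\u2581Haus": 3}
print(apply_vocab_threshold(["\u2581ein", "\u2581Haus"], freq, threshold=50))
# ['▁ein', '▁H', 'a', 'u', 's']
```

Every emitted piece is then either frequent in the target language or a single character, so test-time UNKs from cross-language pieces are avoided.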
Are these control characters meant to be semantically meaningful for downstream DL tasks, or for internal use by sentencepiece?
After sp.SetEncodeExtraOptions('eos'), I can encode strings and the </s> token is appended automatically. However, sp does not recognize </s> as a symbol if it is seen in the text:
sp.EncodeAsPieces('foo\nbar</s>')
=> ['▁f', 'oo', '\n', 'b', 'ar', '<', '/', 's', '>']
I guess I should be re-sampling tokenizations of the training data with SP before each epoch, but it would be nice to see a canonical implementation of this in $FRAMEWORK.
Hi, I've found that feeding some multiline text into sentencepiece results in an unknown token for the newline character. How can I get SP to recognize that '\n' is a valid character?
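A common workaround (an assumption on my part, not a built-in option) is to map '\n' to a user-defined symbol before encoding and map it back after decoding; the <nl> symbol below is hypothetical and would have to be passed to spm_train via --user_defined_symbols:

```python
# Preserve newlines across encode/decode by substituting a user-defined
# symbol. "<nl>" is a made-up choice; any symbol registered with
# --user_defined_symbols would work.
NEWLINE_SYMBOL = "<nl>"

def encode_ready(text):
    return text.replace("\n", NEWLINE_SYMBOL)

def decode_done(text):
    return text.replace(NEWLINE_SYMBOL, "\n")

line = "foo\nbar"
print(encode_ready(line))                        # foo<nl>bar
print(decode_done(encode_ready(line)) == line)   # True
```

Because <nl> is a user-defined symbol, the encoder keeps it as a single piece instead of producing an unknown token.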
Hi,
I am trying to install sentencepiece following the instructions given in the README, on a Red Hat distribution, and the make command gives me an error indicating that my protoc version is not compatible with sentencepiece_model.pb.h. May I ask which version was used to generate the file?
Also, it seems that no package libprotobuf9v5/libprotobuf-c++ is available for RHEL (I found for Ubuntu and Debian). Would there be an alternative to it? Thanks a lot.
Here is the detailed result:
$ make
make all-recursive
make[1]: Entering directory `/data/sentencepiece'
Making all in src
make[2]: Entering directory `/data/sentencepiece/src'
make all-am
make[3]: Entering directory `/data/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I. -std=c++11 -Wall -O3 -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
#error This file was generated by a newer version of protoc which is
^
sentencepiece_model.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
#error incompatible with your Protocol Buffer headers. Please update
^
sentencepiece_model.pb.h:14:2: error: #error your headers.
#error your headers.
^
sentencepiece_model.pb.h:22:35: fatal error: google/protobuf/arena.h: No such file or directory
#include <google/protobuf/arena.h>
^
compilation terminated.
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory `/data/sentencepiece/src'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/data/sentencepiece/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/data/sentencepiece'
Then I downloaded the latest C++ release of Protocol Buffers (v3.2.0) from https://github.com/google/protobuf/releases and compiled it.
When I try the 'make' command again, I get another error saying that this version is too new:
$ make
make all-recursive
make[1]: Entering directory `/data/sentencepiece'
Making all in src
make[2]: Entering directory `/data/sentencepiece/src'
make all-am
make[3]: Entering directory `/data/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I.. -std=c++11 -Wall -O3 -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
^
sentencepiece_model.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
^
sentencepiece_model.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
^
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::TrainerSpec::model_prefix() const’:
sentencepiece_model.pb.h:938:95: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string*) const’
return model_prefix_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:938:95: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::NormalizerSpec::name() const’:
sentencepiece_model.pb.h:1477:87: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string) const’
return name_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1477:87: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::NormalizerSpec::precompiled_charsmap() const’:
sentencepiece_model.pb.h:1531:103: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string) const’
return precompiled_charsmap_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1531:103: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::ModelProto_SentencePiece::piece() const’:
sentencepiece_model.pb.h:1664:88: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string) const’
return piece_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1664:88: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return *ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory `/data/sentencepiece/src'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/data/sentencepiece/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/data/sentencepiece'
make: *** [all] Error 2
Thanks for your help.
Is the [title] possible? That is, can I provide a list of tokens that will never be touched (i.e. never segmented) under the unigram setting?
Hello,
I'd like to take an existing sentencepiece model, modify some of its probabilities manually, and then use it to segment text. What would be the easiest way to do this? The ".vocab" file is easy to modify, but the ".model" file is in a binary format, and it isn't clear what to do with it.
I don't know if this is already implemented, or whether there's a workaround. Sometimes, due to the sheer amount of data, our training sentences are already deduplicated (we keep their frequency counts elsewhere). Of course, this condensed training set does not necessarily reflect the original occurrence statistics of each word, and so may not yield the optimal choice of tokens.
Through experiments, I created two separate segmentation models using (1) the 10M most frequent unique sentences, and (2) 10M random sentences from the entire set (with duplicates spread in the chronological order they were generated). Case (2) is of course favored by the empirical evidence suggested by the developers; however, for our task the case (1) model performed better. So the experiment was inconclusive for deciding which seed set to choose next time.
I also found that the maximum training set size is 100M sentences, and I wanted to take advantage of this limit since we have a huge amount of data. Of course I could simply draw 100M random sentences from all the training data, but that feels rather ad hoc, especially since we're working with CJK languages.
So I guess in a nutshell:
(1) Is there a way to incorporate frequency counts of sentences?
(2) What is a statistically sound way to condense a huge amount of data (say the original is 300% of the maximum allowed) while still reflecting the (approximate) statistics of the words?
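For (1), one stopgap, sketched here since as far as I know sentencepiece has no frequency-count input: re-expand the deduplicated corpus by weighted sampling, so the training sample approximately reflects the original word statistics. The `sample_corpus` helper name is hypothetical.

```python
import random


def sample_corpus(unique_sentences, counts, k, seed=0):
    """Draw k training sentences with replacement, weighting each unique
    sentence by its stored frequency count, so the sample approximates
    the statistics of the original (duplicated) corpus."""
    rng = random.Random(seed)
    return rng.choices(unique_sentences, weights=counts, k=k)
```

Writing the sampled sentences to a file and training on that approximates training on the full corpus, at the cost of sampling noise for rare sentences.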
First, thanks for developing and releasing this package. It's really excellent and I recommend it to everyone :)
Just a very small request: the Python bindings from pip don't support the new hard_vocab_limit functionality, so when you have time I'd appreciate a new release to PyPI to that effect.
I followed all the installation steps without a hitch. `make` passes, as do `make check` and `sudo make install`.
However, when I try to run `spm_train`, `spm_decode`, or `spm_encode`, I get the following error message:
spm_train: error while loading shared libraries: libsentencepiece.so.0: cannot open shared object file: No such file or directory
I am running Ubuntu 16.04.2.
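This error usually means `sudo make install` placed `libsentencepiece.so.0` under `/usr/local/lib`, but the runtime linker doesn't know about it yet. Assuming the default install prefix, either of the following should fix it:

```shell
# Option 1 (as root): rebuild the dynamic linker cache once:
#   sudo ldconfig
# Option 2 (no root needed): point the loader at the install prefix
# for the current shell session:
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH:-}
```

If the library was installed under a non-default prefix, substitute that prefix's `lib` directory accordingly.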
Firstly, I greatly appreciate your library; it's very useful and easy to use. But I've run into a problem: in Vietnamese, a meaningful unit sometimes spans more than one word. For example, in the sentence "I live in Ha Noi", I want "Ha Noi" to stay together after segmentation. Is there a way or a parameter to handle this case? Best wishes!
I would like to define "foo bar" as a user-defined symbol. I see that the model can have whitespace in its tokens; is there a way to easily add this to the user-defined symbols?
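A common workaround, sketched here rather than a built-in feature: pre-join each multi-word unit with a placeholder character before training and encoding, and register the joined forms as user-defined symbols. The `protect` helper, the `PHRASES` list, and the underscore joiner are all assumptions for illustration.

```python
# Multi-word units to keep intact (examples from this thread).
PHRASES = ["Ha Noi", "foo bar"]


def protect(text, joiner="_"):
    """Rewrite each protected phrase as a single whitespace-free token.
    Longer phrases are substituted first so overlaps resolve greedily."""
    for phrase in sorted(PHRASES, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", joiner))
    return text
```

The joined tokens ("Ha_Noi", "foo_bar") can then be passed to training via --user_defined_symbols, and the joiner replaced back with a space after decoding.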
Hi,
Thank you for publishing such a great research!
After studying your docs, we assumed that the default training settings would retain the full alphabet, so that the transformation is reversible. Based on that assumption we built our model for a Polish NLP contest, and we are now trying not to get disqualified :( (I hope the competition jury will reason with us.)
We tracked the issue down to the default setting of character_coverage which is set to 0.9995.
I understand that this is normally the recommended setting, as the model may work better that way. But it would be good to mention it in the docs, to help anyone who needs the transformation to be reversible.
I can submit a PR to emphasize that the default character_coverage is not 1.0, but I'm not sure exactly how this parameter works.
For anyone who comes across this issue: just train your model with `--character_coverage 1.0`.
In our case, the default coverage of 0.9995 removed as many as 4 characters from our 92-character alphabet. The removed characters were indeed the least frequently used, so I guess the coverage means how much of the text will be restored after reversing the transformation, but that's just a guess.
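That reading seems consistent with the observed behavior. As a character-level sketch of what a coverage threshold like 0.9995 would do (the `covered_chars` helper is hypothetical, not sentencepiece's actual implementation): rank characters by frequency and keep the smallest set whose cumulative frequency reaches the threshold; everything below the cut is treated as unknown.

```python
from collections import Counter


def covered_chars(corpus, coverage=0.9995):
    """Return the smallest set of characters (most frequent first) whose
    cumulative frequency reaches `coverage` of all characters in `corpus`."""
    counts = Counter("".join(corpus))
    total = sum(counts.values())
    kept, cum = set(), 0
    for ch, c in counts.most_common():
        if cum / total >= coverage:
            break
        kept.add(ch)
        cum += c
    return kept
```

Under this model, rare characters in a 92-letter alphabet fall below a 0.9995 cut exactly as described above, while `coverage=1.0` keeps every character and preserves reversibility.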
Anyway, this is a super small issue, and the algorithm is very useful. Once again thank you for making the library public.
Regards,
Piotr