Comments (3)
What exactly do you mean by pre-tokenization here? SentencePiece does not have a pre-tokenization step; it processes the raw text directly.
from sentencepiece.
Sorry, I assumed that SentencePiece does some pre-processing on the sentences and that this was called pre-tokenization.
What I am asking about are the rules by which SentencePiece splits words during training (for example, you cannot get a token such as you▁are or ▁other...). I know that it splits on whitespace and, from a quick analysis of the resulting vocabulary, it appears to separate sequences of punctuation symbols from sequences of alphanumeric symbols. Are there more rules like these, e.g. regexes?
from sentencepiece.
SentencePiece doesn't have a concept like "words". There are several languages that do not put whitespace between words, and in those languages it is not trivial to define and run a pre-tokenization step. The main motivation of SentencePiece is to get rid of this complicated pre-tokenization step and make tokenization language-independent. SentencePiece doesn't have any language-dependent word tokenization rules, regexes, patterns, etc.
"you are" is not kept together simply because the vocab doesn't contain the token "you▁are". The --split_by_whitespace
option of the trainer (when set to false) allows extracting tokens like you▁are.
String normalization, e.g. NFKC, is the only preprocessing. Please see the following document to configure the normalization:
https://github.com/google/sentencepiece/blob/master/doc/normalization.md
from sentencepiece.