Comments (3)
What exactly do you mean by pre-tokenization here? SentencePiece does not have a pre-tokenization step; it processes the raw text directly.
from sentencepiece.
Sorry, I assumed that SentencePiece does some pre-processing on the sentences and that this was called pre-tokenization.
What I am asking about are the rules by which SentencePiece splits words during training (for example, you cannot get a token such as you▁are or ▁other...). I know that it splits on whitespace and, from a quick analysis of the resulting vocabulary, it appears to separate sequences of punctuation symbols from sequences of alphanumeric symbols. Are there more rules like these, e.g. regexes?
from sentencepiece.
SentencePiece doesn't have a concept like "words". There are several languages that do not put whitespace between words, and in those languages it is not trivial to define and run a pre-tokenization step. The main motivation of SentencePiece is to get rid of this complicated pre-tokenization step and make tokenization language-independent. SentencePiece doesn't have any language-dependent word tokenization rules, regexes, patterns, etc.
"you are" is not kept together simply because the vocab doesn't contain the token "you▁are". The --split_by_whitespace
option of the trainer (when set to false) allows extracting tokens like you▁are.
String normalization, e.g. NFKC, is the only preprocessing. Please see the following document to configure the normalization:
https://github.com/google/sentencepiece/blob/master/doc/normalization.md
from sentencepiece.