Comments (3)

taku910 commented on June 14, 2024

What exactly do you mean by pre-tokenization here? Sentencepiece does not have a pre-tokenization step because it processes the raw text directly.

SeverinoDaDalt commented on June 14, 2024

Sorry, I assumed that sentencepiece does some pre-processing on the sentences and that this was called pre-tokenization.

What I am asking is: what are the rules by which sentencepiece splits words during training (for example, you cannot get a token such as you▁are or ▁other...)? I know that it does whitespace splitting and, from a quick analysis of the resulting vocabulary, it seems to separate sequences of punctuation symbols from sequences of alphanumeric symbols. Are there more of these regexes?

taku910 commented on June 14, 2024

Sentencepiece doesn't have a concept of "words". There are several languages that do not put whitespace between words, and in these languages it is not trivial to define and run a pre-tokenization step. The main motivation of sentencepiece is to get rid of all these complicated pre-tokenization steps and make tokenization language independent. Sentencepiece doesn't have any language-dependent word tokenization rules, regexps, patterns, etc.

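"you are" is never emitted as a single piece simply because the vocab doesn't contain the token "you▁are"; no splitting rule is applied at segmentation time. Setting the trainer's --split_by_whitespace option to false allows tokens such as you▁are to be extracted.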
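
To make the point above concrete, here is a minimal sketch of vocabulary-driven segmentation. It is a toy greedy longest-match segmenter, not SentencePiece's actual algorithm, and the function names and both vocabularies are made up for illustration. The only thing it is meant to show is that whether "you are" comes out as one piece or two depends on what the vocabulary contains, not on a word-splitting rule.

```python
WS = "\u2581"  # the "▁" meta symbol that stands in for whitespace


def escape_whitespace(text):
    """Replace spaces with the meta symbol and add a leading marker."""
    return WS + text.replace(" ", WS)


def greedy_segment(text, vocab):
    """Toy longest-match segmentation (hypothetical helper, illustration only).

    Falls back to a single character when no vocabulary entry matches.
    """
    s = escape_whitespace(text)
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabularies: the second one contains a piece that
# crosses whitespace, the first one does not.
vocab_without = {WS + "you", WS + "are"}
vocab_with = vocab_without | {WS + "you" + WS + "are"}

print(greedy_segment("you are", vocab_without))
print(greedy_segment("you are", vocab_with))
```

With the first vocabulary the text comes out as two pieces, with the second as one. In the real library, whether such cross-whitespace pieces can enter the vocabulary at all is decided at training time by the option mentioned above.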

String normalization (e.g. NFKC) is the only preprocessing. Please see the following document to configure the normalization.
https://github.com/google/sentencepiece/blob/master/doc/normalization.md
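
For readers unfamiliar with NFKC, the sketch below shows what this kind of normalization does to a few strings. It uses Python's standard `unicodedata` module as a stand-in, so it illustrates plain Unicode NFKC only; the rule sets SentencePiece itself ships are described in the linked document and may differ in detail. The helper name `nfkc` is just for this example.

```python
import unicodedata


def nfkc(text):
    """Apply plain Unicode NFKC normalization (stand-in for illustration)."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\uff46\uff55\uff4c\uff4c",  # full-width Latin letters
    "\ufb01ne",                  # starts with the "fi" ligature
    "\u2460",                    # circled digit one
]

for s in samples:
    # Compatibility characters are folded into their plain equivalents.
    print(repr(s), "->", repr(nfkc(s)))
```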
