
danieldk / sentencepiece

19 stars · 2 watchers · 6 forks · 293 KB

Rust binding for the sentencepiece library

License: Other

Languages: Rust 82.07% · C++ 10.40% · C 3.77% · Shell 2.05% · PowerShell 1.70%

Topics: sentencepiece, rust

sentencepiece's Introduction

sentencepiece

This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.

libsentencepiece dependency

This crate depends on the sentencepiece C++ library. By default, this dependency is treated as follows:

  • If sentencepiece could be found with pkg-config, the crate will link against the library found through pkg-config. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
  • Otherwise, the crate's build script will do a static build of the sentencepiece library. This requires that cmake is available.

If you wish to override this behavior, the sentencepiece-sys crate offers two features:

  • system: always attempt to link to the sentencepiece library found with pkg-config.
  • static: always do a static build of the sentencepiece library and link against that.

sentencepiece's People

Contributors

danieldk · dependabot-preview[bot] · framp · systemcluster

Stargazers

19 stargazers

Watchers

2 watchers

sentencepiece's Issues

Loading a SentencePiece protobuf

Hello,

I'd like to load a SentencePiece model file, and I thought about using your library for this purpose. I'd prefer not to install the C++ sentencepiece library, and for now only want to load the model file into a trie structure.

I see you created a protobuf definition and compiled it into a Rust module. I tried using this compiled protobuf without success so far:

use crate::sentencepiece::SentencePieceText;
use protobuf::parse_from_bytes;

let _contents = include_bytes!("path/to/toy.model");
let _ = parse_from_bytes::<SentencePieceText>(_contents).unwrap();

Could it be that I am trying to map the model to the wrong protobuf?
Thank you!

Support for subword regularization (SampleEncode)

I would find it helpful to be able to use subword regularization with this library. In practice, that would mean wrapping SampleEncodeAsSerializedProto with a function very similar to encode, but accepting two additional arguments.
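The shape of such a wrapper might look like the following sketch. All names here are assumptions mirroring the C++ SampleEncodeAsSerializedProto parameters, and a stub `Processor` stands in for the real FFI-backed `SentencePieceProcessor` so the sketch is self-contained:

```rust
/// Minimal sketch of a subword-regularization API, assuming the same
/// return shape as `encode`. `Processor` and `PieceWithId` here are
/// stand-ins, not the crate's actual types.
#[derive(Debug, PartialEq)]
pub struct PieceWithId {
    pub piece: String,
    pub id: u32,
}

pub struct Processor;

impl Processor {
    /// `nbest_size`: sample among the n-best segmentations (a negative
    /// value samples from all segmentations); `alpha`: smoothing
    /// exponent for the sampling distribution.
    pub fn sample_encode(
        &self,
        sentence: &str,
        _nbest_size: i32,
        _alpha: f32,
    ) -> Result<Vec<PieceWithId>, String> {
        // Stub behavior: the real implementation would call
        // SampleEncodeAsSerializedProto through the FFI layer and
        // deserialize the returned proto into pieces.
        Ok(sentence
            .split_whitespace()
            .map(|w| PieceWithId { piece: w.to_string(), id: 0 })
            .collect())
    }
}
```

The two extra arguments correspond to the sampling parameters the C++ API exposes; everything else could stay identical to the existing `encode` signature.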

Panic on encountering string with null byte

Rust strings are allowed to contain the null character (U+0000 is a valid code point in UTF-8). SentencePiece itself can handle this; for example, in the Python API you can call model.encode("\0") just fine. And the C++ API uses absl::string_view, which according to its documentation allows interior null characters.

However, this Rust wrapper converts the Rust &str to a CString and then calls unwrap, which panics:

let c_sentence = CString::new(sentence).unwrap();

I'm trying to use this library in a project involving large amounts of raw text, and part of why I find sentencepiece attractive is that it automatically normalizes a large number of Unicode artifacts. Would it be possible to adjust this wrapper to pass strings to sentencepiece transparently, even if they contain a null character?
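For illustration, `CString::new` is exactly the step that fails on interior NULs, and because absl::string_view carries an explicit length, a binding could in principle pass pointer + length instead of a NUL-terminated copy (the pointer/length FFI shape below is an assumption, shown std-only):

```rust
use std::ffi::CString;

fn main() {
    let sentence = "foo\0bar";

    // CString::new rejects interior NUL bytes, so the wrapper's
    // `CString::new(sentence).unwrap()` panics on input like this:
    assert!(CString::new(sentence).is_err());

    // A string_view-style C ABI takes (data, size), so no NUL
    // terminator is needed; the interior NUL is just another byte.
    // These are the two values such a hypothetical call would pass:
    let ptr: *const u8 = sentence.as_ptr();
    let len: usize = sentence.len();
    assert_eq!(len, 7);
    let _ = ptr; // would be handed to the FFI function
}
```

This avoids both the copy and the panic, at the cost of an FFI entry point that accepts a length argument.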

Off by 1 error?

I'm comparing the output of this library with the output of Python, and I see a difference of 1 in the token ids.
I can decode the Python tokens in Rust when I subtract 1 from the token ids:


from transformers import AutoTokenizer, NllbTokenizerFast, AutoModelForSeq2SeqLM


tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
article = "English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization."
inputs = tokenizer(article, return_tensors="pt")
print(inputs)
fn main() {
    use sentencepiece::SentencePieceProcessor;

    let spp_model_path = "path_to/sentencepiece.bpe.model"; // from the same hf repo as the model above
    let spp = SentencePieceProcessor::open(spp_model_path).unwrap();
    let article = "English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization.";
    let pieces = spp
        .encode(article)
        .unwrap()
        .into_iter()
        .map(|p| p.id)
        .collect::<Vec<_>>();
    println!("{:?}", &pieces);
    let result = spp.decode_piece_ids(&pieces);
    println!("{:?}", result);
    let pieces_rust = vec![
        30310, 103, 253989, 179, 248119, 68423, 248062, 253989, 248160, 247, 2635, 387, 348,
        179662, 65444, 5056, 9088, 201, 3291, 28063, 248074, 716, 22755, 201, 10410, 8161, 1481,
        1258, 248115, 248071, 6398, 201, 3291, 28063, 5056, 8, 30157, 65444, 248078, 1258, 12515,
        10410, 8161, 348, 113, 29132, 7553, 248282, 44777, 107, 348, 248058, 253989, 84411, 248119,
        7496, 253989, 22659, 50548, 37491, 451, 348, 1775, 429, 2500, 107, 21533, 117079, 248074,
    ];
    let pieces_python = vec![
        256047, 30311, 104, 253990, 256047, 248059, 253990, 2481, 61, 248, 2636, 388, 349, 179663,
        65445, 5057, 9089, 202, 3292, 28064, 248075, 717, 22756, 202, 10411, 8162, 1482, 1259,
        248116, 248072, 6399, 202, 3292, 28064, 5057, 9, 30158, 65445, 248079, 1259, 12516, 10411,
        8162, 349, 114, 29133, 7554, 248283, 44778, 108, 349, 248059, 253990, 84412, 248120, 7497,
        253990, 22660, 50549, 37492, 452, 349, 1776, 430, 2501, 108, 21534, 117080, 248075, 2,
    ]
    .into_iter()
    .filter(|&p| p < 256040)
    .map(|p| p - 1)
    .collect::<Vec<u32>>();
    let result_rust = spp.decode_piece_ids(&pieces_rust);
    println!("{:?}", result_rust);
    let result_py = spp.decode_piece_ids(&pieces_python);
    println!("{:?}", result_py);
}
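The snippet above hard-codes a subtraction; a small helper can check whether the two sequences really differ by one constant offset once ids at or above some cutoff (special tokens the HF tokenizer adds) are filtered out. The helper name and cutoff handling here are illustrative, not part of the crate:

```rust
/// Returns Some(offset) when `python_ids`, after dropping ids at or
/// above `cutoff`, equals `rust_ids` shifted by one constant offset;
/// None otherwise (including on a length mismatch).
fn constant_offset(rust_ids: &[u32], python_ids: &[u32], cutoff: u32) -> Option<i64> {
    // Drop the extra special-token ids the Python tokenizer inserts.
    let filtered: Vec<u32> = python_ids.iter().copied().filter(|&p| p < cutoff).collect();
    if filtered.len() != rust_ids.len() {
        return None;
    }
    // Per-position differences between the two id sequences.
    let offsets: Vec<i64> = filtered
        .iter()
        .zip(rust_ids)
        .map(|(&p, &r)| i64::from(p) - i64::from(r))
        .collect();
    match offsets.first() {
        Some(&first) if offsets.iter().all(|&o| o == first) => Some(first),
        _ => None,
    }
}
```

If this returns Some(1) for the two id lists above, the discrepancy is a uniform shift rather than a tokenization difference.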
