
danieldk / sentencepiece

19 stars · 2 watchers · 6 forks · 293 KB

Rust binding for the sentencepiece library

License: Other

Languages: Rust 82.07% · C++ 10.40% · C 3.77% · Shell 2.05% · PowerShell 1.70%

Topics: sentencepiece, rust

sentencepiece's Introduction

sentencepiece

This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.

libsentencepiece dependency

This crate depends on the sentencepiece C++ library. By default, this dependency is treated as follows:

  • If sentencepiece could be found with pkg-config, the crate will link against the library found through pkg-config. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
  • Otherwise, the crate's build script will do a static build of the sentencepiece library. This requires that cmake is available.

If you wish to override this behavior, the sentencepiece-sys crate offers two features:

  • system: always attempt to link to the sentencepiece library found with pkg-config.
  • static: always do a static build of the sentencepiece library and link against that.

sentencepiece's People

Contributors

danieldk · dependabot-preview[bot] · framp · systemcluster

Stargazers

19 stargazers

Watchers

2 watchers

sentencepiece's Issues

Loading a SentencePiece protobuf

Hello,

I'd like to load a SentencePiece model file, and I thought about using your library for this purpose. I'd prefer not to install the C++ sentencepiece library, and for now only want to load the model file into a trie structure.

I see you created a protobuf definition and compiled it into a Rust module. I tried using this compiled protobuf without success so far:

use crate::sentencepiece::SentencePieceText;
use protobuf::parse_from_bytes;

let _contents = include_bytes!("path/to/toy.model");
let _ = parse_from_bytes::<SentencePieceText>(_contents).unwrap();

Could it be that I am trying to map the model to the wrong protobuf?
Thank you!

Support for subword regularization (SampleEncode)

I would find it helpful to be able to use subword regularization with this library. In practice, that would mean wrapping SampleEncodeAsSerializedProto with a function very similar to encode, but accepting two additional arguments.
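The shape of such a wrapper might look like the following sketch. All names here are assumptions mirroring the C++ SampleEncodeAsSerializedProto parameters, and a stub `Processor` stands in for the real FFI-backed `SentencePieceProcessor` so the sketch is self-contained:

```rust
/// Minimal sketch of a subword-regularization API, assuming the same
/// return shape as `encode`. `Processor` and `PieceWithId` here are
/// stand-ins, not the crate's actual types.
#[derive(Debug, PartialEq)]
pub struct PieceWithId {
    pub piece: String,
    pub id: u32,
}

pub struct Processor;

impl Processor {
    /// `nbest_size`: sample among the n-best segmentations (a negative
    /// value samples from all segmentations); `alpha`: smoothing
    /// exponent for the sampling distribution.
    pub fn sample_encode(
        &self,
        sentence: &str,
        _nbest_size: i32,
        _alpha: f32,
    ) -> Result<Vec<PieceWithId>, String> {
        // Stub behavior: the real implementation would call
        // SampleEncodeAsSerializedProto through the FFI layer and
        // deserialize the returned proto into pieces.
        Ok(sentence
            .split_whitespace()
            .map(|w| PieceWithId { piece: w.to_string(), id: 0 })
            .collect())
    }
}
```

The two extra arguments correspond to the sampling parameters the C++ API exposes; everything else could stay identical to the existing `encode` signature.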

Panic on encountering string with null byte

Rust strings are allowed to contain the null character (U+0000 is a valid code point in UTF-8). SentencePiece itself can handle this; for example, in the Python API you can call model.encode("\0") just fine. And the C++ API uses absl::string_view, which according to its documentation allows interior null characters.

However, this Rust wrapper converts the Rust &str to a CString and then calls unwrap, which panics:

let c_sentence = CString::new(sentence).unwrap();

I'm trying to use this library in a project involving large amounts of raw text, and part of why I find sentencepiece attractive is that it automatically normalizes a large number of Unicode artifacts. Would it be possible to adjust this wrapper to pass strings to sentencepiece transparently, even if they contain a null character?
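For illustration, `CString::new` is exactly the step that fails on interior NULs, and because absl::string_view carries an explicit length, a binding could in principle pass pointer + length instead of a NUL-terminated copy (the pointer/length FFI shape below is an assumption, shown std-only):

```rust
use std::ffi::CString;

fn main() {
    let sentence = "foo\0bar";

    // CString::new rejects interior NUL bytes, so the wrapper's
    // `CString::new(sentence).unwrap()` panics on input like this:
    assert!(CString::new(sentence).is_err());

    // A string_view-style C ABI takes (data, size), so no NUL
    // terminator is needed; the interior NUL is just another byte.
    // These are the two values such a hypothetical call would pass:
    let ptr: *const u8 = sentence.as_ptr();
    let len: usize = sentence.len();
    assert_eq!(len, 7);
    let _ = ptr; // would be handed to the FFI function
}
```

This avoids both the copy and the panic, at the cost of an FFI entry point that accepts a length argument.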

Off by 1 error?

I'm comparing the output of this library with the output of Python, and I see a difference of 1 in the token ids.
I can decode the Python tokens in Rust when I subtract 1 from the token ids:


from transformers import AutoTokenizer, NllbTokenizerFast, AutoModelForSeq2SeqLM


tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
article = "English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization."
inputs = tokenizer(article, return_tensors="pt")
print(inputs)
fn main() {
    use sentencepiece::SentencePieceProcessor;

    let spp_model_path = "path_to/sentencepiece.bpe.model"; // from the same hf repo as the model above
    let spp = SentencePieceProcessor::open(spp_model_path).unwrap();
    let article = "English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization.";
    let pieces = spp
        .encode(article)
        .unwrap()
        .into_iter()
        .map(|p| p.id)
        .collect::<Vec<_>>();
    println!("{:?}", &pieces);
    let result = spp.decode_piece_ids(&pieces);
    println!("{:?}", result);
    let pieces_rust = vec![
        30310, 103, 253989, 179, 248119, 68423, 248062, 253989, 248160, 247, 2635, 387, 348,
        179662, 65444, 5056, 9088, 201, 3291, 28063, 248074, 716, 22755, 201, 10410, 8161, 1481,
        1258, 248115, 248071, 6398, 201, 3291, 28063, 5056, 8, 30157, 65444, 248078, 1258, 12515,
        10410, 8161, 348, 113, 29132, 7553, 248282, 44777, 107, 348, 248058, 253989, 84411, 248119,
        7496, 253989, 22659, 50548, 37491, 451, 348, 1775, 429, 2500, 107, 21533, 117079, 248074,
    ];
    let pieces_python = vec![
        256047, 30311, 104, 253990, 256047, 248059, 253990, 2481, 61, 248, 2636, 388, 349, 179663,
        65445, 5057, 9089, 202, 3292, 28064, 248075, 717, 22756, 202, 10411, 8162, 1482, 1259,
        248116, 248072, 6399, 202, 3292, 28064, 5057, 9, 30158, 65445, 248079, 1259, 12516, 10411,
        8162, 349, 114, 29133, 7554, 248283, 44778, 108, 349, 248059, 253990, 84412, 248120, 7497,
        253990, 22660, 50549, 37492, 452, 349, 1776, 430, 2501, 108, 21534, 117080, 248075, 2,
    ]
    .into_iter()
    .filter(|&p| p < 256040)
    .map(|p| p - 1)
    .collect::<Vec<u32>>();
    let result_rust = spp.decode_piece_ids(&pieces_rust);
    println!("{:?}", result_rust);
    let result_py = spp.decode_piece_ids(&pieces_python);
    println!("{:?}", result_py);
}
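The snippet above hard-codes a subtraction; a small helper can check whether the two sequences really differ by one constant offset once ids at or above some cutoff (special tokens the HF tokenizer adds) are filtered out. The helper name and cutoff handling here are illustrative, not part of the crate:

```rust
/// Returns Some(offset) when `python_ids`, after dropping ids at or
/// above `cutoff`, equals `rust_ids` shifted by one constant offset;
/// None otherwise (including on a length mismatch).
fn constant_offset(rust_ids: &[u32], python_ids: &[u32], cutoff: u32) -> Option<i64> {
    // Drop the extra special-token ids the Python tokenizer inserts.
    let filtered: Vec<u32> = python_ids.iter().copied().filter(|&p| p < cutoff).collect();
    if filtered.len() != rust_ids.len() {
        return None;
    }
    // Per-position differences between the two id sequences.
    let offsets: Vec<i64> = filtered
        .iter()
        .zip(rust_ids)
        .map(|(&p, &r)| i64::from(p) - i64::from(r))
        .collect();
    match offsets.first() {
        Some(&first) if offsets.iter().all(|&o| o == first) => Some(first),
        _ => None,
    }
}
```

If this returns Some(1) for the two id lists above, the discrepancy is a uniform shift rather than a tokenization difference.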
