
Comments (3)

polarathene commented on July 2, 2024

In the meantime, I've implemented this alternative workaround (using buildstructor):

struct TokenizerX;
#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}

Usage:

let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;
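
For anyone unfamiliar with the pattern, here is a rough hand-written sketch of the kind of fluent, fallible builder that the usage above relies on. It is not what buildstructor generates: `SketchModel`, `SketchTokenizer` and `SketchBuilder` are hypothetical stand-ins, and the required model is checked at runtime in `build()` purely to keep the example short.

```rust
// Hypothetical stand-in types; none of these come from the `tokenizers` crate.
#[derive(Debug, Clone, PartialEq)]
struct SketchModel(String);

#[derive(Debug, Clone, PartialEq)]
struct SketchTokenizer {
    model: SketchModel,
    decoder: Option<String>,
    normalizer: Option<String>,
}

// Collects the pieces one call at a time; every field starts out unset.
#[derive(Debug, Default)]
struct SketchBuilder {
    model: Option<SketchModel>,
    decoder: Option<String>,
    normalizer: Option<String>,
}

impl SketchBuilder {
    fn new() -> Self {
        Self::default()
    }

    // Each setter takes `self` by value and returns it, which enables chaining.
    fn with_model(mut self, model: SketchModel) -> Self {
        self.model = Some(model);
        self
    }

    fn with_decoder(mut self, decoder: &str) -> Self {
        self.decoder = Some(decoder.to_string());
        self
    }

    fn with_normalizer(mut self, normalizer: &str) -> Self {
        self.normalizer = Some(normalizer.to_string());
        self
    }

    // Fallible build: the model is required, the other parts are optional.
    fn build(self) -> Result<SketchTokenizer, String> {
        let model = self
            .model
            .ok_or_else(|| "a model is required".to_string())?;
        Ok(SketchTokenizer {
            model,
            decoder: self.decoder,
            normalizer: self.normalizer,
        })
    }
}
```

Calling `SketchBuilder::new().with_model(SketchModel("bpe".to_string())).build()` yields a tokenizer with no decoder or normalizer, while skipping `with_model` makes `build()` return an error.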

The local-to-remote enum logic above handles the related DecoderWrapper and NormalizerWrapper enums, which were also a bit noisy to use and grok, so I have a similar workaround for those:

let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);

let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
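
As a minimal sketch of how such a local enum could be mapped onto a remote type with `TryFrom`: the `Decoder` enum below mirrors the variants used above, while `RemoteDecoder` is a hypothetical stand-in for `DecoderWrapper` (the real type lives in the tokenizers crate and is constructed differently). The empty-pattern check is only an assumed example of a conversion that can fail.

```rust
use std::convert::TryFrom;

// Local, terse description of a decoder (same variants as in the snippet above).
#[derive(Debug, Clone, PartialEq)]
enum Decoder<'a> {
    ByteFallback,
    Fuse,
    Replace(&'a str, &'a str),
    Strip(char, usize, usize),
    Sequence(Vec<Decoder<'a>>),
}

// Hypothetical stand-in for the remote wrapper enum; not the real `DecoderWrapper`.
#[derive(Debug, Clone, PartialEq)]
enum RemoteDecoder {
    ByteFallback,
    Fuse,
    Replace { pattern: String, content: String },
    Strip { content: char, start: usize, stop: usize },
    Sequence(Vec<RemoteDecoder>),
}

impl<'a> TryFrom<Decoder<'a>> for RemoteDecoder {
    type Error = String;

    fn try_from(value: Decoder<'a>) -> Result<Self, Self::Error> {
        Ok(match value {
            Decoder::ByteFallback => RemoteDecoder::ByteFallback,
            Decoder::Fuse => RemoteDecoder::Fuse,
            Decoder::Replace(pattern, content) => {
                // Assumed validation, standing in for a fallible remote constructor.
                if pattern.is_empty() {
                    return Err("replace pattern must not be empty".to_string());
                }
                RemoteDecoder::Replace {
                    pattern: pattern.to_string(),
                    content: content.to_string(),
                }
            }
            Decoder::Strip(content, start, stop) => {
                RemoteDecoder::Strip { content, start, stop }
            }
            // Convert nested decoders recursively; the first error aborts the whole conversion.
            Decoder::Sequence(items) => RemoteDecoder::Sequence(
                items
                    .into_iter()
                    .map(RemoteDecoder::try_from)
                    .collect::<Result<Vec<_>, _>>()?,
            ),
        })
    }
}
```

With this in place, `RemoteDecoder::try_from(decoder)?` inside a builder function behaves like the `DecoderWrapper::try_from(decoder)?` call shown earlier.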

More details at mistral.rs.

from tokenizers.

ArthurZucker commented on July 2, 2024

The builder is, I believe, mostly used for training.


polarathene commented on July 2, 2024

@ArthurZucker perhaps you could document that better? By its naming convention and current doc comment, it implies that it is the builder pattern for the Tokenizer struct:

Builder for Tokenizer structs.

It provides an API that matches what you'd expect of a builder API, and its build() method returns a type that is used to construct a Tokenizer struct (which also has a From impl for this type):

impl<M, N, PT, PP, D> From<TokenizerImpl<M, N, PT, PP, D>> for Tokenizer
where
    M: Into<ModelWrapper>,
    N: Into<NormalizerWrapper>,
    PT: Into<PreTokenizerWrapper>,
    PP: Into<PostProcessorWrapper>,
    D: Into<DecoderWrapper>,
{
    fn from(t: TokenizerImpl<M, N, PT, PP, D>) -> Self {
        Self(TokenizerImpl {
            model: t.model.into(),
            normalizer: t.normalizer.map(Into::into),
            pre_tokenizer: t.pre_tokenizer.map(Into::into),
            post_processor: t.post_processor.map(Into::into),
            decoder: t.decoder.map(Into::into),
            added_vocabulary: t.added_vocabulary,
            padding: t.padding,
            truncation: t.truncation,
        })
    }
}

impl Tokenizer {
    /// Construct a new Tokenizer based on the model.
    pub fn new(model: impl Into<ModelWrapper>) -> Self {
        Self(TokenizerImpl::new(model.into()))
    }

    /// Unwrap the TokenizerImpl.
    pub fn into_inner(
        self,
    ) -> TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > {
        self.0
    }

    // (excerpt ends here; the rest of the impl is omitted)
}

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct Tokenizer(
    TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    >,
);


As the issue reports, though, that doesn't work very well: the builder API is awkward to use. You could probably adapt it to use buildstructor, similar to my TokenizerX workaround type shown above (which also applies a similar workaround to the Decoder / Normalizer inputs for a better DX, but that is not required).

Presently, due to the issue reported here, the builder offers little value over creating the tokenizer without a fluent builder API.
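
To make the From impl quoted above easier to follow, here is a cut-down sketch of the same shape: a generic implementation type, and a newtype that pins every parameter to a wrapper enum. All names (`SketchImpl`, `SketchTokenizer`, `BpeModel`, `FuseDecoder`, and the two wrapper enums) are hypothetical stand-ins with only two type parameters, not the real definitions.

```rust
// Hypothetical wrapper enums, standing in for ModelWrapper / DecoderWrapper.
#[derive(Debug, Clone, PartialEq)]
enum SketchModelWrapper {
    Bpe,
}

#[derive(Debug, Clone, PartialEq)]
enum SketchDecoderWrapper {
    Fuse,
}

// Hypothetical concrete component types that convert into the wrappers.
struct BpeModel;
struct FuseDecoder;

impl From<BpeModel> for SketchModelWrapper {
    fn from(_: BpeModel) -> Self {
        SketchModelWrapper::Bpe
    }
}

impl From<FuseDecoder> for SketchDecoderWrapper {
    fn from(_: FuseDecoder) -> Self {
        SketchDecoderWrapper::Fuse
    }
}

// Generic implementation type: each component keeps its concrete type.
struct SketchImpl<M, D> {
    model: M,
    decoder: Option<D>,
}

// Newtype that fixes every parameter to its wrapper enum.
struct SketchTokenizer(SketchImpl<SketchModelWrapper, SketchDecoderWrapper>);

impl<M, D> From<SketchImpl<M, D>> for SketchTokenizer
where
    M: Into<SketchModelWrapper>,
    D: Into<SketchDecoderWrapper>,
{
    fn from(t: SketchImpl<M, D>) -> Self {
        SketchTokenizer(SketchImpl {
            model: t.model.into(),
            // Optional parts are converted only when present.
            decoder: t.decoder.map(Into::into),
        })
    }
}
```

One thing this sketch makes visible: when `decoder` is `None`, the compiler cannot infer `D`, so the call site has to spell it out, for example `SketchImpl::<BpeModel, FuseDecoder> { model: BpeModel, decoder: None }`.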
