
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.

Home Page: https://www.nuget.org/packages/FastBertTokenizer

License: MIT License



FastBertTokenizer


A fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically verified by extensive unit tests and benchmarks. Native AOT compatible, with support for netstandard2.0.

Goals

  • Enabling you to run your AI workloads on .NET in production.
  • Correctness - Results that are equivalent to those of HuggingFace Transformers' AutoTokenizer in all practical cases.
  • Speed - Tokenization should be as fast as reasonably possible.
  • Ease of use - The API should be easy to understand and use.

Getting Started

```bash
dotnet new console
dotnet add package FastBertTokenizer
```

```csharp
using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);

// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
```
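The three values returned by Encode describe the same token positions, so they can be consumed in lockstep. A minimal sketch building on the snippet above (the length equality and the [CLS]/[SEP] ids 101/102 match the output shown; treat this as an illustration, not a spec):

```csharp
using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");

// All three buffers cover the same token positions, so their lengths match.
Console.WriteLine(inputIds.Length == attentionMask.Length);
Console.WriteLine(inputIds.Length == tokenTypeIds.Length);

// bert-base-uncased wraps every sequence in [CLS] (id 101) and [SEP] (id 102),
// as visible in the example output above.
Console.WriteLine(inputIds.Span[0]);                   // 101
Console.WriteLine(inputIds.Span[inputIds.Length - 1]); // 102
```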

See the example project in the repository for a complete sample.

Comparison to BERTTokenizers

Note that while BERTTokenizers handles token type ids incorrectly, it does support input of two pieces of text that are tokenized with a separator in between. FastBertTokenizer currently does not support this.

Speed / Benchmarks

tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.

All benchmarks were performed on a typical end user notebook, a ThinkPad T14s Gen 1:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.204

Similar results can also be observed using GitHub Actions. Note, however, that using shared CI runners for benchmarking has drawbacks and can lead to varying results.

.NET 6.0 vs. .NET 8.0

  • .NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2 vs. .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  • Workload: Encode up to 512 tokens from each of 15,000 articles from Simple English Wikipedia.
  • Results: Total tokens produced: 3,657,145; on .NET 8: ~11m tokens/s single threaded, ~73m tokens/s multithreaded.
| Method                       | Runtime  | Mean      | Error    | StdDev   | Ratio | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|------------------------------|----------|----------:|---------:|---------:|------:|---------:|---------:|---------:|----------:|------------:|
| Singlethreaded               | .NET 6.0 | 450.39 ms | 7.340 ms | 6.866 ms |  1.00 |        - |        - |        - |      2 MB |        1.00 |
| MultithreadedMemReuseBatched | .NET 6.0 |  72.46 ms | 1.337 ms | 1.251 ms |  0.16 | 750.0000 | 250.0000 | 250.0000 |  12.75 MB |        6.39 |
| Singlethreaded               | .NET 8.0 | 332.51 ms | 6.574 ms | 7.826 ms |  1.00 |        - |        - |        - |   1.99 MB |        1.00 |
| MultithreadedMemReuseBatched | .NET 8.0 |  50.83 ms | 0.999 ms | 1.995 ms |  0.15 | 500.0000 |        - |        - |  12.75 MB |        6.40 |
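The multithreaded benchmark uses batched, memory-reusing APIs; a simpler way to spread work over cores is to call the Encode overload from the getting-started snippet from parallel workers. A hedged sketch (it assumes Encode may be called concurrently on a single instance; if it may not, create one BertTokenizer per worker instead - the article texts are placeholders):

```csharp
using System.Collections.Concurrent;
using System.Linq;
using FastBertTokenizer;

// Placeholder corpus; in the benchmark this is 15,000 wikipedia articles.
string[] articles = ["First article text ...", "Second article text ...", "Third article text ..."];

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");

// Assumption: Encode is safe to call concurrently on one instance.
// If it is not, use one tokenizer per worker (e.g. via Parallel.ForEach's localInit).
var tokenCounts = new ConcurrentBag<int>();
Parallel.ForEach(articles, article =>
{
    var (inputIds, _, _) = tok.Encode(article);
    tokenCounts.Add(inputIds.Length);
});
Console.WriteLine($"total tokens: {tokenCounts.Sum()}");
```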
vs. SharpToken

  • SharpToken v2.0.2
  • .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  • Workload: Fully encode 15,000 articles from Simple English Wikipedia. Total tokens produced by FastBertTokenizer: 5,807,949 (~9.4m tokens/s single threaded).

This isn't an apples-to-apples comparison, as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. Both were applied to exactly the same corpus though.

| Method                        | Mean       | Error    | StdDev   | Gen0      | Gen1      | Allocated |
|-------------------------------|-----------:|---------:|---------:|----------:|----------:|----------:|
| SharpTokenFullArticles        | 1,551.9 ms | 25.82 ms | 24.15 ms | 5000.0000 | 2000.0000 |  32.56 MB |
| FastBertTokenizerFullArticles |   620.3 ms |  7.00 ms |  6.21 ms |         - |         - |   2.26 MB |

vs. HuggingFace tokenizers (Rust)

tokenizers v0.19.1

I'm not very experienced in benchmarking Rust code, but my attempts using criterion.rs (see src/HuggingfaceTokenizer/BenchRust) suggest that tokenizers takes around

  • batched/multi threaded: ~2 s (~2.9m tokens/s)
  • single threaded: ~10 s (~0.6m tokens/s)

to produce 5,807,947 tokens from the same 15k Simple English Wikipedia articles. Contrary to what one might expect, this means that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted, though, that tokenizers has a much more complete feature set, while FastBertTokenizer is specifically optimized for WordPiece/BERT encoding.

The tokenizers repo states: "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." As 26 MB of text takes ~2 s on my notebook CPU, 1 GB would take roughly 80 s. It seems plausible that "a server's CPU" is about 4x as fast as my notebook's, so my results look reasonable. It is, however, also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!
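The extrapolation above is simple proportional scaling. As a sanity check (the 26 MB corpus size and ~2 s timing are taken from the paragraph above):

```csharp
// Scale the measured corpus timing linearly to 1 GB.
double corpusMb = 26;         // size of the 15k-article corpus in MB
double secondsForCorpus = 2;  // measured multi-threaded tokenizers time
double secondsPerGb = 1024 / corpusMb * secondsForCorpus;
Console.WriteLine(secondsPerGb); // ~79, i.e. "roughly 80 s"
```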

vs. BERTTokenizers

  • BERTTokenizers v1.2.0
  • .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  • Workload: Prefixes of the contents of 15k Simple English Wikipedia articles, preprocessed to make them encodable by BERTTokenizers.
| Method                                     | Mean       | Error    | StdDev   | Gen0        | Gen1       | Gen2      | Allocated  |
|--------------------------------------------|-----------:|---------:|---------:|------------:|-----------:|----------:|-----------:|
| NMZivkovic_BertTokenizers                  | 2,576.0 ms | 15.49 ms | 13.73 ms | 968000.0000 | 40000.0000 | 1000.0000 | 3430.51 MB |
| FastBertTokenizer_SameDataAsBertTokenizers |   229.8 ms |  4.55 ms |  6.23 ms |           - |          - |         - |    1.03 MB |

Logo

Created by combining https://icons.getbootstrap.com/icons/cursor-text/ in .NET brand color with https://icons.getbootstrap.com/icons/braces/.


Known Issues

Add Decode support for input_id sequences that don't start at a word prefix

Additionally, the Decode method currently assumes that the first input_id passed to it represents the beginning of a word. If the first id represents a word suffix - that is, some arbitrary position in a list of input_ids that doesn't happen to be the start of a word - it will probably throw a KeyNotFoundException. This probably isn't the best way for this method to work, and I'm happy to fix it - if it helps you, I could do so quite soon; please let me know.

#38 (comment)
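The failure mode described above can be reproduced by decoding a slice that starts mid-word. A hedged sketch (which ids fall on word suffixes depends on the vocabulary; in the getting-started output, "lorem" is split into two WordPieces, so index 2 is a suffix token):

```csharp
using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, _, _) = tok.Encode("Lorem ipsum dolor sit amet.");

// Decoding from the start works: the first id ([CLS]) begins a "word".
Console.WriteLine(tok.Decode(inputIds.Span));

// Slicing past [CLS] and the first WordPiece of "lorem" starts the
// sequence at a suffix token - per the issue above, Decode will then
// probably throw a KeyNotFoundException.
try
{
    Console.WriteLine(tok.Decode(inputIds.Span[2..]));
}
catch (KeyNotFoundException)
{
    Console.WriteLine("Decode failed on a sequence starting mid-word.");
}
```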
