Comments (8)
Hey, could you share a reproducer?
Part of the overhead comes from the fact that we keep track of offsets and a lot of other information, which tiktoken does not. We could potentially compute this only when asked for, and improve speed.
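As an illustration of that bookkeeping, here is a minimal sketch (assuming the `tokenizers` package is installed) of the per-piece character offsets the library computes, which tiktoken skips entirely:

```python
from tokenizers.pre_tokenizers import Whitespace

# Every pre-tokenized piece carries its (start, end) character offsets
# into the original string; tiktoken only returns token ids.
pre = Whitespace()
pieces = pre.pre_tokenize_str("hello world")
print(pieces)  # → [('hello', (0, 5)), ('world', (6, 11))]
```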
from tokenizers.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Running benchmarks and improving our code where needed is high on my priority list!
For HF, we use:

```python
import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"  # the long input text

start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
```
For tiktoken, we just initialize the tokenizer via tiktoken; everything else is the same:

```python
import tiktoken

# note: tiktoken expects the model name "gpt2", not "gpt-2"
tokenizer = tiktoken.encoding_for_model("gpt2")
```

Please let me know if you need any other information.
You are using GPT2Tokenizer, which is the slow one. Use GPT2TokenizerFast 😅
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
We actually dug in a bit:
- Rayon parallelism is kind of broken
- we have contention on the GPT2 cache
- we have memory allocations that are also slowing things down

With #1560, I was able to get performance similar to tiktoken's; stay posted 😉
One thing though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you are already going to get much better performance.
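A sketch of that pre-splitting, under the assumption that hard character boundaries are acceptable for your use case (the resulting chunks can then be fed to a batched encode call such as `tokenizer.encode_batch(chunks)` so the work parallelizes across the batch):

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Cut a very long string into fixed-size pieces so the tokenizer
    # sees a batch of short sequences instead of one huge one.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_into_chunks("a" * 10_000, chunk_size=2_000)
print(len(chunks))  # → 5
```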
Related Issues (20)
- documentation of the `pattern` parameter in `pre_tokenizers.Split` is incorrect HOT 1
- Custom fast PreTokenizer, ported via PyO3 to Python HOT 2
- Tokenizer.from_bytes() not available in python bindings HOT 4
- BPE Split pretokenization rule is not reflected in the vocabulary HOT 2
- Truncation performs slowly. Tokenizer firstly encodes long sequence and then truncates it. HOT 2
- apply_chat_template api usage consult
- Issue with `SentencePieceUnigramTokenizer` Handling Unknown Tokens HOT 1
- `train_from_iterator` out of memory on WMT14 `de` dataset HOT 2
- return pytorch tensors like in transformers? HOT 5
- Risk of global variable memory leaks when calling train_from_iterator HOT 1
- [building on windows] onig_sys/oniguruma two or more data types in declaration specifiers HOT 2
- ValueError: The following `model_kwargs` are not used by the model: ['num_beans'] (note: typos in the generate arguments will also show up in this list) HOT 1
- Support for Golang now or support a cli for other languages? HOT 2
- Token ID Out of Range & Indexing Assertion Errors During Training HOT 4
- [test-infra] Enable Codecov for tokenizers
- `RefMutContainer` is unsound HOT 3
- Space after unnormalized token is added when `use_fast=True` for Llama tokenizer HOT 10
- Can I use SentencePieceBPETokenizer to replace google/sentencepiece? HOT 6
- .NET bindings
- BPE trainer ignoring special tokens. HOT 3