Comments (3)
Any news on this? Training from token counts would be very nice.
For instance, in my use case the original data is processed in Spark, so obtaining word counts is easy, while dumping all of the strings into files would be very slow.
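Concretely, getting the counts on the Spark side is only a few lines (just a sketch; the DataFrame df and its text column are illustrative names from my setup):

from pyspark.sql import functions as F

# Word counts without ever materializing the corpus as text files.
word_counts = (
    df.select(F.explode(F.split("text", r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .collect()  # -> list of Row(word=..., count=...)
)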
I'm really not sure giving the ability to feed word counts is a good idea, because the PreTokenizer is in charge of the pre-tokenization, and bypassing it would introduce potential discrepancies between training and tokenization. Also, this trainer API is subject to change in the future and may no longer be compatible with word counts.
What about being able to stream raw strings instead?
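Roughly something like this (just a sketch, assuming an iterator-accepting train method along the lines of the train_from_iterator that later landed in tokenizers; the Spark-side generator and the df/text names are illustrative):

from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=30000)

def stream_rows():
    # Hypothetical: stream raw strings straight out of Spark
    # instead of dumping them to files first.
    for row in df.toLocalIterator():
        yield row["text"]

# Pre-tokenization still runs on every string, so training and
# inference stay consistent.
tokenizer.train_from_iterator(stream_rows(), trainer=trainer)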
That would be useful too, albeit more cumbersome.
Would you mind expanding on the discrepancies you foresee? Currently I use a custom loop built on tfds SubwordTextEncoder methods: take in word counts -> iterate over the words and tokenize them -> for each token, increment a token counter by the source word's count -> build the encoder vocab from the token counter.
import os
from collections import defaultdict

import tensorflow_datasets as tfds
# text_encoder, _UNDERSCORE_REPLACEMENT and _prepare_tokens_for_encode are
# tfds internals (tensorflow_datasets.core.features.text); exact import paths
# vary between tfds versions.

# `events` is a Spark DataFrame with one row per event.
code_count = events.groupby("event_code").count().collect()
print("Starting indexing")
tokenizer = text_encoder.Tokenizer(
    alphanum_only=False, reserved_tokens=[_UNDERSCORE_REPLACEMENT]
)
# Weight every token by the word count coming out of Spark.
token_counts = defaultdict(int)
for row in code_count:
    tokens = tokenizer.tokenize(row["event_code"])
    tokens = _prepare_tokens_for_encode(tokens)
    for token in tokens:
        token_counts[token] += row["count"]
# Build the vocabulary via the hidden classmethod
# (`min_count` and `output_path` are defined elsewhere).
encoder = tfds.features.text.SubwordTextEncoder._build_from_token_counts(
    token_counts, min_count, [], 4, 40
)
encoder.save_to_file(os.path.join(output_path, "tensorflow_vocabulary_file"))
But this is (a) really slow when encoding and (b) ugly and hard to customize, since it relies on hidden tfds methods.
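As a stopgap that keeps the PreTokenizer in the loop, I could imagine lazily expanding the (word, count) pairs back into a stream of repeated strings and feeding that to a streaming trainer like the one sketched above (purely illustrative, not a real word-count API; word_counts here is a plain dict built from the Spark counts):

def expand_counts(word_counts):
    # Lazily repeat each word `count` times so frequencies are preserved;
    # memory stays flat, but very large counts make this slow.
    for word, count in word_counts.items():
        for _ in range(count):
            yield word

tokenizer.train_from_iterator(expand_counts(word_counts), trainer=trainer)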