Comments (6)
Yes, I agree that this should at least be made clearer in the README, as other people have reported this as well.
Meanwhile, these links can be used to download the vocab files for the BERT models:
'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt"
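For instance, any one of these files can be fetched with the standard library alone; a minimal sketch, where the local filename is just an illustrative choice:

import urllib.request

# Download the bert-base-uncased vocab file to the current directory.
url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
urllib.request.urlretrieve(url, "bert-base-uncased-vocab.txt")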
Not sure if this is the best way, but as a workaround you can load the tokenizer from the transformers library and access its pretrained_vocab_files_map property, which contains all the download links (those should always be up to date).
Posting my method here, in case it's useful to anyone:
from tokenizers import BertWordPieceTokenizer
import os
import urllib.request
from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path):
    # Map of resource name -> download URL for the given model type.
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'
output_path = './my_local_vocab_files/'
os.makedirs(output_path, exist_ok=True)  # make sure the target directory exists
tokenizer = AutoTokenizer.from_pretrained(model_type)
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path)
# WordPiece has no merges file, so only the vocab file is needed here.
fast_tokenizer = BertWordPieceTokenizer(vocab_files.get('vocab_file'))
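Once built, the fast tokenizer can be sanity-checked like any tokenizers object; a quick check, where the sample sentence is arbitrary:

# Encode a test sentence with the fast tokenizer built above.
encoding = fast_tokenizer.encode("Hello, world!")  # returns an Encoding object
print(encoding.tokens)  # e.g. ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.ids)     # the corresponding vocab ids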
Thanks! I think it's a good idea to put these links somewhere in the tutorial.
Any updates on this?
The links to the vocab files should be in the README; it took me a while to figure this out. mar-muel's function works great.