Comments (6)
Yes, I agree that this should at least be made clearer in the README, as other people have reported this as well.
Meanwhile, these links can be used to download the vocab files for the BERT models:
'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt"
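For instance, any one of these files can be fetched with the standard library alone; a minimal sketch, where the local filename is just an illustrative choice:

import urllib.request

# Download the bert-base-uncased vocab file to the current directory.
url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
urllib.request.urlretrieve(url, "bert-base-uncased-vocab.txt")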
Not sure if this is the best way, but as a workaround you can load the tokenizer from the transformers library and access its pretrained_vocab_files_map property, which contains all the download links (those should always be up to date).
Posting my method here, in case it's useful to anyone:
from tokenizers import BertWordPieceTokenizer
import os
import urllib.request
from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path):
    # Map of resource name -> download URL for the given model type.
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'
output_path = './my_local_vocab_files/'
os.makedirs(output_path, exist_ok=True)  # make sure the target directory exists
tokenizer = AutoTokenizer.from_pretrained(model_type)
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path)
# WordPiece has no merges file, so only the vocab file is needed here.
fast_tokenizer = BertWordPieceTokenizer(vocab_files.get('vocab_file'))
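Once built, the fast tokenizer can be sanity-checked like any tokenizers object; a quick check, where the sample sentence is arbitrary:

# Encode a test sentence with the fast tokenizer built above.
encoding = fast_tokenizer.encode("Hello, world!")  # returns an Encoding object
print(encoding.tokens)  # e.g. ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.ids)     # the corresponding vocab ids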
Thanks! I think it's a good idea to put these links somewhere in the tutorial.
Any updates on this?
The links to the vocab files should be in the README; it took me a while to figure this out. mar-muel's function works great.