huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Home Page: https://huggingface.co/transformers

License: Apache License 2.0

Python 99.22% Dockerfile 0.06% Shell 0.05% Makefile 0.01% Jsonnet 0.01% C 0.01% C++ 0.05% Cuda 0.59% Cython 0.01%
nlp natural-language-processing pytorch language-model tensorflow bert language-models pytorch-transformers nlp-library transformer model-hub pretrained-models jax flax seq2seq speech-recognition hacktoberfest python machine-learning deep-learning

transformers's Introduction

Hugging Face Transformers Library


State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow

🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

These models can be applied on:

  • ๐Ÿ“ Text, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.
  • ๐Ÿ–ผ๏ธ Images, for tasks like image classification, object detection, and segmentation.
  • ๐Ÿ—ฃ๏ธ Audio, for tasks like speech recognition and audio classification.

Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

🤗 Transformers is backed by the three most popular deep learning libraries (JAX, PyTorch and TensorFlow) with seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
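For example, a minimal interoperability sketch (assuming both the PyTorch and TensorFlow backends are installed; the checkpoint name follows the examples later in this README): save a model with PyTorch, then reload the same weights in TensorFlow via from_pt=True.

>>> from transformers import AutoModel, TFAutoModel

>>> pt_model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> pt_model.save_pretrained("./my-bert")  # writes PyTorch weights + config

>>> tf_model = TFAutoModel.from_pretrained("./my-bert", from_pt=True)  # reload the same weights in TF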

Online demos

You can test most of our models directly on their pages from the model hub. We also offer private model hosting, versioning, & an inference API for public and private models.

Here are a few examples:

In Natural Language Processing:

In Computer Vision:

In Audio:

In Multimodal tasks:

100 projects using Transformers

Transformers is more than a toolkit to use pretrained models: it's a community of projects built around it and the Hugging Face Hub. We want Transformers to enable developers, researchers, students, professors, engineers, and anyone else to build their dream projects.

To celebrate transformers reaching 100,000 stars, we decided to put the spotlight on the community and created the awesome-transformers page, which lists 100 incredible projects built with or around transformers.

If you own or use a project that you believe should be part of the list, please open a PR to add it!

If you are looking for custom support from the Hugging Face team

HuggingFace Expert Acceleration Program

Quick tour

To immediately use a model on a given input (text, image, audio, ...), we provide the pipeline API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:

>>> from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here, the answer is "positive" with a confidence of 99.97%.
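If you need results that stay stable across library releases, you can also pin the pipeline to an explicit checkpoint from the Hub instead of relying on the task default (the model id below is one example of a sentiment checkpoint):

# Pin an explicit model instead of using the task's default checkpoint
>>> classifier = pipeline('sentiment-analysis',
...                       model='distilbert-base-uncased-finetuned-sst-2-english')
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')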

Many tasks have a pre-trained pipeline ready to go, in NLP but also in computer vision and speech. For example, we can easily extract detected objects in an image:

>>> import requests
>>> from PIL import Image
>>> from transformers import pipeline

# Download an image with cute cats
>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
>>> image_data = requests.get(url, stream=True).raw
>>> image = Image.open(image_data)

# Allocate a pipeline for object detection
>>> object_detector = pipeline('object-detection')
>>> object_detector(image)
[{'score': 0.9982201457023621,
  'label': 'remote',
  'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
 {'score': 0.9960021376609802,
  'label': 'remote',
  'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
 {'score': 0.9954745173454285,
  'label': 'couch',
  'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
 {'score': 0.9988006353378296,
  'label': 'cat',
  'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
 {'score': 0.9986783862113953,
  'label': 'cat',
  'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]

Here, we get a list of objects detected in the image, with a box surrounding the object and a confidence score. Here is the original image on the left, with the predictions displayed on the right:
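If you want to reproduce such an overlay yourself, here is a small sketch using Pillow's ImageDraw (the box coordinates come straight from the pipeline output above; the output filename is arbitrary):

>>> from PIL import ImageDraw

# Draw each predicted box and label onto the image
>>> predictions = object_detector(image)
>>> draw = ImageDraw.Draw(image)
>>> for pred in predictions:
...     box = pred['box']
...     draw.rectangle([box['xmin'], box['ymin'], box['xmax'], box['ymax']], outline='red', width=3)
...     draw.text((box['xmin'], box['ymin']), f"{pred['label']}: {pred['score']:.2f}", fill='red')
>>> image.save('predictions.png')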

You can learn more about the tasks supported by the pipeline API in this tutorial.

In addition to pipeline, to download and use any of the pretrained models on your given task, all it takes is three lines of code. Here is the PyTorch version:

>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)

And here is the equivalent code for TensorFlow:

>>> from transformers import AutoTokenizer, TFAutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(**inputs)

The tokenizer is responsible for all the preprocessing the pretrained model expects and can be called directly on a single string (as in the above examples) or a list. It will output a dictionary that you can use in downstream code or pass directly to your model using the ** argument-unpacking operator.
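For instance, a quick sketch of batching several sentences at once (padding and truncation are standard tokenizer arguments; this continues the PyTorch example above):

# Tokenize a batch: shorter sentences are padded, longer ones truncated
>>> batch = tokenizer(["Hello world!", "Transformers is a library."],
...                   padding=True, truncation=True, return_tensors="pt")
# batch behaves like a dict with input_ids, attention_mask (and token_type_ids for BERT)
>>> outputs = model(**batch)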

The model itself is a regular PyTorch nn.Module or a TensorFlow tf.keras.Model (depending on your backend) which you can use as usual. This tutorial explains how to integrate such a model into a classic PyTorch or TensorFlow training loop, or how to use our Trainer API to quickly fine-tune on a new dataset.
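As a rough illustration of the Trainer route, here is a minimal, self-contained sketch (the tiny in-memory dataset and the hyperparameters are placeholders, not a recommended setup):

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2)

texts = ["I love this!", "This is terrible."]   # toy data, illustration only
labels = [1, 0]
encodings = tokenizer(texts, padding=True, truncation=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()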

Why should I use transformers?

  1. Easy-to-use state-of-the-art models:

    • High performance on natural language understanding & generation, computer vision, and audio tasks.
    • Low barrier to entry for educators and practitioners.
    • Few user-facing abstractions with just three classes to learn.
    • A unified API for using all our pretrained models.
  2. Lower compute costs, smaller carbon footprint:

    • Researchers can share trained models instead of always retraining.
    • Practitioners can reduce compute time and production costs.
    • Dozens of architectures with over 400,000 pretrained models across all modalities.
  3. Choose the right framework for every part of a model's lifetime:

    • Train state-of-the-art models in 3 lines of code.
    • Move a single model between TF2.0/PyTorch/JAX frameworks at will.
    • Seamlessly pick the right framework for training, evaluation, and production.
  4. Easily customize a model or an example to your needs:

    • We provide examples for each architecture to reproduce the results published by its original authors.
    • Model internals are exposed as consistently as possible.
    • Model files can be used independently of the library for quick experiments.

Why shouldn't I use transformers?

  • This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
  • The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library (possibly, Accelerate).
  • While we strive to present as many use cases as possible, the scripts in our examples folder are just that: examples. It is expected that they won't work out-of-the-box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs.

Installation

With pip

This repository is tested on Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, and TensorFlow 2.6+.

You should install 🤗 Transformers in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.

Then, you will need to install at least one of Flax, PyTorch, or TensorFlow. Please refer to the TensorFlow installation page, the PyTorch installation page, and/or the Flax and JAX installation pages for the specific installation command for your platform.

When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:

pip install transformers

If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must install the library from source.

With conda

🤗 Transformers can be installed using conda as follows:

conda install conda-forge::transformers

NOTE: Installing transformers from the huggingface channel is deprecated.

Follow the installation pages of Flax, PyTorch or TensorFlow to see how to install them with conda.

NOTE: On Windows, you may be prompted to activate Developer Mode in order to benefit from caching. If this is not an option for you, please let us know in this issue.

Model architectures

All the model checkpoints provided by 🤗 Transformers are seamlessly integrated from the huggingface.co model hub, where they are uploaded directly by users and organizations.


🤗 Transformers currently provides the following architectures: see here for a high-level summary of each of them.

To check if each model has an implementation in Flax, PyTorch or TensorFlow, or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to this table.

These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations. You can find more details on performance in the Examples section of the documentation.

Learn more

Section Description
Documentation Full API documentation and tutorials
Task summary Tasks supported by 🤗 Transformers
Preprocessing tutorial Using the Tokenizer class to prepare data for the models
Training and fine-tuning Using the models provided by 🤗 Transformers in a PyTorch/TensorFlow training loop and the Trainer API
Quick tour: Fine-tuning/usage scripts Example scripts for fine-tuning models on a wide range of tasks
Model sharing and uploading Upload and share your fine-tuned models with the community

Citation

We now have a paper you can cite for the 🤗 Transformers library:

@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rรฉmi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45"
}

transformers's People

Contributors

aaugustin, amyeroberts, arthurzucker, dependabot[bot], fxmarty, gante, jplu, julien-c, lysandrejik, mfuntowicz, mrm8488, muellerzr, narsil, nielsrogge, pacman100, patil-suraj, patrickvonplaten, rlouf, rocketknight1, sanchit-gandhi, sgugger, sshleifer, stas00, stefan-it, stevhliu, sunmarc, thomwolf, victorsanh, ydshieh, younesbelkada


transformers's Issues

Swapped to_seq_len/from_seq_len in comment

I'm pretty sure this comment:

https://github.com/huggingface/pytorch-pretrained-BERT/blob/2c5d993ba48841575d9c58f0754bca00b288431c/modeling.py#L339-L343

should instead say:

# Sizes are [batch_size, 1, 1, to_seq_length] 
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] 

When masking out tokens for attention, it doesn't matter what happens to attention from padding tokens, only that there is no attention to padding tokens.

I don't believe the code is doing what the comment currently suggests because that would be an implementation flaw.

run_squad questions

Thanks a lot for the port! I have some minor questions. For the run_squad file, I see two options for accumulating gradients, accumulate_gradients and gradient_accumulation_steps, but it seems to me that they could be combined into one. The other one is about the global_step variable: it seems we are only counting it but not using it in gradient accumulation. Thanks again!

Multilingual Issue

Dear authors,
I have two questions.

First, how can I use the multilingual pre-trained BERT in PyTorch?
Is it just a matter of downloading the model to $BERT_BASE_DIR?

Second is a tokenization issue.
For Chinese and Japanese the tokenizer may work; however, for Korean it shows a different result than I expected:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "์•ˆ๋…•ํ•˜์„ธ์š”"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']

The result is based not on 'characters' but on 'byte-based characters' (decomposed jamo).
Maybe it comes from a Unicode issue. (I expect ['안녕', '##하세요'])

Command-line interface Document Bug

There is a bug in README.md about Command-line interface:
export BERT_BASE_DIR=chinese_L-12_H-768_A-12

Wrong:

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
   --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt.index \
   --bert_config_file $BERT_BASE_DIR/bert_config.json \
   --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin

Right:

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
  $BERT_BASE_DIR/bert_model.ckpt.index \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin

`TypeError: object of type 'NoneType' has no len()` when tuning on squad

When running the following command for tuning on squad, I am getting a petty error inside logger TypeError: object of type 'NoneType' has no len(). Any thoughts what could be the main cause of the problem?

Full log:

 python3.6 examples/run_squad.py \
>   --bert_model bert-base-uncased \
>   --do_train \
>   --do_predict \
>   --train_file $SQUAD_DIR/train-v1.1.json \
>   --predict_file $SQUAD_DIR/dev-v1.1.json \
>   --train_batch_size 12 \
>   --learning_rate 3e-5 \
>   --num_train_epochs 2.0 \
>   --max_seq_length 384 \
>   --doc_stride 128 \
>   --output_dir out

.
.
.

11/29/2018 23:10:14 - INFO - __main__ -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/29/2018 23:10:14 - INFO - __main__ -   segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/29/2018 23:10:14 - INFO - __main__ -   start_position: 47
11/29/2018 23:10:14 - INFO - __main__ -   end_position: 48
11/29/2018 23:10:14 - INFO - __main__ -   answer: the 1870s
11/29/2018 23:14:38 - INFO - __main__ -     Saving train features into cached file /shared/shelley/khashab2/pytorch-pretrained-BERT/squad/train-v1.1.json_bert-base-uncased_384_128_64
11/29/2018 23:14:51 - INFO - __main__ -   ***** Running training *****
11/29/2018 23:14:51 - INFO - __main__ -     Num orig examples = 87599
Traceback (most recent call last):
  File "examples/run_squad.py", line 989, in <module>
    main()
  File "examples/run_squad.py", line 884, in main
    logger.info("  Num split examples = %d", len(train_features))
TypeError: object of type 'NoneType' has no len()

grad is None in squad example

Hi guys, I tried the run_squad example and got the following:

Traceback (most recent call last):                                                                                                                                                                 | 0/7331 [00:00<?, ?it/s]
  File "examples/run_squad.py", line 973, in <module>
    main()
  File "examples/run_squad.py", line 904, in main
    param.grad.data = param.grad.data / args.loss_scale
AttributeError: 'NoneType' object has no attribute 'data'

I find that one of the param.grads is None, so param.grad.data doesn't exist.
By the way, I downloaded the data myself from the URLs in this project. My OS is Ubuntu 18.04, PyTorch 0.4.1, GPU 1080 Ti.

Has anyone else encountered this situation?
Would appreciate any help, thanks in advance...

ValueError while using --optimize_on_cpu

Traceback (most recent call last): | 1/87970 [00:00<8:35:35, 2.84it/s]
  File "./run_squad.py", line 990, in <module>
    main()
  File "./run_squad.py", line 922, in main
    is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
  File "./run_squad.py", line 691, in set_optimizer_params_grad
    if test_nan and torch.isnan(param_model.grad).sum() > 0:
  File "/people/sanjay/anaconda2/envs/bert_pytorch/lib/python3.5/site-packages/torch/functional.py", line 289, in isnan
    raise ValueError("The argument is not a tensor", str(tensor))
ValueError: ('The argument is not a tensor', 'None')

Command:
CUDA_VISIBLE_DEVICES=0 python ./run_squad.py
--vocab_file bert_large/uncased_L-24_H-1024_A-16/vocab.txt
--bert_config_file bert_large/uncased_L-24_H-1024_A-16/bert_config.json
--init_checkpoint bert_large/uncased_L-24_H-1024_A-16/pytorch_model.bin
--do_lower_case
--do_train
--do_predict
--train_file squad_dir/train-v1.1.json
--predict_file squad_dir/dev-v1.1.json
--learning_rate 3e-5
--num_train_epochs 2
--max_seq_length 384
--doc_stride 128
--output_dir outputs
--train_batch_size 4
--gradient_accumulation_steps 2
--optimize_on_cpu

Error while using --optimize_on_cpu only.
Works fine without the argument.

GPU: Nvidia GTX 1080Ti Single GPU.

PS: I can only fit in train_batch_size 4 on the memory of a single GPU.

Assertion `srcIndex < srcSelectDimSize` failed.

Sorry to bother you
I recently used your extract_features.py to extract features from a dataset, but it failed. The error information is as follows:
/opt/conda/conda-bld/pytorch_1532584813488/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [11,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "examples/extract_features.py", line 405, in <module>
    main()
  File "examples/extract_features.py", line 375, in main
    all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 610, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 328, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 313, in forward
    attention_output = self.attention(hidden_states, attention_mask)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 273, in forward
    self_output = self.self(input_tensor, attention_mask)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 224, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/jiaofangkai/anaconda3/envs/allennlp-env/lib/python3.7/site-packages/torch/nn/functional.py", line 1026, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1532584813488/work/aten/src/THC/THCGeneral.cpp:333
It seems that the index_select function in the models crashed. I read my own data from JSON files and construct examples from them. I set the batch size equal to 1 and modified max_seq_length to the max length of the input sentences.
Thanks for your help!

Feature extraction for sequential labelling

Hi, I have a question in terms of using BERT for sequential labeling task.
Please correct me if I'm wrong.
My understanding is:

  1. Use BertModel loaded with pretrained weights instead of MaskedBertModel.
  2. In such a case, given a sequence of tokens as input, BertModel outputs a list of hidden states; I only use the top-layer hidden states as the embedding for that sequence.
  3. Then, to fine-tune the model, add a linear fully connected layer and a softmax to make the final decision.

Is this entire process correct (roughly as in the sketch below)? I followed this procedure but could not get any results.

Thank you!
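A rough sketch of the setup described above, using the modern transformers API (the label count and classifier head here are illustrative placeholders):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
bert = AutoModel.from_pretrained("google-bert/bert-base-uncased")

num_labels = 5                                  # e.g. BIO tags; assumption
classifier = torch.nn.Linear(bert.config.hidden_size, num_labels)

inputs = tokenizer("John lives in Berlin", return_tensors="pt")
hidden = bert(**inputs).last_hidden_state       # [batch, seq_len, hidden_size]
logits = classifier(hidden)                     # [batch, seq_len, num_labels]
predictions = logits.argmax(dim=-1)             # one tag id per word piece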

issues with accents on convert_ids_to_tokens()

Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:

Example:

  • original sentence: "great breakfasts in a nice furnished cafè, slightly bohemian."
  • corresponding list of tokens produced: ['great', 'breakfast', '##s', 'in', 'a', 'nice', 'fur', '##nis', '##hed', 'cafe', ',', 'slightly', 'bohemia', '##n', '.']

The problem here is "cafe", which has lost its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer; I also tried "Bert-base-uncased" and experienced the same issue.

Thanks for this great work!

How to detokenize a BertTokenizer output?

I was wondering if there's a proper way of detokenizing the output tokens, i.e., constructing the sentence back from the tokens? Considering the fact that the word-piece tokenisation introduces lots of #s.
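One common workaround is to merge the '##' continuation pieces back together; a minimal sketch (newer tokenizers also expose convert_tokens_to_string for this):

def detokenize(tokens):
    # Rejoin WordPiece tokens by stripping the '##' continuation prefix
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        else:
            text += (" " if text else "") + token
    return text

print(detokenize(['who', 'was', 'jim', 'henson', '?']))      # who was jim henson ?
print(detokenize(['fur', '##nis', '##hed', 'cafe']))         # furnished cafe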

example for is next sentence

Can you make up a working example for 'is next sentence'

Is this expected to work properly ?

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "Who was Jim Morrison ? Jim Morrison was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
model.eval()

# Predict is Next Sentence ?
predictions = model(tokens_tensor, segments_tensors)
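For what it's worth, a sketch of how the returned scores are usually interpreted (assuming this older pytorch_pretrained_bert API, where the model returns a [batch_size, 2] tensor and index 0 means "sentence B follows sentence A"):

import torch

# Turn the next-sentence scores into probabilities
probs = torch.nn.functional.softmax(predictions, dim=-1)
print(probs[0, 0].item())   # probability that the second sentence follows the first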

Checkpoints not saved

There is an option save_checkpoints_steps that seems to control checkpointing. However, there is no actual saving operation in the run_* scripts. So, should we add that functionality or remove this argument?

Specify a model from a specific directory for extract_features.py

I have downloaded the model and vocab files into a specific location, using their original file names, so my directory for bert-base-cased contains:

bert-base-cased-vocab.txt
bert_config.json
pytorch_model.bin

But when I try to specify the directory which contains these files for the --bert_model parameter of extract_features.py I get the following error:

ValueError: Can't find a vocabulary file at path <THEDIRECTORYPATHISPECIFIED> ...

When I specify a file that exists and is a proper file, the error messages seem to indicate that the program wants to untar and uncompress the files.

Is there no way to just specify a specific directory that contains the vocab, config, and model files?

Typo in README

I think I spotted a typo in the README file under the Usage header. There is a piece of code that uses BertTokenizer and the typo is on this line:
tokenized_text = "Who was Jim Henson ? Jim Henson was a puppeteer"

I think tokenized_text should be replaced with text, since the next line is
tokenized_text = tokenizer.tokenize(text)

py2 code

If I convert the code to a Python 2 version, it can't converge; would you provide Python 2 code?

3 sentences as input for BertForSequenceClassification?

Hi there,

Thanks for releasing this awesome repo, it does lots of people like me a great favor.

So far I've tried the sentence-pair BertForSequenceClassification task, and it indeed works. I'd like to know if it is possible to use BertForSequenceClassification to model a triple-sentence classification problem whose input can be described as below:

**[CLS]A[SEP]B[SEP]C[SEP]**

Expecting for your reply!

Thanks & Regards

how to load checkpoint?

I downloaded the model from BERT; it only has model.ckpt.data, model.ckpt.meta and model.ckpt.index, and I don't know which one to load. What is the checkpoint file for convert.py?

Failure during pytest (and solution for python3)

foo@bar:~/foo/bar/pytorch-pretrained-BERT$ pytest -sv ./tests/
===================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.6.6, pytest-3.9.1, py-1.7.0, pluggy-0.8.0 -- /home/foo/.pyenv/versions/anaconda3-5.1.0/bin/python
cachedir: .pytest_cache
rootdir: /data1/users/foo/bar/pytorch-pretrained-BERT, inifile:
plugins: remotedata-0.3.0, openfiles-0.3.0, doctestplus-0.1.3, cov-2.6.0, arraydiff-0.2, flaky-3.4.0
collected 0 items / 3 errors

=========================================================================================================================== ERRORS ============================================================================================================================
___________________________________________________________________________________________________________ ERROR collecting tests/modeling_test.py ___________________________________________________________________________________________________________
ImportError while importing test module '/data1/users/foo/bar/pytorch-pretrained-BERT/tests/modeling_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/modeling_test.py:25: in <module>
    import modeling
E   ModuleNotFoundError: No module named 'modeling'
_________________________________________________________________________________________________________ ERROR collecting tests/optimization_test.py _________________________________________________________________________________________________________
ImportError while importing test module '/data1/users/foo/bar/pytorch-pretrained-BERT/tests/optimization_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/optimization_test.py:23: in <module>
    import optimization
E   ModuleNotFoundError: No module named 'optimization'
_________________________________________________________________________________________________________ ERROR collecting tests/tokenization_test.py _________________________________________________________________________________________________________
ImportError while importing test module '/data1/users/foo/bar/pytorch-pretrained-BERT/tests/tokenization_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/tokenization_test.py:22: in <module>
    import tokenization
E   ModuleNotFoundError: No module named 'tokenization'
===Flaky Test Report===


===End Flaky Test Report===
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=================================================================================================================== 3 error in 0.60 seconds ==================================================================================================================

In python 3, python -m pytest -sv tests/ works fine.

Can not find vocabulary file for Chinese model

After I convert the TF model to a PyTorch model, I run a classification task on a new Chinese dataset, but get this:

CUDA_VISIBLE_DEVICES=3 python run_classifier.py --task_name weibo --do_eval --do_train --bert_model chinese_L-12_H-768_A-12 --max_seq_length 128 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir bert_result

11/18/2018 21:56:59 - INFO - __main__ - device cuda n_gpu 1 distributed training False
11/18/2018 21:56:59 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file chinese_L-12_H-768_A-12
Traceback (most recent call last):
  File "run_classifier.py", line 661, in <module>
    main()
  File "run_classifier.py", line 508, in main
    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
  File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 141, in from_pretrained
    tokenizer = cls(resolved_vocab_file, do_lower_case)
  File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 94, in __init__
    "model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
ValueError: Can't find a vocabulary file at path 'chinese_L-12_H-768_A-12'. To load the vocabulary from a Google pretrained model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

model loading the checkpoint error

RuntimeError: Error(s) in loading state_dict for BertModel:
size mismatch for embeddings.token_type_embeddings.weight: copying a param of torch.Size([16, 768]) from checkpoint, where the shape is torch.Size([2, 768]) in current model.

MRPC hyperparameters question

When describing how you reproduced the MRPC results, you say:
"Our test ran on a few seeds with the original implementation hyper-parameters gave evaluation results between 82 and 87."
and you link to the SQuAD hyperparameters (https://github.com/google-research/bert#squad).

Is the link a mistake? Or did you use the SQuAD hyperparameters for tuning on MRPC? More generally, I'm wondering if there's a reason the MRPC dev set accuracy is slightly lower (in [82, 87] vs. [84, 88] reported by Google)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>

Installed pytorch-pretrained-BERT from source, Python 3.7, Windows 10

When I run the following snippet:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

I get the following:


UnicodeDecodeError Traceback (most recent call last)
in ()
3
4 # Load pre-trained model tokenizer (vocabulary)
----> 5 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in from_pretrained(cls, pretrained_model_name, do_lower_case)
139 vocab_file, resolved_vocab_file))
140 # Instantiate tokenizer.
--> 141 tokenizer = cls(resolved_vocab_file, do_lower_case)
142 except FileNotFoundError:
143 logger.error(

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in init(self, vocab_file, do_lower_case)
93 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
94 "model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
---> 95 self.vocab = load_vocab(vocab_file)
96 self.ids_to_tokens = collections.OrderedDict(
97 [(ids, tok) for tok, ids in self.vocab.items()])

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in load_vocab(vocab_file)
68 with open(vocab_file, "r", encoding="utf8") as reader:
69 while True:
---> 70 token = convert_to_unicode(reader.readline())
71 if not token:
72 break

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>

Multi-GPU training vs Distributed training

Hi,

I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.

I have a 4-GPU server, and was trying to run run_classifier.py in two ways:

(a) run single-node distributed training with 4 processes and a minibatch of 32 each
(b) run multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same

Intuitively I believe (a) and (b) should yield close accuracy and training times. Below please find my observations:

  1. (a) runs ~20% faster than (b).
  2. (b) yields a final evaluation accuracy ~4% better than (a).

The first looks reasonable, since I guess the loss.mean() is done by the CPU, which may be slower than using NCCL directly? However, I don't quite understand the second observation. Can you please give any hint or reference about the possible cause?

Thanks!

Missing options/arguments in run_squad.py for BERT Large

Thanks for the great code..However, the run_squad.py for BERT Large seems to not have the vocab_file and bert_config_file (or other) options/arguments. Did you push the latest version?
Also, it is looking for a pytorch model file (a bin file). Does it need to be there?

I also had to add this line to the file to make BERT base to run on Squad 1.1:
parser.add_argument('--do_lower_case', action="store_true", default=True, help="Lowercase the input")

BERTConfigs in example usages in `modeling.py` are not OK (?)

Hi!

In the config definition https://github.com/huggingface/pytorch-pretrained-BERT/blob/21f0196412115876da1c38652d22d1f7a14b36ff/pytorch_pretrained_bert/modeling.py#L848
in the Example usage of BertForSequenceClassification in modeling.py, there are things I don't understand:

Am I missing something?

using BERT as a language Model

I was trying to use BERT as a language model to assign a score (could be a PPL score) to a given sentence. Something like
P("He is go to school")=0.008
P("He is going to school")=0.08
which indicates that the probability of the second sentence is higher than that of the first. Is there a way to get a score like this?

Thanks
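A minimal sketch of one common workaround, the pseudo-log-likelihood: mask each token in turn with a masked-LM head and sum the log-probabilities of the original tokens (shown here with the modern AutoModelForMaskedLM API rather than pytorch_pretrained_bert; note this is not a true LM probability):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id     # mask one position at a time
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pseudo_log_likelihood("He is going to school"))
print(pseudo_log_likelihood("He is go to school"))   # typically lower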

not good when I use BERT for seq2seq model in keyphrase generation

Hi,

Recently, I have been researching keyphrase generation. Usually, people use a seq2seq-with-attention model to deal with this problem. Specifically, I use this framework: https://github.com/memray/seq2seq-keyphrase-pytorch, which is an implementation of http://memray.me/uploads/acl17-keyphrase-generation.pdf.

Now I have just changed its encoder to BERT, but the results are not good. An experimental comparison of the two models is in the attachment.

Can you give me some advice on whether what I did is reasonable and whether BERT is suitable for such a task?

Thanks.
RNN vs BERT in Keyphrase generation.pdf

example in BertForSequenceClassification() conflicts with the api

Hi, firstly, admire you for the great job, but I encountered 2 problems when using it:
1. UnicodeDecodeError: 'gbk' codec can't decode byte 0x85 in position 4527: illegal multibyte sequence,
the same problem as issue 52, when I execute BertTokenizer.from_pretrained('bert-base-uncased'); but I can successfully execute BertForNextSentencePrediction.from_pretrained('bert-base-uncased'), >.<
2. In pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py, line 761 says:

token_type_ids: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).

but in the following example, in line 784, token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]]): why does the '2' appear? I am confused. Also, is a situation like '0, 1, 0' correct? Or should it be like [000000111111], that is, continuous '0's followed by continuous '1's?
ty.

Issue of `bert_model` arg in `run_classify.py`

Hi,

I am trying to understand the bert_model arg in run_classify.py. In the file, I can see

tokenizer = BertTokenizer.from_pretrained(args.bert_model)

where bert_model is expected to be the vocab text file of the model

However, I also see

model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))

where bert_model is expected to be a archive file containing the model checkpoint and config.

Please advise on the correct use of bert_model if I already have my pretrained model converted locally.

Thanks!

Crash at the end of training

Hi, I tried running the Squad model this morning (on a single GPU with gradient accumulation over 3 steps) but after 3 hours of training, my job failed with the following output:

I was running the code, unmodified, from commit 3bfbc21

Is this an issue you know about?

11/08/2018 17:50:03 - INFO - __main__ -   device cuda n_gpu 1 distributed training False
11/08/2018 17:50:18 - INFO - __main__ -   *** Example ***
11/08/2018 17:50:18 - INFO - __main__ -   unique_id: 1000000000
11/08/2018 17:50:18 - INFO - __main__ -   example_index: 0
11/08/2018 17:50:18 - INFO - __main__ -   doc_span_index: 0
11/08/2018 17:50:18 - INFO - __main__ -   tokens: [CLS] to whom did the virgin mary allegedly appear in 1858 in lou ##rdes france ? [SEP] architectural ##ly , the school has a catholic character . atop the main building ' s gold dome is a golden statue of the virgin mary . immediately in front of the main building and facing it , is a copper statue of christ with arms up ##rai ##sed with the legend " ve ##ni ##te ad me om ##nes " . next to the main building is the basilica of the sacred heart . immediately behind the basilica is the gr ##otto , a marian place of prayer and reflection . it is a replica of the gr ##otto at lou ##rdes , france where the virgin mary reputed ##ly appeared to saint bern ##ade ##tte so ##ub ##iro ##us in 1858 . at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ) , is a simple , modern stone statue of mary . [SEP]
11/08/2018 17:50:18 - INFO - __main__ -   token_to_orig_map: 17:0 18:0 19:0 20:1 21:2 22:3 23:4 24:5 25:6 26:6 27:7 28:8 29:9 30:10 31:10 32:10 33:11 34:12 35:13 36:14 37:15 38:16 39:17 40:18 41:19 42:20 43:20 44:21 45:22 46:23 47:24 48:25 49:26 50:27 51:28 52:29 53:30 54:30 55:31 56:32 57:33 58:34 59:35 60:36 61:37 62:38 63:39 64:39 65:39 66:40 67:41 68:42 69:43 70:43 71:43 72:43 73:44 74:45 75:46 76:46 77:46 78:46 79:47 80:48 81:49 82:50 83:51 84:52 85:53 86:54 87:55 88:56 89:57 90:58 91:58 92:59 93:60 94:61 95:62 96:63 97:64 98:65 99:65 100:65 101:66 102:67 103:68 104:69 105:70 106:71 107:72 108:72 109:73 110:74 111:75 112:76 113:77 114:78 115:79 116:79 117:80 118:81 119:81 120:81 121:82 122:83 123:84 124:85 125:86 126:87 127:87 128:88 129:89 130:90 131:91 132:91 133:91 134:92 135:92 136:92 137:92 138:93 139:94 140:94 141:95 142:96 143:97 144:98 145:99 146:100 147:101 148:102 149:102 150:103 151:104 152:105 153:106 154:107 155:108 156:109 157:110 158:111 159:112 160:113 161:114 162:115 163:115 164:115 165:116 166:117 167:118 168:118 169:119 170:120 171:121 172:122 173:123 174:123
11/08/2018 17:50:18 - INFO - __main__ -   token_is_max_context: 17:True 18:True 19:True 20:True 21:True 22:True 23:True 24:True 25:True 26:True 27:True 28:True 29:True 30:True 31:True 32:True 33:True 34:True 35:True 36:True 37:True 38:True 39:True 40:True 41:True 42:True 43:True 44:True 45:True 46:True 47:True 48:True 49:True 50:True 51:True 52:True 53:True 54:True 55:True 56:True 57:True 58:True 59:True 60:True 61:True 62:True 63:True 64:True 65:True 66:True 67:True 68:True 69:True 70:True 71:True 72:True 73:True 74:True 75:True 76:True 77:True 78:True 79:True 80:True 81:True 82:True 83:True 84:True 85:True 86:True 87:True 88:True 89:True 90:True 91:True 92:True 93:True 94:True 95:True 96:True 97:True 98:True 99:True 100:True 101:True 102:True 103:True 104:True 105:True 106:True 107:True 108:True 109:True 110:True 111:True 112:True 113:True 114:True 115:True 116:True 117:True 118:True 119:True 120:True 121:True 122:True 123:True 124:True 125:True 126:True 127:True 128:True 129:True 130:True 131:True 132:True 133:True 134:True 135:True 136:True 137:True 138:True 139:True 140:True 141:True 142:True 143:True 144:True 145:True 146:True 147:True 148:True 149:True 150:True 151:True 152:True 153:True 154:True 155:True 156:True 157:True 158:True 159:True 160:True 161:True 162:True 163:True 164:True 165:True 166:True 167:True 168:True 169:True 170:True 171:True 172:True 173:True 174:True
11/08/2018 17:50:18 - INFO - __main__ -   input_ids: 101 2000 3183 2106 1996 6261 2984 9382 3711 1999 8517 1999 10223 26371 2605 1029 102 6549 2135 1010 1996 2082 2038 1037 3234 2839 1012 10234 1996 2364 2311 1005 1055 2751 8514 2003 1037 3585 6231 1997 1996 6261 2984 1012 3202 1999 2392 1997 1996 2364 2311 1998 5307 2009 1010 2003 1037 6967 6231 1997 4828 2007 2608 2039 14995 6924 2007 1996 5722 1000 2310 3490 2618 4748 2033 18168 5267 1000 1012 2279 2000 1996 2364 2311 2003 1996 13546 1997 1996 6730 2540 1012 3202 2369 1996 13546 2003 1996 24665 23052 1010 1037 14042 2173 1997 7083 1998 9185 1012 2009 2003 1037 15059 1997 1996 24665 23052 2012 10223 26371 1010 2605 2073 1996 6261 2984 22353 2135 2596 2000 3002 16595 9648 4674 2061 12083 9711 2271 1999 8517 1012 2012 1996 2203 1997 1996 2364 3298 1006 1998 1999 1037 3622 2240 2008 8539 2083 1017 11342 1998 1996 2751 8514 1007 1010 2003 1037 3722 1010 2715 2962 6231 1997 2984 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/08/2018 17:50:18 - INFO - __main__ -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

... [truncated] ...

Iteration: 100%|█████████▉| 29314/29324 [3:27:55<00:04,  2.36it/s]
Iteration: 100%|█████████▉| 29315/29324 [3:27:55<00:03,  2.44it/s]
Iteration: 100%|█████████▉| 29316/29324 [3:27:56<00:03,  2.26it/s]
Iteration: 100%|█████████▉| 29317/29324 [3:27:56<00:02,  2.35it/s]
Iteration: 100%|█████████▉| 29318/29324 [3:27:56<00:02,  2.44it/s]
Iteration: 100%|█████████▉| 29319/29324 [3:27:57<00:02,  2.25it/s]
Iteration: 100%|█████████▉| 29320/29324 [3:27:57<00:01,  2.35it/s]
Iteration: 100%|█████████▉| 29321/29324 [3:27:58<00:01,  2.41it/s]
Iteration: 100%|█████████▉| 29322/29324 [3:27:58<00:00,  2.25it/s]
Iteration: 100%|█████████▉| 29323/29324 [3:27:59<00:00,  2.36it/s]
Traceback (most recent call last):
  File "code/run_squad.py", line 929, in <module>
    main()
  File "code/run_squad.py", line 862, in main
    loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/0x0d4ff90d01fa4168983197b17d73bb0c_dependencies/code/modeling.py", line 467, in forward
    start_loss = loss_fct(start_logits, start_positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1403, in nll_loss
    if input.size(0) != target.size(0):
RuntimeError: dimension specified as 0 but tensor has no dimensions

Exception ignored in: <bound method tqdm.__del__ of Iteration: 100%|█████████▉| 29323/29324 [3:27:59<00:00,  2.36it/s]>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 931, in __del__
    self.close()
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 1133, in close
    self._decr_instances(self)
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 496, in _decr_instances
    cls.monitor.exit()
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.6/threading.py", line 1053, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

Unseen Vocab

Thank you so much for this well-documented and easy-to-understand implementation! I remember meeting you at WeCNLP and am so happy to see you push out usable implementations of the state of the art in PyTorch for the community!!!!!

I have a question: The convert_tokens_to_ids method in the BertTokenizer that provides input to the BertEncoder uses an OrderedDict for the vocab attribute, which throws an error (e.g. KeyError: 'ketorolac') for any words not in the vocab. Can I create another vocab object that adds unseen words and use that in the tokenizer? Does the pretrained BertEncoder depend on the default id mapping?

It seems to me that ideally in the long-term, this repo would incorporate character level embeddings to deal with unseen words, but idk if that is necessary for this use-case.
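On the vocab question: unseen words are normally split into word pieces (or mapped to [UNK]) rather than raising, but if you really need whole new tokens, here is a sketch with the modern API (the added word is just an example):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

num_added = tokenizer.add_tokens(["ketorolac"])        # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))          # new rows are randomly initialized
print(tokenizer.convert_tokens_to_ids("ketorolac"))    # now has its own id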

Race condition when prepare pretrained model in distributed training

Hi,

I launched two processes per node to run distributed run_classifier.py. However, I occasionally get the error below:

11/20/2018 09:31:48 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmpa25_y4es to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba

 93%|█████████▎| 381028352/407873900 [00:11<00:01, 14366075.22B/s]
 94%|█████████▍| 383812608/407873900 [00:11<00:01, 16210783.00B/s]
 95%|█████████▍| 386455552/407873900 [00:11<00:01, 16205260.89B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   creating metadata file for /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   removing temp file /tmp/tmpa25_y4es

 95%|█████████▌| 388946944/407873900 [00:11<00:01, 18097539.03B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpvxvnr8_1

 97%|█████████▋| 393660416/407873900 [00:11<00:00, 22199883.93B/s]
 98%|█████████▊| 399411200/407873900 [00:11<00:00, 27211860.00B/s]
 99%|█████████▉| 405128192/407873900 [00:11<00:00, 32287252.94B/s]
100%|██████████| 407873900/407873900 [00:11<00:00, 34098120.40B/s]
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmp5fcm4v8x to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
Traceback (most recent call last):
  File "examples/run_classifier.py", line 629, in <module>
    main()
  File "examples/run_classifier.py", line 485, in main
    model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/site-packages/pytorch_pretrained_bert-0.1.2-py3.6.egg/pytorch_pretrained_bert/modeling.py", line 495, in from_pretrained
    archive.extractall(tempdir)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2007, in extractall
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2049, in extract
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2119, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2168, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 248, in copyfileobj
    buf = src.read(bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

It looks like a race condition where two processes are simultaneously writing the model file to /root/.pytorch_pretrained_bert/.

Please help to advice any workaround. Thanks!

Bug in run_classifier.py

If I am running only evaluation and not training, there are errors as tr_loss and nb_tr_steps are undefined.
