codertimo / bert-pytorch

Google AI 2018 BERT PyTorch implementation

License: Apache License 2.0

Python 99.83% Makefile 0.17%
bert transformer pytorch nlp language-model

bert-pytorch's Introduction

BERT-pytorch


PyTorch implementation of Google AI's 2018 BERT, with simple annotation

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). Paper URL: https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows amazing results on various NLP tasks (new state-of-the-art results on 17 NLP tasks), including surpassing the human F1 score on the SQuAD v1.1 QA task. The paper demonstrated that a Transformer (self-attention) based encoder, with a proper language-model training method, can be a powerful alternative to previous language models. More importantly, it showed that this pre-trained language model can be transferred to any NLP task without a task-specific model architecture.

This amazing result will be recorded in NLP history, and I expect many follow-up papers about BERT to be published very soon.

This repo is an implementation of BERT. The code is simple and easy to understand quickly. Some of the code is based on The Annotated Transformer.

This project is currently a work in progress, and the code has not been verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE: Your corpus should be prepared with two sentences per line, separated by a tab (\t).

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or a tokenized corpus (tokenization is not included in this package)

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n
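For reference, a minimal sketch of producing a corpus file in this format, assuming you already have a list of adjacent sentence pairs (the pair list and output path here are illustrative, not part of the package):

    # Write tab-separated adjacent sentence pairs, one pair per line.
    pairs = [
        ("Welcome to the", "the jungle"),
        ("I can stay", "here all night"),
    ]

    with open("data/corpus.small", "w", encoding="utf-8") as f:
        for first, second in pairs:
            f.write(first + "\t" + second + "\n")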

1. Build a vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

Language Model Pre-training

In the paper, the authors present two new language-model training methods: "masked language model" and "next sentence prediction".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

15% of the input tokens are randomly selected and altered according to the following sub-rules (a minimal sketch follows the list):

  1. 80% of the selected tokens are replaced with the [MASK] token.
  2. 10% are replaced with a [RANDOM] token (another word from the vocabulary).
  3. 10% are left unchanged, but still need to be predicted.
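A minimal sketch of these sub-rules, assuming a simple word-level vocabulary given as a list of strings (illustrative only; the package's actual implementation works on token indices):

    import random

    MASK_TOKEN = "[MASK]"

    def mask_tokens(tokens, vocab):
        # Apply the 15% selection and the 80/10/10 sub-rules to a list of tokens.
        # `vocab` is assumed to be a list of vocabulary words (illustrative only).
        masked, labels = [], []
        for token in tokens:
            if random.random() < 0.15:            # select 15% of tokens for prediction
                labels.append(token)              # the original token is the target
                dice = random.random()            # second draw decides the 80/10/10 split
                if dice < 0.8:
                    masked.append(MASK_TOKEN)     # 80%: replace with [MASK]
                elif dice < 0.9:
                    masked.append(random.choice(vocab))  # 10%: random word
                else:
                    masked.append(token)          # 10%: keep the original token
            else:
                masked.append(token)
                labels.append(None)               # not selected, not predicted
        return masked, labels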

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is not directly captured by language modeling

Rules (a minimal sketch follows the list):

  1. 50% of the time, the next sentence is the actual continuous (following) sentence.
  2. 50% of the time, the next sentence is a randomly chosen, unrelated sentence.
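A minimal sketch of the sentence-pair sampling, assuming the corpus is already loaded as a list of (sentence_a, sentence_b) tuples where sentence_b actually follows sentence_a (illustrative only, not the package's exact code):

    import random

    def sample_sentence_pair(corpus_pairs):
        # Returns (sentence_a, sentence_b, is_next_label) for next sentence prediction.
        sentence_a, next_sentence = random.choice(corpus_pairs)
        if random.random() < 0.5:
            return sentence_a, next_sentence, 1          # IsNext
        # otherwise pair sentence_a with a randomly drawn, unrelated sentence
        _, random_sentence = random.choice(corpus_pairs)
        return sentence_a, random_sentence, 0            # NotNext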

Author

Junseong Kim, Scatter Lab ([email protected] / [email protected])

License

This project follows the Apache 2.0 License, as written in the LICENSE file.

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : The Annotated Transformer

bert-pytorch's People

Contributors

0xflotus, artemisart, codertimo, jeonsworld, zhupengjia


bert-pytorch's Issues

Very low GPU usage when training on 8 GPUs in a single machine

Hi, I am currently pretraining BERT on my own data, using the alpha0.0.1a5 branch (newest version).
I found that only about 20% of the GPUs are in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with pytorch. Does anyone know why?

the format of input

You mentioned that

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

and gave an example:

Welcome to the \t the jungle\n
I can stay \t here all night\n

However, the example actually has ONE sentence per line.
Should it be:

Welcome to the jungle \t I can stay here all night\n

(suppose these two sentences are continuous in the broader context)

Is it possible to train BERT?

Is it possible to achieve the same result as the paper in a short time?
Well... I don't have enough GPUs and computation power to get results comparable to Google AI's.

If we can't train on the full corpus like Google did, then how can we prove that this code is correct?
Training on a 256M-size corpus without Google AI-class GPU computation is nearly impossible for me.

If you have any ideas (e.g. reducing the model size), please let me know!

Question about random sampling.

prob = random.random()
if prob < 0.15:
    # 80% randomly change token to make token
    if prob < prob * 0.8:
        tokens[i] = self.vocab.mask_index
    # 10% randomly change token to random token
    elif prob * 0.8 <= prob < prob * 0.9:
        tokens[i] = random.randrange(len(self.vocab))
    # 10% randomly change token to current token
    elif prob >= prob * 0.9:
        tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
    output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

Well, it seems random.random() always returns a positive number, so prob >= prob * 0.9 will always be true?
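One way to fix this (a sketch against the same loop body, not necessarily the maintainer's fix) is to draw a second random number so the 80/10/10 split is independent of the 15% selection:

    prob = random.random()
    if prob < 0.15:
        dice = random.random()  # independent draw for the 80/10/10 split
        if dice < 0.8:
            # 80%: replace with the mask token
            tokens[i] = self.vocab.mask_index
        elif dice < 0.9:
            # 10%: replace with a random token
            tokens[i] = random.randrange(len(self.vocab))
        else:
            # 10%: keep the original token (it must still be predicted)
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
        output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))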

Single Sentence Input support

In the paper, they note that they optionally use single sentence input for some classification tasks. I'll try to take a look at doing it myself, as it looks like it is not currently supported.

Making Book Corpus

I am building the same corpus as the original paper. Please share your tips for preprocessing and downloading the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.

how to test the model?

Could you please give some example code to load a pre-trained model (i.e. the bert.model.ep0 files)?
The code might take me a while to understand, so I would really appreciate it if you could help me.

How to embed the segment label

Thanks for your code, which helped me learn more details of this paper. But I can't understand segment.py. You haven't explained how the segment label is embedded.
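For context, a segment label is usually embedded with a small lookup table whose output is added to the token and position embeddings; a minimal sketch (the shapes and padding index are assumptions, not necessarily exactly what segment.py does):

    import torch
    import torch.nn as nn

    class SegmentEmbedding(nn.Embedding):
        # 3 rows: 0 = padding, 1 = sentence A, 2 = sentence B
        def __init__(self, embed_size=768):
            super().__init__(3, embed_size, padding_idx=0)

    # segment_label has shape (batch, seq_len) with values in {0, 1, 2}
    segment_label = torch.tensor([[1, 1, 1, 2, 2, 0]])
    seg_emb = SegmentEmbedding(embed_size=768)(segment_label)  # (1, 6, 768)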

Mask language model loss

Hi,
Thank you for your clean code on BERT. I have a question about the masked-LM loss after reading your code. Your program computes a masked-language-model loss on both positive sentence pairs and negative pairs.

Does it make sense to compute the masked-LM loss on negative sentence pairs? I am not sure how Google computes this loss.

Vocab: \t replaced by blank issue

When the corpus is:
how are you \t nice to meet you
and the bert-vocab command is applied, the output vocab is
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to'].
But when the corpus is changed to
how are you\tnice to meet you, the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice'], and the last token becomes younice.
A blank is needed on both sides of the \t.
This may not be a bug.

Example of Input Data

Could you give a concrete example of the input data? You gave an example of the corpus data, but not the dataset.small file found in this line:

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

If you could show perhaps a couple of examples, that would be very helpful! I am new to pytorch, so the dataloader function is a little confusing.

Tensor transform question in pretrain.py

There is a line like below in pretrain.py,

mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

I ran it and found that "mask_lm_output" has shape (batch_size, input_length, vocab_size) and "data["bert_label"]" has shape (batch_size, input_length). If it is transposed as above, does it make sense? I am confused.
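For reference, nn.NLLLoss expects its input as (batch, num_classes, ...) and its target as (batch, ...), so moving the vocabulary dimension to position 1 is exactly what the criterion requires; a small shape check with illustrative sizes:

    import torch
    import torch.nn as nn

    batch_size, seq_len, vocab_size = 2, 5, 100
    mask_lm_output = torch.randn(batch_size, seq_len, vocab_size).log_softmax(dim=-1)
    bert_label = torch.randint(0, vocab_size, (batch_size, seq_len))

    criterion = nn.NLLLoss(ignore_index=0)
    # input (batch, vocab, seq_len) vs. target (batch, seq_len)
    loss = criterion(mask_lm_output.transpose(1, 2), bert_label)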

Maybe Bugs?

In pretrain.py, in the save() method, I guess self.bert.to(self.device) should be removed, right?

Pretrained model transfer to pytorch

As you all know, it's nearly impossible to train from scratch because of the lack of computation power. So I'm going to implement transfer code so that a pretrained model can be used with PyTorch too.

This implementation will start when Google releases their official BERT code and pretrained model. If anyone is interested in joining this work, please leave a comment below.

Thank you everyone who is carefully watching this project 👍
By Junseong Kim

self.d_k = d_model // h gives 64 dimension ?

It looks like self.d_k = d_model // h ---> embed size 768 divided by the number of heads 12 = 64

        self.d_k = d_model // h # 64
        self.h = h # 12 heads
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

Why convert the 768-dimensional [q, k, v] into 64-dimensional embeddings?

Reference:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
I put some comments on the shapes:

class MultiHeadedAttention(nn.Module): # d_model=512, h=8
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h # 512//8=64
        self.h = h # 8
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k) # (nbatches, -1, 512)
        return self.linears[-1](x)

Why doesn't the counter in data_iter increase?

I am currently playing around with training and testing the model. However, as I implemented the test section, I noticed that during LM training the counter doesn't increase when looping over data_iter in pretrain.py. This would cause problems when calculating the average loss/accuracy, wouldn't it?


Making Wikipedia Corpus

I am building the same corpus as the original paper. Please share your tips for preprocessing and downloading the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.

Attention maybe changed

Hi, thanks for your great work.
I wonder whether the attention mechanism in your code has been changed.
The shape of the attention vector should be (batch, timestep, timestep), but according to your code, the shape of the self-attention vector is (batch, timestep, hidden_size). Below is new code that I fixed. Please review it; I would appreciate your comments. Thank you.

    class Attention(nn.Module):
        def __init__(self, num_hidden, h=8):
            super(Attention, self).__init__()

            self.num_hidden_per_attn = num_hidden // h
            self.h = h

            self.key = nn.Linear(num_hidden, num_hidden)
            self.value = nn.Linear(num_hidden, num_hidden)
            self.query = nn.Linear(num_hidden, num_hidden)

            self.layer_norm_1 = LayerNorm(num_hidden)
            self.layer_norm_2 = LayerNorm(num_hidden)
            self.out_linear = nn.Linear(num_hidden, num_hidden)

            self.dropout = nn.Dropout(p=0.1)

        def forward(self, input_):
            batch_size = input_.size(0)

            key = F.relu(self.key(input_))
            value = F.relu(self.value(input_))
            query = F.relu(self.query(input_))

            key, value, query = list(map(lambda x: x.view(batch_size, -1, self.h, self.num_hidden_per_attn), (key, value, query)))
            params = [(key[:, :, i, :], value[:, :, i, :], query[:, :, i, :]) for i in range(self.h)]

            _attn = list(map(self._multihead, params))
            attn = list(map(lambda x: x[0], _attn))
            probs = list(map(lambda x: x[1], _attn))
            result = t.cat(attn, -1)

            result = self.dropout(result)
            result = result.view(batch_size, -1, self.h * self.num_hidden_per_attn)

            # residual connection
            result = self.layer_norm_1(F.relu(input_ + result))

            out = self.out_linear(result)
            out = self.layer_norm_2(F.relu(result + out))

            return result, probs

        def _multihead(self, params):
            key, value, query = params[0], params[1], params[2]

            attn = t.bmm(query, key.transpose(1, 2)) / math.sqrt(key.shape[-1])

            attn = F.softmax(attn, dim=-1)
            result = t.bmm(attn, value)

            return result, attn

model/embedding/position.py

div_term = (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
should be:
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

In [51]: (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
...:
Out[51]:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Additional question:
I don't quite understand how the "bidirectional" Transformer in the original paper is implemented. Maybe like a BiLSTM, concatenating the two directions' Transformer outputs together? I didn't find a similar structure in your code.

readme has wrong commands

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

should be

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

according to

bert-vocab -c data/corpus.small -o data/vocab.small

The LayerNorm implementation

I am wondering why you don't use the standard nn version of LayerNorm?
I notice the difference is in the denominator: nn.LayerNorm uses sqrt(variance + epsilon) rather than (standard deviation + epsilon).

Could you clarify these 2 approaches?
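For illustration, here is a small comparison of the two denominators (a sketch; the custom class mirrors an annotated-Transformer-style LayerNorm, not necessarily this repository's exact code):

    import torch
    import torch.nn as nn

    class LayerNormStdEps(nn.Module):
        # Variant with (std + eps) in the denominator.
        def __init__(self, features, eps=1e-6):
            super().__init__()
            self.a_2 = nn.Parameter(torch.ones(features))
            self.b_2 = nn.Parameter(torch.zeros(features))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(-1, keepdim=True)
            std = x.std(-1, keepdim=True)
            return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

    x = torch.randn(2, 4, 8)
    out_custom = LayerNormStdEps(8)(x)
    out_torch = nn.LayerNorm(8, eps=1e-6)(x)   # uses sqrt(var + eps) with biased variance
    # The outputs differ slightly: eps placement plus unbiased vs. biased std.
    print((out_custom - out_torch).abs().max())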

A question about the implementation of the learning rate

Nice implementation! However, I have a question about the learning rate. The learning-rate schedule from the original Transformer is a warm-up schedule, but your implementation is just a simple decay. Could you implement it in your BERT code?
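For reference, the schedule from the original Transformer ("Noam" warm-up followed by inverse-square-root decay) can be sketched like this; the hyperparameters and the use of LambdaLR are illustrative, not the repository's code:

    import torch

    def noam_lr(step, d_model=768, warmup_steps=10000):
        # Linear warm-up, then decay proportional to 1/sqrt(step).
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    model = torch.nn.Linear(10, 10)  # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    # call scheduler.step() once per training step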

Erroneous code

if prob < prob * 0.8:
    tokens[i] = self.vocab.mask_index
# 10% randomly change token to random token
elif prob * 0.8 <= prob < prob * 0.9:
    tokens[i] = random.randrange(len(self.vocab))
# 10% randomly change token to current token
elif prob >= prob * 0.9:
    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

This code is incorrect: it will always fall into the last elif clause. For instance, prob < prob * 0.8 is never true.

segmentation fault

When the Dataset class in the code is modified, an error occurs. I tried to debug it; the exception is raised at the line loss.backward().

PositionalEmbedding

The position embedding in BERT is not the same as in the Transformer. Why not use the form from BERT?

Question about the loss of Masked LM

Thank you very much for this great contribution.
I found that the masked-LM loss stops decreasing when it reaches a value around 7. However, in the official TensorFlow implementation, the MLM loss easily decreases to 1. I think something went wrong in your implementation.
In addition, I found the code cannot predict the next sentence correctly. I think the reason is: self.criterion = nn.NLLLoss(ignore_index=0). It cannot be used as the criterion for sentence prediction because the sentence label is 1 or 0; we should remove ignore_index=0 for sentence prediction (see the sketch below).
I am looking forward to your reply~
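One way to address the second point (a sketch, not the repository's code; the variable names in the final comment are illustrative) is to use separate criteria for the two tasks:

    import torch.nn as nn

    # Masked-LM loss: ignore positions labelled 0 (unmasked / padding tokens).
    mask_lm_criterion = nn.NLLLoss(ignore_index=0)

    # Next-sentence loss: labels are 0 (NotNext) or 1 (IsNext), so 0 must NOT be ignored.
    next_sentence_criterion = nn.NLLLoss()

    # loss = next_sentence_criterion(next_sent_output, is_next_label) \
    #      + mask_lm_criterion(mask_lm_output.transpose(1, 2), bert_label)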

Tie the input and output embedding?

I think it's reasonable to tie the input and output embeddings, especially the output embedding for each token. But I still can't find a way to do this. Can anyone give an idea?
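For what it's worth, weight tying can be sketched by pointing the output projection's weight at the input embedding matrix (the module names here are illustrative, not the repository's):

    import torch.nn as nn

    vocab_size, hidden = 30000, 768

    token_embedding = nn.Embedding(vocab_size, hidden)
    output_projection = nn.Linear(hidden, vocab_size, bias=False)

    # Both modules now share the same (vocab_size, hidden) weight matrix.
    output_projection.weight = token_embedding.weight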

When training the masked LM, are the unmasked words (with label 0) trained together with the masked words?

According to the code

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # 80% randomly change token to make token
                if prob < prob * 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10% randomly change token to random token
                elif prob * 0.8 <= prob < prob * 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10% randomly change token to current token
                elif prob >= prob * 0.9:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

Do we need to exclude the unmasked words when training the LM?
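For reference, the unmasked positions receive label 0, so they can be excluded from the loss by constructing the criterion with ignore_index=0 (as the nn.NLLLoss(ignore_index=0) line quoted in another issue suggests); a small illustration:

    import torch
    import torch.nn as nn

    criterion = nn.NLLLoss(ignore_index=0)

    # log-probabilities for a 3-token sequence over a 5-word vocab (batch of 1)
    log_probs = torch.randn(1, 5, 3).log_softmax(dim=1)
    labels = torch.tensor([[4, 0, 2]])  # position 1 has label 0 and is ignored

    loss = criterion(log_probs, labels)  # averaged only over positions 0 and 2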

[BERT] Cannot import bert

I have problems importing bert when following http://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

(mxnet_p36) [ec2-user@master ~]$ ipython
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import warnings
   ...: warnings.filterwarnings('ignore')
   ...:
   ...: import random
   ...: import numpy as np
   ...: import mxnet as mx
   ...: from mxnet import gluon
   ...: import gluonnlp as nlp
   ...:
   ...:


In [2]:

In [2]: np.random.seed(100)
   ...: random.seed(100)
   ...: mx.random.seed(10000)
   ...: ctx = mx.gpu(0)
   ...:
   ...:

In [3]: from bert import *
   ...:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-40b999f3ea6a> in <module>()
----> 1 from bert import *

ModuleNotFoundError: No module named 'bert'

It looks like gluonnlp is successfully installed. Any idea?

(mxnet_p36) [ec2-user@master site-packages]$ ll /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
-rw-rw-r-- 1 ec2-user ec2-user 499320 Dec 28 23:15 /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg

pred_loss decreases fast while avg_acc stays at 50%

I tried to run the code on a small dataset and found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy.

Imbalanced GPU memory usage

Hi,

Nice work on the BERT implementation.

I tried to run your code on 4 V100s and found that the memory usage is imbalanced: the first GPU consumes 2x more memory than the others. Any idea about the reason?

Btw, I think the parameter order in train.py line 64 is incorrect.

How does Next-Sentence Prediction benefit both QA and NLI?

The input to the single bidirectional Transformer for pre-training is the concatenation of two sentences.
I think the single bidirectional Transformer is 'storing' the info of the two sentences.
But in QA and NLI, we have two Transformers, and each Transformer's input is one sentence.

Is there any result?

Is there any result that can be compared to the original paper? Looking forward to your updates.
