codertimo / bert-pytorch

Google AI 2018 BERT PyTorch implementation

License: Apache License 2.0

Python 99.83% Makefile 0.17%
bert transformer pytorch nlp language-model

bert-pytorch's Introduction

BERT-pytorch


PyTorch implementation of Google AI's 2018 BERT, with simple annotation

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). Paper URL: https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows amazing results on various NLP tasks (new state-of-the-art results on 17 NLP tasks), including surpassing the human F1 score on the SQuAD v1.1 QA task. The paper demonstrated that a Transformer (self-attention) based encoder, with a proper language-model training method, can be a powerful alternative to previous language models. More importantly, it showed that this pre-trained language model can be transferred to any NLP task without a task-specific model architecture.

This amazing result will be recorded in NLP history, and I expect many follow-up papers about BERT to be published very soon.

This repo is an implementation of BERT. The code is simple and easy to understand quickly. Some of the code is based on The Annotated Transformer.

This project is currently a work in progress, and the code has not been verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE: Your corpus should be prepared with two sentences per line, separated by a tab (\t).

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or a tokenized corpus (tokenization is not included in this package)

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n
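For reference, a minimal sketch of producing a corpus file in this format, assuming you already have a list of adjacent sentence pairs (the pair list and output path here are illustrative, not part of the package):

    # Write tab-separated adjacent sentence pairs, one pair per line.
    pairs = [
        ("Welcome to the", "the jungle"),
        ("I can stay", "here all night"),
    ]

    with open("data/corpus.small", "w", encoding="utf-8") as f:
        for first, second in pairs:
            f.write(first + "\t" + second + "\n")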

1. Build a vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

Language Model Pre-training

In the paper, the authors present two new language-model training methods: "masked language model" and "next sentence prediction".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

15% of the input tokens are randomly selected and altered according to the following sub-rules (a minimal sketch follows the list):

  1. 80% of the selected tokens are replaced with the [MASK] token.
  2. 10% are replaced with a [RANDOM] token (another word from the vocabulary).
  3. 10% are left unchanged, but still need to be predicted.
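A minimal sketch of these sub-rules, assuming a simple word-level vocabulary given as a list of strings (illustrative only; the package's actual implementation works on token indices):

    import random

    MASK_TOKEN = "[MASK]"

    def mask_tokens(tokens, vocab):
        # Apply the 15% selection and the 80/10/10 sub-rules to a list of tokens.
        # `vocab` is assumed to be a list of vocabulary words (illustrative only).
        masked, labels = [], []
        for token in tokens:
            if random.random() < 0.15:            # select 15% of tokens for prediction
                labels.append(token)              # the original token is the target
                dice = random.random()            # second draw decides the 80/10/10 split
                if dice < 0.8:
                    masked.append(MASK_TOKEN)     # 80%: replace with [MASK]
                elif dice < 0.9:
                    masked.append(random.choice(vocab))  # 10%: random word
                else:
                    masked.append(token)          # 10%: keep the original token
            else:
                masked.append(token)
                labels.append(None)               # not selected, not predicted
        return masked, labels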

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is not directly captured by language modeling

Rules (a minimal sketch follows the list):

  1. 50% of the time, the next sentence is the actual continuous (following) sentence.
  2. 50% of the time, the next sentence is a randomly chosen, unrelated sentence.
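A minimal sketch of the sentence-pair sampling, assuming the corpus is already loaded as a list of (sentence_a, sentence_b) tuples where sentence_b actually follows sentence_a (illustrative only, not the package's exact code):

    import random

    def sample_sentence_pair(corpus_pairs):
        # Returns (sentence_a, sentence_b, is_next_label) for next sentence prediction.
        sentence_a, next_sentence = random.choice(corpus_pairs)
        if random.random() < 0.5:
            return sentence_a, next_sentence, 1          # IsNext
        # otherwise pair sentence_a with a randomly drawn, unrelated sentence
        _, random_sentence = random.choice(corpus_pairs)
        return sentence_a, random_sentence, 0            # NotNext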

Author

Junseong Kim, Scatter Lab ([email protected] / [email protected])

License

This project follows the Apache 2.0 License, as written in the LICENSE file.

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : The Annotated Transformer

bert-pytorch's People

Contributors

0xflotus, artemisart, codertimo, jeonsworld, zhupengjia


bert-pytorch's Issues

Very low GPU usage when training on 8 GPUs in a single machine

Hi, I am currently pretraining BERT on my own data, using the alpha0.0.1a5 branch (newest version).
I found that only about 20% of the GPUs are in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with pytorch. Does anyone know why?

the format of input

You mentioned that

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

and gave an example:

Welcome to the \t the jungle\n
I can stay \t here all night\n

However, the example actually has ONE sentence per line.
Should it be:

Welcome to the jungle \t I can stay here all night\n

(suppose these two sentences are continuous in the broader context)

Is it possible to train BERT?

Is it possible to achieve the same result as the paper in a short time?
Well... I don't have enough GPUs and computation power to get results comparable to Google AI's.

If we can't train on the full corpus like Google did, then how can we prove that this code is correct?
Training on a 256M-size corpus without Google AI-class GPU computation is nearly impossible for me.

If you have any ideas (e.g. reducing the model size), please let me know!

Question about random sampling.

prob = random.random()
if prob < 0.15:
    # 80% randomly change token to make token
    if prob < prob * 0.8:
        tokens[i] = self.vocab.mask_index
    # 10% randomly change token to random token
    elif prob * 0.8 <= prob < prob * 0.9:
        tokens[i] = random.randrange(len(self.vocab))
    # 10% randomly change token to current token
    elif prob >= prob * 0.9:
        tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
    output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

Well, it seems random.random() always returns a positive number, so prob >= prob * 0.9 will always be true?
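One way to fix this (a sketch against the same loop body, not necessarily the maintainer's fix) is to draw a second random number so the 80/10/10 split is independent of the 15% selection:

    prob = random.random()
    if prob < 0.15:
        dice = random.random()  # independent draw for the 80/10/10 split
        if dice < 0.8:
            # 80%: replace with the mask token
            tokens[i] = self.vocab.mask_index
        elif dice < 0.9:
            # 10%: replace with a random token
            tokens[i] = random.randrange(len(self.vocab))
        else:
            # 10%: keep the original token (it must still be predicted)
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
        output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))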

Single Sentence Input support

In the paper, they note that they optionally use single sentence input for some classification tasks. I'll try to take a look at doing it myself, as it looks like it is not currently supported.

Making Book Corpus

I am building the same corpus as the original paper. Please share your tips for preprocessing and downloading the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.

how to test the model?

Could you please give some example code to load a pre-trained model (i.e. the bert.model.ep0 files)?
The code might take me a while to understand, so I would really appreciate it if you could help me.

How to embed the segment label

Thanks for your code, which helped me learn more details of this paper. But I can't understand segment.py. You haven't explained how the segment label is embedded.
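For context, a segment label is usually embedded with a small lookup table whose output is added to the token and position embeddings; a minimal sketch (the shapes and padding index are assumptions, not necessarily exactly what segment.py does):

    import torch
    import torch.nn as nn

    class SegmentEmbedding(nn.Embedding):
        # 3 rows: 0 = padding, 1 = sentence A, 2 = sentence B
        def __init__(self, embed_size=768):
            super().__init__(3, embed_size, padding_idx=0)

    # segment_label has shape (batch, seq_len) with values in {0, 1, 2}
    segment_label = torch.tensor([[1, 1, 1, 2, 2, 0]])
    seg_emb = SegmentEmbedding(embed_size=768)(segment_label)  # (1, 6, 768)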

Mask language model loss

Hi,
Thank you for your clean code on BERT. I have a question about the masked-LM loss after reading your code. Your program computes a masked-language-model loss on both positive sentence pairs and negative pairs.

Does it make sense to compute the masked-LM loss on negative sentence pairs? I am not sure how Google computes this loss.

Vocab: \t replaced by blank issue

When the corpus is:
how are you \t nice to meet you
and the bert-vocab command is applied, the output vocab is
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to'].
But when the corpus is changed to
how are you\tnice to meet you, the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice'], and the last token becomes younice.
A blank is needed on both sides of the \t.
This may not be a bug.

Example of Input Data

Could you give a concrete example of the input data? You gave an example of the corpus data, but not the dataset.small file found in this line:

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

If you could show perhaps a couple of examples, that would be very helpful! I am new to pytorch, so the dataloader function is a little confusing.

Tensor transform question in pretrain.py

There is a line like below in pretrain.py,

mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

I ran it and found that "mask_lm_output" has shape (batch_size, input_length, vocab_size) and "data["bert_label"]" has shape (batch_size, input_length). If it is transposed as above, does it make sense? I am confused.
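For reference, nn.NLLLoss expects its input as (batch, num_classes, ...) and its target as (batch, ...), so moving the vocabulary dimension to position 1 is exactly what the criterion requires; a small shape check with illustrative sizes:

    import torch
    import torch.nn as nn

    batch_size, seq_len, vocab_size = 2, 5, 100
    mask_lm_output = torch.randn(batch_size, seq_len, vocab_size).log_softmax(dim=-1)
    bert_label = torch.randint(0, vocab_size, (batch_size, seq_len))

    criterion = nn.NLLLoss(ignore_index=0)
    # input (batch, vocab, seq_len) vs. target (batch, seq_len)
    loss = criterion(mask_lm_output.transpose(1, 2), bert_label)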

Maybe Bugs?

In pretrain.py, in the save() method, I guess self.bert.to(self.device) should be removed, right?

Pretrained model transfer to pytorch

As you all know, it's nearly impossible to train from scratch because of the lack of computation power. So I'm going to implement transfer code so that a pretrained model can be used with PyTorch too.

This implementation will start when Google releases their official BERT code and pretrained model. If anyone is interested in joining this work, please leave a comment below.

Thank you everyone who is carefully watching this project 👍
By Junseong Kim

self.d_k = d_model // h gives 64 dimension ?

It looks like self.d_k = d_model // h ---> embed size 768 divided by the number of heads 12 = 64

        self.d_k = d_model // h # 64
        self.h = h # 12 heads
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

Why convert the 768-dimensional [q, k, v] into 64-dimensional embeddings?

Reference:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
I put some comments on the shapes:

class MultiHeadedAttention(nn.Module): # d_model=512, h=8
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h # 512//8=64
        self.h = h # 8
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k) # (nbatches, -1, 512)
        return self.linears[-1](x)

Why doesn't the counter in data_iter increase?

I am currently playing around with training and testing the model. However, as I implemented the test section, I noticed that during LM training the counter doesn't increase when looping over data_iter in pretrain.py. This would cause problems when calculating the average loss/accuracy, wouldn't it?


Making Wikipedia Corpus

I am building the same corpus as the original paper. Please share your tips for preprocessing and downloading the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.

Attention maybe changed

Hi, thanks for your great work.
I wonder whether the attention mechanism in your code has been changed.
The shape of the attention vector should be (batch, timestep, timestep), but according to your code, the shape of the self-attention vector is (batch, timestep, hidden_size). Below is new code that I fixed. Please review it; I would appreciate your comments. Thank you.

    class Attention(nn.Module):
        def __init__(self, num_hidden, h=8):
            super(Attention, self).__init__()

            self.num_hidden_per_attn = num_hidden // h
            self.h = h

            self.key = nn.Linear(num_hidden, num_hidden)
            self.value = nn.Linear(num_hidden, num_hidden)
            self.query = nn.Linear(num_hidden, num_hidden)

            self.layer_norm_1 = LayerNorm(num_hidden)
            self.layer_norm_2 = LayerNorm(num_hidden)
            self.out_linear = nn.Linear(num_hidden, num_hidden)

            self.dropout = nn.Dropout(p=0.1)

        def forward(self, input_):
            batch_size = input_.size(0)

            key = F.relu(self.key(input_))
            value = F.relu(self.value(input_))
            query = F.relu(self.query(input_))

            key, value, query = list(map(lambda x: x.view(batch_size, -1, self.h, self.num_hidden_per_attn), (key, value, query)))
            params = [(key[:, :, i, :], value[:, :, i, :], query[:, :, i, :]) for i in range(self.h)]

            _attn = list(map(self._multihead, params))
            attn = list(map(lambda x: x[0], _attn))
            probs = list(map(lambda x: x[1], _attn))
            result = t.cat(attn, -1)

            result = self.dropout(result)
            result = result.view(batch_size, -1, self.h * self.num_hidden_per_attn)

            # residual connection
            result = self.layer_norm_1(F.relu(input_ + result))

            out = self.out_linear(result)
            out = self.layer_norm_2(F.relu(result + out))

            return result, probs

        def _multihead(self, params):
            key, value, query = params[0], params[1], params[2]

            attn = t.bmm(query, key.transpose(1, 2)) / math.sqrt(key.shape[-1])

            attn = F.softmax(attn, dim=-1)
            result = t.bmm(attn, value)

            return result, attn

model/embedding/position.py

div_term = (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
should be:
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

In [51]: (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
...:
Out[51]:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Additional question:
I don't quite understand how the "bidirectional" Transformer in the original paper is implemented. Maybe like a BiLSTM, concatenating the two directions' Transformer outputs together? I didn't find a similar structure in your code.

readme has wrong commands

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

should be

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

according to

bert-vocab -c data/corpus.small -o data/vocab.small

The LayerNorm implementation

I am wondering why you don't use the standard nn version of LayerNorm?
I notice the difference is in the denominator: nn.LayerNorm uses sqrt(variance + epsilon) rather than (standard deviation + epsilon).

Could you clarify these 2 approaches?
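For illustration, here is a small comparison of the two denominators (a sketch; the custom class mirrors an annotated-Transformer-style LayerNorm, not necessarily this repository's exact code):

    import torch
    import torch.nn as nn

    class LayerNormStdEps(nn.Module):
        # Variant with (std + eps) in the denominator.
        def __init__(self, features, eps=1e-6):
            super().__init__()
            self.a_2 = nn.Parameter(torch.ones(features))
            self.b_2 = nn.Parameter(torch.zeros(features))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(-1, keepdim=True)
            std = x.std(-1, keepdim=True)
            return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

    x = torch.randn(2, 4, 8)
    out_custom = LayerNormStdEps(8)(x)
    out_torch = nn.LayerNorm(8, eps=1e-6)(x)   # uses sqrt(var + eps) with biased variance
    # The outputs differ slightly: eps placement plus unbiased vs. biased std.
    print((out_custom - out_torch).abs().max())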

A question about the implementation of the learning rate

Nice implementation! However, I have a question about the learning rate. The learning-rate schedule from the original Transformer is a warm-up schedule, but your implementation is just a simple decay. Could you implement it in your BERT code?
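For reference, the schedule from the original Transformer ("Noam" warm-up followed by inverse-square-root decay) can be sketched like this; the hyperparameters and the use of LambdaLR are illustrative, not the repository's code:

    import torch

    def noam_lr(step, d_model=768, warmup_steps=10000):
        # Linear warm-up, then decay proportional to 1/sqrt(step).
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    model = torch.nn.Linear(10, 10)  # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    # call scheduler.step() once per training step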

Erroneous code

if prob < prob * 0.8:
    tokens[i] = self.vocab.mask_index
# 10% randomly change token to random token
elif prob * 0.8 <= prob < prob * 0.9:
    tokens[i] = random.randrange(len(self.vocab))
# 10% randomly change token to current token
elif prob >= prob * 0.9:
    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

This code is incorrect: it will always fall into the last elif clause. For instance, prob < prob * 0.8 is never true.

segmentation fault

When the Dataset class in the code is modified, an error occurs. I tried to debug it; the exception is raised at the line loss.backward().

PositionalEmbedding

The position embedding in BERT is not the same as in the Transformer. Why not use the form from BERT?

Question about the loss of Masked LM

Thank you very much for this great contribution.
I found that the masked-LM loss stops decreasing when it reaches a value around 7. However, in the official TensorFlow implementation, the MLM loss easily decreases to 1. I think something went wrong in your implementation.
In addition, I found the code cannot predict the next sentence correctly. I think the reason is: self.criterion = nn.NLLLoss(ignore_index=0). It cannot be used as the criterion for sentence prediction because the sentence label is 1 or 0; we should remove ignore_index=0 for sentence prediction (see the sketch below).
I am looking forward to your reply~
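One way to address the second point (a sketch, not the repository's code; the variable names in the final comment are illustrative) is to use separate criteria for the two tasks:

    import torch.nn as nn

    # Masked-LM loss: ignore positions labelled 0 (unmasked / padding tokens).
    mask_lm_criterion = nn.NLLLoss(ignore_index=0)

    # Next-sentence loss: labels are 0 (NotNext) or 1 (IsNext), so 0 must NOT be ignored.
    next_sentence_criterion = nn.NLLLoss()

    # loss = next_sentence_criterion(next_sent_output, is_next_label) \
    #      + mask_lm_criterion(mask_lm_output.transpose(1, 2), bert_label)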

Tie the input and output embedding?

I think it's reasonable to tie the input and output embeddings, especially the output embedding for each token. But I still can't find a way to do this. Can anyone give an idea?
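For what it's worth, weight tying can be sketched by pointing the output projection's weight at the input embedding matrix (the module names here are illustrative, not the repository's):

    import torch.nn as nn

    vocab_size, hidden = 30000, 768

    token_embedding = nn.Embedding(vocab_size, hidden)
    output_projection = nn.Linear(hidden, vocab_size, bias=False)

    # Both modules now share the same (vocab_size, hidden) weight matrix.
    output_projection.weight = token_embedding.weight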

When training the masked LM, are the unmasked words (with label 0) trained together with the masked words?

According to the code

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # 80% randomly change token to make token
                if prob < prob * 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10% randomly change token to random token
                elif prob * 0.8 <= prob < prob * 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10% randomly change token to current token
                elif prob >= prob * 0.9:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

Do we need to exclude the unmasked words when training the LM?
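For reference, the unmasked positions receive label 0, so they can be excluded from the loss by constructing the criterion with ignore_index=0 (as the nn.NLLLoss(ignore_index=0) line quoted in another issue suggests); a small illustration:

    import torch
    import torch.nn as nn

    criterion = nn.NLLLoss(ignore_index=0)

    # log-probabilities for a 3-token sequence over a 5-word vocab (batch of 1)
    log_probs = torch.randn(1, 5, 3).log_softmax(dim=1)
    labels = torch.tensor([[4, 0, 2]])  # position 1 has label 0 and is ignored

    loss = criterion(log_probs, labels)  # averaged only over positions 0 and 2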

[BERT] Cannot import bert

I have problems importing bert when following http://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

(mxnet_p36) [ec2-user@master ~]$ ipython
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import warnings
   ...: warnings.filterwarnings('ignore')
   ...:
   ...: import random
   ...: import numpy as np
   ...: import mxnet as mx
   ...: from mxnet import gluon
   ...: import gluonnlp as nlp
   ...:
   ...:


In [2]:

In [2]: np.random.seed(100)
   ...: random.seed(100)
   ...: mx.random.seed(10000)
   ...: ctx = mx.gpu(0)
   ...:
   ...:

In [3]: from bert import *
   ...:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-40b999f3ea6a> in <module>()
----> 1 from bert import *

ModuleNotFoundError: No module named 'bert'

It looks like gluonnlp is successfully installed. Any idea?

(mxnet_p36) [ec2-user@master site-packages]$ ll /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
-rw-rw-r-- 1 ec2-user ec2-user 499320 Dec 28 23:15 /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg

pred_loss decreases fast while avg_acc stays at 50%

I tried to run the code on a small dataset and found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy.

Imbalanced GPU memory usage

Hi,

Nice work on the BERT implementation.

I tried to run your code on 4 V100s and found that the memory usage is imbalanced: the first GPU consumes 2x more memory than the others. Any idea about the reason?

Btw, I think the parameter order in train.py line 64 is incorrect.

How does Next-Sentence Prediction benefit both QA and NLI?

The input to the single bidirectional Transformer for pre-training is the concatenation of two sentences.
I think the single bidirectional Transformer is 'storing' the info of the two sentences.
But in QA and NLI, we have two Transformers, and each Transformer's input is one sentence.

Is there any result?

Is there any result that can be compared to the original paper? Looking forward to your updates.
