
graykode / nlp-tutorial

Natural Language Processing Tutorial for Deep Learning Researchers

Home Page: https://www.reddit.com/r/MachineLearning/comments/amfinl/project_nlptutoral_repository_who_is_studying/

License: MIT License

Languages: Python 38.97%, Jupyter Notebook 61.03%
Topics: nlp, natural-language-processing, tutorial, pytorch, tensorflow, transformer, attention, paper, bert

nlp-tutorial's Introduction

nlp-tutorial

nlp-tutorial is a tutorial for those who are studying NLP (Natural Language Processing) using PyTorch. Most of the models are implemented in fewer than 100 lines of code (excluding comments and blank lines).

  • [08-14-2020] Old TensorFlow v1 code is archived in the archive folder. To keep the code readable for beginners, only PyTorch version 1.0 or higher is supported.

Curriculum - (Example Purpose)

1. Basic Embedding Model

2. CNN(Convolutional Neural Network)

3. RNN(Recurrent Neural Network)

4. Attention Mechanism

5. Model based on Transformer

Dependencies

  • Python 3.5+
  • PyTorch 1.0.0+

Author

  • Tae Hwan Jung(Jeff Jung) @graykode
  • Author Email : [email protected]
  • Acknowledgements to mojitok for the NLP Research Internship.


nlp-tutorial's Issues

seq2seq_torch may have a small mistake

# output : [max_len+1, batch_size, num_directions(=1) * n_hidden]
    output = output.transpose(0, 1) # [batch_size, max_len+1(=6), num_directions(=1) * n_hidden]

to

# output : [max_len+1, batch_size, n_class]
    output = output.transpose(0, 1) # [batch_size, max_len+1(=6), n_class]

Why is it src_len+1 in the Transformer demo?

self.pos_emb = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(src_len+1, d_model),freeze=True)

The position encoding table should be (max_len, d_model), so why add 1?
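For context, a minimal sketch of how such a sinusoidal table is typically built (my own illustration, not the repository's exact get_sinusoid_encoding_table; whether the extra row is meant as a reserved padding position is an assumption on my part):

    import numpy as np
    import torch

    def sinusoid_table(n_position, d_model):
        # one row per position; sin on even dimensions, cos on odd dimensions
        pos = np.arange(n_position)[:, None]                     # [n_position, 1]
        dim = np.arange(d_model)[None, :]                        # [1, d_model]
        angle = pos / np.power(10000, 2 * (dim // 2) / d_model)  # [n_position, d_model]
        table = np.zeros((n_position, d_model))
        table[:, 0::2] = np.sin(angle[:, 0::2])
        table[:, 1::2] = np.cos(angle[:, 1::2])
        return torch.FloatTensor(table)

    # with src_len + 1 rows, one extra position index (e.g. 0) can be looked up safely
    pos_emb = torch.nn.Embedding.from_pretrained(sinusoid_table(5 + 1, 8), freeze=True)
    print(pos_emb(torch.LongTensor([[1, 2, 3, 4, 0]])).shape)    # torch.Size([1, 5, 8])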

The Adam in 5-1.Transformer should be replaced by SGD

Line 202 :
optimizer = optim.Adam(model.parameters(), lr=0.001)

In practice, I think Adam works quite badly here: when epoch = 10 the cost is 1.6, and when epoch = 100 or 1000 the cost is still 1.6.
So I think we can change Adam to SGD, that is, optimizer = optim.SGD(model.parameters(), lr=0.001)

Here are the effects of using SGD:

Epoch: 0100 cost = 0.047965
Epoch: 0200 cost = 0.020129
Epoch: 0300 cost = 0.012563
Epoch: 0400 cost = 0.009101
Epoch: 0500 cost = 0.007131
Epoch: 0600 cost = 0.005862
Epoch: 0700 cost = 0.004978
Epoch: 0800 cost = 0.004325
Epoch: 0900 cost = 0.003823
Epoch: 1000 cost = 0.003426

Seq2Seq(Attention) Input Shape Question

Seq2Seq(Attention)\Seq2Seq(Attention)-Tensor.py

The shape of the input should be [max_time, batch_size, ...]. input = tf.transpose(dec_inputs, [1, 0, 2]) has already transposed it. In tf.expand_dims(inputs[i], 1) the expansion does add one dimension, but it seems the expansion should be on dimension 0 here. Although the final shape is correct, is this intentional, or just a little trick?
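For illustration only, a small PyTorch sketch (the repository's file is TensorFlow, but the axis question is the same): expanding dimension 0 versus dimension 1 both add a size-1 axis, just in different places.

    import torch

    x = torch.zeros(4, 8)          # e.g. one decoder step: [batch_size(=4), n_hidden(=8)]
    print(x.unsqueeze(0).shape)    # torch.Size([1, 4, 8]) -> [1, batch_size, n_hidden]
    print(x.unsqueeze(1).shape)    # torch.Size([4, 1, 8]) -> [batch_size, 1, n_hidden]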

Code 4-1.Seq2Seq might have a wrong section

In the function translate (at line 90), there is no predefined object 'args',
and the function make_batch does not expect these arguments, but '[[word, 'P' * len(word)]], args' is passed.

So I think the code should be modified,

from

    def translate(word, args):
        input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]], args)

        # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
        hidden = torch.zeros(1, 1, args.n_hidden)
        output = model(input_batch, hidden, output_batch)
        # output : [max_len+1(=6), batch_size(=1), n_class]

        predict = output.data.max(2, keepdim=True)[1] # select n_class dimension
        decoded = [char_arr[i] for i in predict]
        end = decoded.index('E')
        translated = ''.join(decoded[:end])

        return translated.replace('P', '')

to

# Test
    def translate(word):
        input_batch, output_batch = make_testbatch(word)

        # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
        hidden = torch.zeros(1, 1, n_hidden)
        output = model(input_batch, hidden, output_batch)
        # output : [max_len+1(=6), batch_size(=1), n_class]

        predict = output.data.max(2, keepdim=True)[1] # select n_class dimension
        decoded = [char_arr[i] for i in predict]
        end = decoded.index('E')
        translated = ''.join(decoded[:end])

        return translated.replace('P', '')

and make_testbatch should be declared beforehand:

#make test batch
def make_testbatch(input_word):
    input_batch, output_batch = [], []

    input_w = input_word + 'P' * (n_step - len(input_word))
    input = [num_dic[n] for n in input_w]
    
    #make a sequence with just start token(S) and pad tokens(P)
    output = [num_dic[n] for n in 'S' + 'P' * n_step]

    input_batch = np.eye(n_class)[input]
    output_batch = np.eye(n_class)[output]

    return torch.FloatTensor(input_batch).unsqueeze(0), torch.FloatTensor(output_batch).unsqueeze(0)

Thank you

TextCNN-Tensor.py

Hello,

I think there is a problem with this file; it is identical to TextCNN-Torch.py.

I guess it should be the TensorFlow version?

Thanks anyway for this repo

Some problems about Bert

Line 70: index = randint(0, vocab_size - 1) # random index in vocabulary
I think the replacement index should not be able to produce '[CLS]', '[SEP]' or '[MASK]'!
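A minimal sketch of what the issue suggests (my own illustration; number_dict and the special-token names follow the tutorial's BERT code, the toy vocabulary here is made up): keep re-sampling until the random index is not a special token.

    from random import randint

    def random_ordinary_index(number_dict, vocab_size,
                              special_tokens=('[CLS]', '[SEP]', '[MASK]', '[PAD]')):
        # sample a vocabulary index whose token is not a special token
        index = randint(0, vocab_size - 1)
        while number_dict[index] in special_tokens:
            index = randint(0, vocab_size - 1)
        return index

    # usage with a toy vocabulary
    number_dict = {0: '[PAD]', 1: '[CLS]', 2: '[SEP]', 3: '[MASK]', 4: 'hello', 5: 'world'}
    print(random_ordinary_index(number_dict, vocab_size=len(number_dict)))  # prints 4 or 5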

Seq2Seq pytorch

Hi
thanks for sharing your codes.

I've read your seq2seq implementation and I was wondering about the RNN Encoder-Decoder model.

in the paper, 'Learning Phrase Representations using RNN Encoder–Decoder
for Statistical Machine Translation'

They describe a new hidden-state activation and a proposed gating unit:
[screenshots of the equations from the paper]

and I couldn't find the new hidden-state activation function in your code.

Do you have any plan to add the proposed activation?
Or is it okay to just skip that part?

thank you so much in advance
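For reference, the "new hidden-state activation" the issue refers to is the gated update proposed in that paper (what later became the GRU). A sketch of the equations in standard notation, not the repository's variable names:

    r_t  = sigmoid(W_r x_t + U_r h_{t-1})          # reset gate
    z_t  = sigmoid(W_z x_t + U_z h_{t-1})          # update gate
    h~_t = tanh(W x_t + U (r_t * h_{t-1}))         # candidate hidden state (the new activation)
    h_t  = z_t * h_{t-1} + (1 - z_t) * h~_t        # gated interpolation

In PyTorch this is what torch.nn.GRU / torch.nn.GRUCell implement, so using nn.GRU for the encoder and decoder cells would give the gated activation without writing it by hand.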

Version 2.0 will be updated

Hello. It's been about 2 years since the repository started; thank you for your interest.
Most of the code is now legacy code that current PyTorch or TensorFlow no longer uses, so we want to update it to a new version.

  • PyTorch higher than 1.0.0

There is no plan to support TensorFlow v2, because the Python-like PyTorch is more readable for beginners.
In addition, the philosophies of PyTorch and TensorFlow are very different, and good code cannot be produced by trying to implement one like the other.

Therefore, the existing TensorFlow v1 code will be archived in a new folder.

Which kind of model is better for keyword-set classification?

There is a similar task called text classification.

But I want to find a kind of model whose inputs are a keyword set, and the keyword set does not come from a sentence.

For example:

input ["apple", "pear", "water melon"] --> target class "fruit"
input ["tomato", "potato"] --> target class "vegetable"

Another example:

input ["apple", "Peking", "in summer"]  -->  target class "Chinese fruit"
input ["tomato", "New York", "in winter"]  -->  target class "American vegetable"
input ["apple", "Peking", "in winter"]  -->  target class "Chinese fruit"
input ["tomato", "Peking", "in winter"]  -->  target class "Chinese vegetable"

Thank you.
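For illustration only, a minimal sketch (my own, not from this repository) of a common baseline for set-valued inputs: embed each keyword, average the embeddings so that order does not matter, and classify the pooled vector.

    import torch
    import torch.nn as nn

    class KeywordSetClassifier(nn.Module):
        # toy baseline: mean of keyword embeddings -> linear classifier
        def __init__(self, vocab_size, embed_dim, num_classes):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, embed_dim)
            self.fc = nn.Linear(embed_dim, num_classes)

        def forward(self, keyword_ids):           # keyword_ids: [batch_size, set_size]
            vecs = self.emb(keyword_ids)          # [batch_size, set_size, embed_dim]
            pooled = vecs.mean(dim=1)             # order-independent pooling over the set
            return self.fc(pooled)                # [batch_size, num_classes]

    # usage with a toy vocabulary of 10 keywords and 2 classes
    model = KeywordSetClassifier(vocab_size=10, embed_dim=16, num_classes=2)
    logits = model(torch.tensor([[1, 4, 7], [2, 5, 5]]))
    print(logits.shape)                           # torch.Size([2, 2])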

Transformer/Transformer(Greedy_decoder)-Torch.py on gpu

Hello, I want to put the Transformer(Greedy_decoder)-Torch.py code on the GPU, using model = model.to(device) and also moving input_data with .to(device), but the error still appears: "Expected object of backend CUDA but backend CPU for argument #2 'mat2'".
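Not a diagnosis of this exact file, but the usual cause of this kind of error is that some tensor or layer is still created on the CPU inside forward. A minimal sketch of the pattern that avoids it (my own example, not the repository's model): register layers in __init__ so that model.to(device) moves them, and give run-time tensors the inputs' device.

    import torch
    import torch.nn as nn

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    class TinyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(8, 4)          # registered in __init__, so model.to(device) moves it

        def forward(self, x):
            # any tensor created at run time must be placed on the same device as the inputs
            positions = torch.arange(x.size(1), device=x.device).float().unsqueeze(-1)
            return self.proj(x) + positions

    model = TinyModel().to(device)               # moves every registered parameter
    x = torch.randn(2, 3, 8).to(device)          # moves the inputs as well
    print(model(x).shape, model(x).device)       # torch.Size([2, 3, 4]) on the chosen device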

About training seq2seq(attention)-Torch with multiple samples

Hello, first of all thank you for your code. I want to know how I should modify the code if batch_size is more than 1. Thank you.

    def get_att_weight(self, output, enc_output):  # get attention weight one 'output' with 'enc_output'
        '''
        output: [1, batch_size, num_directions(=1) * n_hidden]
        enc_output: [n_step+1, batch_size, num_directions(=1) * n_hidden]
        '''
        length = len(enc_output)
        attn_scores = torch.zeros(length)  # attn_scores : [batch_size, n_step+1]
        for i in range(length):
            attn_scores[i] = self.get_att_score(output, enc_output[i])

        # Normalize scores to weights in range 0 to 1
        # return [batch_size, 1, n_step+1]
        return F.softmax(attn_scores).view(batch_size, 1, -1)

    def get_att_score(self, output, enc_output):
        '''
        output: [batch_size, num_directions(=1) * n_hidden]
        enc_output: [batch_size, num_directions(=1) * n_hidden]
        '''
        score = self.attn(enc_output)  # score : [1, n_hidden]
        return torch.dot(output.view(-1), score.view(-1))  # inner product make scalar value, get a real number

TextCNN_Torch has a wrong comment

def forward(self, X): embedded_chars = self.W[X] # [batch_size, sequence_length, sequence_length]

I think the shape is [batch_size, sequence_length, embedding_size]

A question about the decoder in seq2seq-torch

input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])

Hi, I'm an NLP rookie and I want to ask you a question. I read the seq2seq paper, which uses the output at step t-1 as the input at step t in the decoder. Your code in this line uses 'SPPPPP' as the decoder input. So, does this approach harm the result?
If you see this issue, please answer me in your free time.
Although my English is poor, I still want to express my gratitude to you.
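For illustration, a hedged sketch (not the repository's code; decoder_step and toy_step are made-up helpers) of the alternative the paper describes: decode one step at a time and feed the previous prediction back in, instead of a fixed 'S' + padding sequence.

    import torch

    def greedy_decode(decoder_step, start_id, end_id, state, max_len=6):
        # feed each predicted token back in as the next input (greedy decoding)
        out, prev = [], start_id
        for _ in range(max_len):
            logits, state = decoder_step(prev, state)  # one decoder step
            prev = int(torch.argmax(logits))           # previous prediction becomes the next input
            if prev == end_id:
                break
            out.append(prev)
        return out

    # toy decoder_step: ignores the state and always predicts (input_id + 1) % 5
    def toy_step(prev, state):
        logits = torch.zeros(5)
        logits[(prev + 1) % 5] = 1.0
        return logits, state

    print(greedy_decode(toy_step, start_id=0, end_id=4, state=None))  # [1, 2, 3]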

A question about seq2seq-torch.py at line 43

Hi, I'm an NLP rookie and I want to ask you a question. Your code extracts the input (context) with a fixed window around line 43, and "word sequence" is a list of sentences, so some words may take their neighbour words from different sentences. Does this harm the result?

And my training result does not seem very good, even though I didn't change the code.
[screenshot of the training result]

If you see this issue, please answer me in your free time.
Although my English is poor, I still want to express my gratitude to you.

A question about seq2seq-torch.py at line 74

I think there is a small problem at line 74 of "seq2seq-torch.py": the dimension of input_batch and output_batch is not [batch_size, max_len, n_hidden] but [batch_size, max_len, n_class]. Or maybe I don't fully understand your code :) Please help me, thanks~

Question about tensor.view operation in Bi-LSTM(Attention)

hidden = final_state.view(-1, n_hidden * 2, 1) # hidden : [batch_size, n_hidden * num_directions(=2), 1(=n_layer)]

Hi, this repo is awesome, but there might be something wrong in the code above. According to the comment, this snippet intends to change a tensor from shape [num_layers(=1) * num_directions(=2), batch_size, n_hidden] to shape [batch_size, n_hidden * num_directions(=2), 1(=n_layer)], i.e. to concatenate the 2 hidden vectors from the two directions for every data example in a batch (by "data example" I mean one of the batch_size examples). But I think the code above will mix up the data examples in a batch and lead to unexpected results.

For example, we can use IPython to check the effect of the snippet above.

# create a tensor with shape [num_layers(=1) * num_directions(=2), batch_size, n_hidden]                                                                                           
In [10]: a=torch.arange(2*3*5).reshape(2,3,5) 
                                                                       
In [11]: a                                                             
Out[11]:                                                               
tensor([[[ 0,  1,  2,  3,  4],                                         
         [ 5,  6,  7,  8,  9],                                         
         [10, 11, 12, 13, 14]],                                        
                                                                       
        [[15, 16, 17, 18, 19],                                         
         [20, 21, 22, 23, 24],                                         
         [25, 26, 27, 28, 29]]])                                       
                                                                       
In [12]: a.view(-1,10,1)                                               
Out[12]:                                                               
tensor([[[ 0],                                                         
         [ 1],                                                         
         [ 2],                                                         
         [ 3],                                                         
         [ 4],                                                         
         [ 5],                                                         
         [ 6],                                                         
         [ 7],                                                         
         [ 8],                                                         
         [ 9]],                                                        
                                                                       
        [[10],                                                         
         [11],                                                         
         [12],                                                         
         [13],                                                         
         [14],                                                         
         [15],                                                         
         [16],                                                         
         [17],                                                         
         [18],                                                         
         [19]],                                                        
                                                                       
        [[20],                                                         
         [21],                                                         
         [22],                                                         
         [23],                                                         
         [24],                                                         
         [25],                                                         
         [26],                                                         
         [27],                                                         
         [28],                                                         
         [29]]])                                                       
                                                                       
                         

As you can see, we created a tensor with batch_size=3 and n_hidden=5. For example, [ 0, 1, 2, 3, 4] and [15, 16, 17, 18, 19] belong to the same data example in the batch but come from different directions, so what we want is to concatenate them in the resulting tensor. But what the code really does is concatenate [ 0, 1, 2, 3, 4] and [ 5, 6, 7, 8, 9], which are from different data examples in the batch.

I think it can be fixed by changing the line of code to hidden = torch.cat([final_state[0], final_state[1]], 1).view(-1, n_hidden * 2, 1).

The effect of the new code can be shown as follows:

In [13]: torch.cat([a[0],a[1]],1).view(-1,10,1)
Out[13]:
tensor([[[ 0],
         [ 1],
         [ 2],
         [ 3],
         [ 4],
         [15],
         [16],
         [17],
         [18],
         [19]],

        [[ 5],
         [ 6],
         [ 7],
         [ 8],
         [ 9],
         [20],
         [21],
         [22],
         [23],
         [24]],

        [[10],
         [11],
         [12],
         [13],
         [14],
         [25],
         [26],
         [27],
         [28],
         [29]]])

A question about Autocomplete LSTM Tensorflow

In Autocomplete we already have

X = tf.placeholder(tf.float32, [None, n_step, n_class]) # [batch_size, n_step, n_class]
Y = tf.placeholder(tf.float32, [None, n_class])         

to guess the next missing character.

  1. How can I customize them to guess more than one character? I don't have any idea about multiplying a tensor by a tensor.
  2. In outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
     Why is the shape of states always (2,)? What does the 2 really mean?
     Thank you for sharing the information.

doubts about the TextCNN Code

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()

        self.num_filters_total = num_filters * len(filter_sizes)
        self.W = nn.Parameter(torch.empty(vocab_size, embedding_size).uniform_(-1, 1)).type(dtype)
        self.Weight = nn.Parameter(torch.empty(self.num_filters_total, num_classes).uniform_(-1, 1)).type(dtype)
        self.Bias = nn.Parameter(0.1 * torch.ones([num_classes])).type(dtype)

    def forward(self, X):
        embedded_chars = self.W[X] # [batch_size, sequence_length, sequence_length]
        embedded_chars = embedded_chars.unsqueeze(1) # add channel(=1) [batch, channel(=1), sequence_length, embedding_size]

        pooled_outputs = []
        for filter_size in filter_sizes:
            # conv : [input_channel(=1), output_channel(=3), (filter_height, filter_width), bias_option]
            conv = nn.Conv2d(1, num_filters, (filter_size, embedding_size), bias=True)(embedded_chars)
            h = F.relu(conv)
            # mp : ((filter_height, filter_width))
            mp = nn.MaxPool2d((sequence_length - filter_size + 1, 1))
            # pooled : [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3)]
            pooled = mp(h).permute(0, 3, 2, 1)
            pooled_outputs.append(pooled)

        h_pool = torch.cat(pooled_outputs, len(filter_sizes)) # [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3) * 3]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total]) # [batch_size(=6), output_height * output_width * (output_channel * 3)]

        model = torch.mm(h_pool_flat, self.Weight) + self.Bias # [batch_size, num_classes]
        return model

I wonder if it's wrong to create conv inside the loop?
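For reference, a hedged sketch (my own, not the repository's file) of the more usual pattern: register the convolutions once in __init__ via nn.ModuleList, so their weights are part of the model and actually get trained, rather than creating new nn.Conv2d objects on every forward pass.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNNSketch(nn.Module):
        def __init__(self, vocab_size=20, embedding_size=8, num_classes=2,
                     filter_sizes=(2, 3, 4), num_filters=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            self.convs = nn.ModuleList(
                nn.Conv2d(1, num_filters, (fs, embedding_size)) for fs in filter_sizes)
            self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

        def forward(self, x):                          # x: [batch_size, sequence_length]
            emb = self.embedding(x).unsqueeze(1)       # [batch, 1, seq_len, emb_size]
            pooled = [F.relu(conv(emb)).max(dim=2)[0].squeeze(2) for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1))   # [batch_size, num_classes]

    model = TextCNNSketch()
    print(model(torch.randint(0, 20, (6, 5))).shape)   # torch.Size([6, 2])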

Some mistakes in Transformer Position Encoding & BERT

1. mistake in Transformer

# Padding Should be Zero
src_vocab = {'P' : 0, 'ich' : 1, 'mochte' : 2, 'ein' : 3, 'bier' : 4}
src_vocab_size = len(src_vocab)

tgt_vocab = {'P' : 0, 'i' : 1, 'want' : 2, 'a' : 3, 'beer' : 4, 'S' : 5, 'E' : 6}
number_dict = {i: w for i, w in enumerate(tgt_vocab)}
tgt_vocab_size = len(tgt_vocab)

I rewrote my code to be clearer.
There were some mistakes in the Transformer's position encoding: because of torch.LongTensor([[1,2,3,4,5]]), the indexing of the position Embedding gets mixed up.

So I fixed the shape of get_sinusoid_encoding_table accordingly.
In the Encoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is right for 'ich mochte ein bier P', and in the Decoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is right for 'S i want a beer'.

2. Too heavy BERT as tutorial

In the original paper, maxlen is 512 and n_layers (the number of layers) is 12, but that is too heavy to run for a tutorial, so I fixed the parameters as below.

# BERT Parameters
maxlen = 30
batch_size = 6
max_pred = 5 # max tokens of prediction
n_layers = 6
n_heads = 12
d_model = 768
d_ff = 768*4 # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

Also, as in other BERT implementation repositories, when preprocessing the masking, [CLS], [SEP] and [PAD] should not be replaced with [MASK].

cand_maked_pos = [i for i, token in enumerate(input_ids)] # this is wrong

https://github.com/dhlee347/pytorchic-bert/blob/master/pretrain.py#L132 does it correctly, so I fixed it accordingly.

Then I added a segment mask so that zero-padding tokens are masked out.
This is a very important problem.
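A toy illustration of the suggested filtering (variable names mirror the tutorial's BERT code; the token ids and the sentence here are made up): candidate positions for [MASK] exclude the special tokens.

    # toy vocabulary and input
    word_dict = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'hello': 4, 'world': 5}
    input_ids = [1, 4, 5, 2, 4, 2, 0, 0]   # [CLS] hello world [SEP] hello [SEP] [PAD] [PAD]

    special_ids = (word_dict['[CLS]'], word_dict['[SEP]'], word_dict['[PAD]'])
    cand_maked_pos = [i for i, token in enumerate(input_ids) if token not in special_ids]
    print(cand_maked_pos)                  # [1, 2, 4] -> only ordinary word positions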

BERT-Torch.py may have a small mistake

Lines 69-70:
index = randint(0, vocab_size - 1) # random index in vocabulary
input_ids[pos] = word_dict[number_dict[index]]
The length of number_dict is 25, but vocab_size is 29, so number_dict[index] might be out of range.
Maybe we should change line 69 to index = randint(0, len(word_list) - 1)?

3-3-bilstm-torch comment error

class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()

        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        self.W = nn.Parameter(torch.randn([n_hidden * 2, n_class]).type(dtype))
        self.b = nn.Parameter(torch.randn([n_class]).type(dtype))

    def forward(self, X):
        input = X.transpose(0, 1)  # input : [n_step, batch_size, n_class]

        hidden_state = Variable(torch.zeros(1*2, len(X), n_hidden))   # [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
        cell_state = Variable(torch.zeros(1*2, len(X), n_hidden))     # [num_layers(=1) * num_directions(=1), batch_size, n_hidden]

        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden]
        model = torch.mm(outputs, self.W) + self.b  # model : [batch_size, n_class]
        return model

error: "outputs = outputs[-1] # [batch_size, n_hidden]"
the shape should be [batch_size,2*n_hidden]

about skip-gram code

I don't quite understand why 'batch_inputs' and 'batch_labels' should be updated in each loop iteration in Word2Vec-Skipgram-Tensor(Softmax).py.

Also, what does 'trained_embeddings = W.eval()' mean?

Could you explain it to me? I am a bit confused.

# code
for epoch in range(5000):
    batch_inputs, batch_labels = random_batch(skip_grams, batch_size)
    _, loss = sess.run([optimizer, cost], feed_dict={inputs: batch_inputs, labels: batch_labels})

    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

    trained_embeddings = W.eval()

Bi-LSTM attention calc may be wrong

lstm_output : [batch_size, n_step, n_hidden * num_directions(=2)], F matrix

def attention_net(self, lstm_output, final_state):
    batch_size = len(lstm_output)
    hidden_forward = final_state[0]
    hidden_backward = final_state[1]
    hidden_f_b = torch.cat((hidden_forward, hidden_backward), 1)
    hidden = hidden_f_b.view(batch_size, -1, 1)
    # hidden = final_state.view(batch_size, -1, 1)
    # ^ this line in the source code is wrong: a Bi-LSTM's final_state has shape [2, batch_size, n_hidden],
    #   so the forward and backward hidden states must be concatenated per example first;
    #   final_state.view(batch_size, -1, 1) does not pair final_state[0][i] with final_state[1][i]

a question about transformer

class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)

    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2) # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2) # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2) # v_s: [batch_size x n_heads x len_k x d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]

        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size x len_q x n_heads * d_v]
        output = nn.Linear(n_heads * d_v, d_model)(context)
        return nn.LayerNorm(d_model)(output + residual), attn # output: [batch_size x len_q x d_model]

The second-to-last line instantiates a new nn.Linear every time forward is called. Is that right? Shouldn't the layer be instantiated in the __init__ function?
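For reference, a minimal sketch (my own, not the repository's code) of the usual pattern: the output projection and LayerNorm are registered once in __init__ and reused in forward, so their weights are trained and moved together with the model.

    import torch
    import torch.nn as nn

    class OutputProjection(nn.Module):
        def __init__(self, n_heads=8, d_v=64, d_model=512):
            super().__init__()
            self.linear = nn.Linear(n_heads * d_v, d_model)   # created once
            self.layer_norm = nn.LayerNorm(d_model)           # created once

        def forward(self, context, residual):
            # context: [batch_size, len_q, n_heads * d_v], residual: [batch_size, len_q, d_model]
            return self.layer_norm(self.linear(context) + residual)

    proj = OutputProjection()
    out = proj(torch.randn(2, 5, 8 * 64), torch.randn(2, 5, 512))
    print(out.shape)                                          # torch.Size([2, 5, 512])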

5-1. Transformer may have a wrong position embedding

  1. In class Encoder: enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(torch.LongTensor([[1,2,3,4,0]]))

I think it should be: enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(torch.LongTensor([[0,1,2,3,4]]))

  2. In class Decoder: dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(torch.LongTensor([[5,1,2,3,4]]))

I think it should be: dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(torch.LongTensor([[0,1,2,3,4]]))

Question?

Is this repository no longer supported?

seq2seq(attention) has a wrong comment

context = tf.matmul(attn_weights, enc_outputs)
dec_output = tf.squeeze(dec_output, 0)  # [1, n_step]
context = tf.squeeze(context, 1)  # [1, n_hidden]

I think dec_output shape is [1,n_hidden]

Bi-LSTM (TF) may have a mistake

Calculating the attention score:

# Attention
outputs = tf.concat([output[0], output[1]], 2) # output[0] : lstm_fw, output[1] : lstm_bw
outputs = tf.transpose(outputs, [1, 0, 2]) # [n_step, batch_size, n_hidden]

# only the output of the last time step is used
final_hidden_state = outputs[-1]
output_all = tf.concat([output[0], output[1]], 2)
final_hidden_state = tf.expand_dims(final_hidden_state, 2)
attn_weights = tf.squeeze(tf.matmul(output_all, final_hidden_state), 2)

3-3.Bi-LSTM may have wrong padding

In line 16 you use
input = input + [0] * (max_len - len(input))
for the padding. You use 0, but index 0 corresponds to the first word 'Lorem', so it is not the right choice.
I think you can change it like this:

    # word_dict = {w: i for i, w in enumerate(list(set(sentence.split())))}
    # number_dict = {i: w for i, w in enumerate(list(set(sentence.split())))}
    word_dict = {w: i for i, w in enumerate(['PAD']+list(set(sentence.split())))}
    number_dict = {i: w for i, w in enumerate(['PAD']+list(set(sentence.split())))}

Faster attention calculation in 4-2.Seq2Seq?

Thanks for sharing! I just found that Attention.get_att_weight calculates attention in a for-loop. This looks rather slow, doesn't it?

4-2.Seq2Seq(Attention)/Seq2Seq(Attention).ipynb

    def get_att_weight(self, dec_output, enc_outputs):  # get attention weight one 'dec_output' with 'enc_outputs'
        n_step = len(enc_outputs)
        attn_scores = torch.zeros(n_step)  # attn_scores : [n_step]

        for i in range(n_step):
            attn_scores[i] = self.get_att_score(dec_output, enc_outputs[i])

        # Normalize scores to weights in range 0 to 1
        return F.softmax(attn_scores).view(1, 1, -1)

    def get_att_score(self, dec_output, enc_output):  # enc_outputs [batch_size, num_directions(=1) * n_hidden]
        score = self.attn(enc_output)  # score : [batch_size, n_hidden]
        return torch.dot(dec_output.view(-1), score.view(-1))  # inner product make scalar value

Suggested parallel version

    def get_att_weight(self, dec_output, enc_outputs):  # get attention weight one 'dec_output' with 'enc_outputs'
        n_step = len(enc_outputs)
        attn_scores = torch.zeros(n_step,device=self.device)  # attn_scores : [n_step]

        enc_t = self.attn(enc_outputs)
        score = dec_output.transpose(1,0).bmm(enc_t.transpose(1,0).transpose(2,1))
        out1   = score.softmax(-1)
        return out1

About make_batch of NNLM

input = [word_dict[n] for n in word[:-1]]  # create (1~n-1) as input
target = [word_dict[word[-1]]]

This constrains the input length to equal n_step. I think the following example is even better:

    for i in range(len(words) - window_size + 1):
        x_train.append(words[i: i + window_size - 1])
        y_train.append(words[i + window_size - 1])

Attention BiLSTM

How is it possible to use the Attention Layer in (4.3) for sequence-to-sequence classification, something like Named Entity Recognition or Semantic Role Labeling?
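Not an answer about the repository's attention layer specifically, but for token-level tasks like NER the usual baseline keeps the per-time-step Bi-LSTM outputs instead of pooling them into one sentence vector; a minimal sketch under that assumption (my own toy sizes):

    import torch
    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        # per-token tags from Bi-LSTM outputs (no pooling into one sentence vector)
        def __init__(self, vocab_size=30, embed_dim=16, n_hidden=8, num_tags=5):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, n_hidden, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(n_hidden * 2, num_tags)

        def forward(self, x):                  # x: [batch_size, seq_len]
            out, _ = self.lstm(self.emb(x))    # out: [batch_size, seq_len, 2 * n_hidden]
            return self.fc(out)                # [batch_size, seq_len, num_tags] -> one tag per token

    model = BiLSTMTagger()
    print(model(torch.randint(0, 30, (2, 6))).shape)   # torch.Size([2, 6, 5])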

different Embedding way

In the code 'Seq2seq-torch.py' I saw you use np.eye, i.e. a one-hot representation, as the embedding, so I changed it to the usual way, using nn.Embedding(dict_length, embedding_dim). It works, but the loss I get is very high.
I want to ask what the difference between these two ways is. Here are my code and the result.

[screenshots of the modified code and the training result]
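For reference, a hedged sketch (my own, not the repository's code) of the two input styles side by side. With one-hot inputs the RNN's input weight matrix effectively plays the role of the embedding, while nn.Embedding makes that lookup explicit and learnable:

    import numpy as np
    import torch
    import torch.nn as nn

    n_class, embedding_dim = 7, 5
    ids = [3, 1, 6]                                      # token indices for one sequence

    # Style 1: one-hot rows, as the tutorial does with np.eye
    one_hot = torch.FloatTensor(np.eye(n_class)[ids])    # [seq_len, n_class]

    # Style 2: a learned dense lookup table
    emb = nn.Embedding(n_class, embedding_dim)
    dense = emb(torch.tensor([ids]))                     # [1, seq_len, embedding_dim]

    print(one_hot.shape, dense.shape)                    # torch.Size([3, 7]) torch.Size([1, 3, 5])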
