keon / seq2seq

Minimal Seq2Seq model with Attention for Neural Machine Translation in PyTorch

License: MIT License

Language: Python 100.00%
Topics: seq2seq, deep-learning, machine-translation

seq2seq's Introduction

mini seq2seq

Minimal Seq2Seq model with attention for neural machine translation in PyTorch.

This implementation focuses on the following features:

  • Modular structure to be used in other projects
  • Minimal code for readability
  • Full utilization of batches and GPU.

This implementation relies on torchtext to minimize dataset management and preprocessing.

Model description

Requirements

  • GPU & CUDA
  • Python3
  • PyTorch
  • torchtext
  • Spacy
  • numpy
  • Visdom (optional)

Download the spaCy tokenizer models:

python -m spacy download de
python -m spacy download en
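
A minimal sketch (not necessarily this repo's exact code) of how the downloaded spaCy models are typically wired into torchtext Fields for Multi30k; the field options (init_token, eos_token, lower, min_freq) are illustrative assumptions:

import spacy
from torchtext.data import Field
from torchtext.datasets import Multi30k

spacy_de = spacy.load('de')  # German model downloaded above
spacy_en = spacy.load('en')  # English model downloaded above

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

DE = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
EN = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))
DE.build_vocab(train.src, min_freq=2)
EN.build_vocab(train.trg, min_freq=2)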

References

Based on the following implementations

seq2seq's People

Contributors

amitmy, keon, linao1996, pskrunner14

seq2seq's Issues

About overfitting

I tried the code yesterday; after 100 epochs the training error went down to almost zero, yet the test error is 7.23, rendering the model almost useless.
Early stopping won't help, since the validation error never went below 4.
Any advice?

torchtext Multi30k

When using the following call to create the data splits
train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))
I got the following error message:


//anaconda/lib/python3.5/site-packages/torchtext/datasets/translation.py in __init__(self, path, exts, fields, **kwargs)
     31
     32         examples = []
---> 33         with open(src_path) as src_file, open(trg_path) as trg_file:
     34             for src_line, trg_line in zip(src_file, trg_file):
     35                 src_line, trg_line = src_line.strip(), trg_line.strip()

FileNotFoundError: [Errno 2] No such file or directory: '.data/val.de'

Do you have any idea what is going on?
Thank you in advance.
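
A hedged guess at the usual cause, with a small diagnostic sketch: the legacy torchtext Multi30k.splits() looks for the extracted split files under its root directory (default '.data'), so this FileNotFoundError typically means the download or extraction of one split was incomplete. The file list below is an assumption; deleting the partial '.data' directory and re-running usually re-triggers the download.

import os

root = '.data'
expected = ['train.de', 'train.en', 'val.de', 'val.en']  # the test split files should be present too
missing = [f for f in expected if not os.path.exists(os.path.join(root, f))]
print('missing files:', missing)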

A problem with loss computation.

loss = F.nll_loss(output[1:].view(-1, vocab_size), trg[1:].contiguous().view(-1), ignore_index=pad)

The loss computed by the line above is averaged over every time step, which can make the model difficult to train.
I suggest accumulating (summing) the loss over the time steps instead; in my experiments this made the model easier to train.
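
A hedged sketch of that suggestion, reusing the names from the line above (output, trg, vocab_size, pad) and assuming trg has shape (maxLen, batch) as elsewhere in this repo: sum the NLL over all non-pad tokens and normalize by the batch size only.

import torch.nn.functional as F

loss = F.nll_loss(output[1:].view(-1, vocab_size),
                  trg[1:].contiguous().view(-1),
                  ignore_index=pad,
                  reduction='sum') / trg.size(1)  # sum over time steps, mean over the batch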

about the way to calculate attention weight

It seems that the way the attention weights are calculated differs from the original paper, softmax(v * tanh(W * [s; h])): here softmax is applied inside score() and relu is applied afterwards. Can you give a reason or a reference?

def forward(self, hidden, encoder_outputs):
    timestep = encoder_outputs.size(0)
    h = hidden.repeat(timestep, 1, 1).transpose(0, 1)
    encoder_outputs = encoder_outputs.transpose(0, 1)  # [B*T*H]
    attn_energies = self.score(h, encoder_outputs)
    return F.relu(attn_energies).unsqueeze(1)

def score(self, hidden, encoder_outputs):
    # [B*T*2H] -> [B*T*H]
    energy = F.softmax(self.attn(torch.cat([hidden, encoder_outputs], 2)), dim=2)
    energy = energy.transpose(1, 2)  # [B*H*T]
    v = self.v.repeat(encoder_outputs.size(0), 1).unsqueeze(1)  # [B*1*H]
    energy = torch.bmm(v, energy)  # [B*1*T]
    return energy.squeeze(1)  # [B*T]
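
For comparison, a minimal sketch (not the repository's code) of the standard additive (Bahdanau) scoring the issue refers to: energies = v^T tanh(W[s; h]), followed by a softmax over the time dimension. Shapes follow the snippet above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.rand(hidden_size))

    def forward(self, hidden, encoder_outputs):
        # hidden: [B, H] decoder state, encoder_outputs: [B, T, H]
        T = encoder_outputs.size(1)
        h = hidden.unsqueeze(1).repeat(1, T, 1)                                  # [B, T, H]
        energy = torch.tanh(self.attn(torch.cat([h, encoder_outputs], dim=2)))   # [B, T, H]
        scores = energy.matmul(self.v)                                           # [B, T]
        return F.softmax(scores, dim=1)                                          # weights over time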

About the usage of initial hidden state in calculating attention

Hi, this is a really good seq2seq implementation in PyTorch, but I have a doubt about the following line:

hidden = hidden[:self.decoder.n_layers]

We can see from the code that the hidden state used to calculate the attention is the final hidden state from the encoder, while in theory it should be the hidden state from the decoder at the current time step. Do you agree? Thx :-)

Why use relu to compute additive attention

1. Attention formula

  • In the normal additive version, the attention score is:
score = v * tanh(W * [hidden; encoder_outputs])
  • In your code:
score = v * relu(W * [hidden; encoder_outputs])

2. Question

Is there some trick here, or is this the result of an experimental comparison?

Model still uses teacher forcing when evaluating

Hi, I think there might be a bug in the evaluate function in train.py: when the trained model is evaluated with the evaluate function, the model still uses teacher forcing, so there is a 50% probability that the true label is fed in as the input for the next time step. I think this is unsuitable, because during evaluation we should always feed the predicted label as the input for the next step.
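
A hedged sketch of the fix being suggested: evaluate with teacher forcing disabled so the decoder always consumes its own previous prediction. The teacher_forcing_ratio argument is an assumption about the model's forward signature (many seq2seq implementations expose one), and model, val_iter, vocab_size, and pad follow the surrounding snippets.

import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    for batch in val_iter:
        src, trg = batch.src, batch.trg
        output = model(src, trg, teacher_forcing_ratio=0.0)  # never feed the gold token back in
        loss = F.nll_loss(output[1:].view(-1, vocab_size),
                          trg[1:].contiguous().view(-1),
                          ignore_index=pad)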

A question about the nn.Embedding

Thank you for sharing this project's code; I have a question about nn.Embedding.

In this project, the shape of src and trg is (maxLen, batch size). The Encoder's forward is:

    def forward(self, src, hidden=None):
        embedded = self.embed(src)
        outputs, hidden = self.gru(embedded, hidden)
        # sum bidirectional outputs
        outputs = (outputs[:, :, :self.hidden_size] +
                   outputs[:, :, self.hidden_size:])
        return outputs, hidden

When I debug it, the shape of src is (37, 32), where 32 is the batch size.
However, when I read the documentation of nn.Embedding, the example code shows:

>>> # a batch of 2 samples of 4 indices each
>>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
>>> embedding(input)

Thus, the input of Embedding should be (batch size, maxLen).

This makes me very confused.

Any suggestion is appreciated!
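
A short sketch addressing the shape question: nn.Embedding accepts index tensors of any shape and simply appends the embedding dimension, so (maxLen, batch) is just as valid as (batch, maxLen). The relevant constraint is that nn.GRU without batch_first=True expects (seq_len, batch, features), which is why this project keeps src as (maxLen, batch).

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=8)
gru = nn.GRU(input_size=8, hidden_size=16)      # batch_first=False by default

src = torch.randint(0, 100, (37, 32))           # (maxLen, batch), as in the debug run above
embedded = embed(src)                           # (37, 32, 8)
outputs, hidden = gru(embedded)                 # outputs: (37, 32, 16)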

A bug in the loss function in 'train' and 'evaluate'

Hi, I found that the loss function in 'train' and 'evaluate' is cross-entropy, but in model.py the decoder's output is passed through a log-softmax. According to its definition, cross_entropy already includes the log_softmax operation. I think the loss function in 'train' and 'evaluate' should be nll_loss; the decoder's final log-softmax combined with nll_loss then gives the cross-entropy.
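
A quick numerical check of this point: cross_entropy already applies log_softmax internally, so log_softmax followed by nll_loss is equivalent to cross_entropy applied to raw logits (and applying cross_entropy to log-softmax outputs would double-count the normalization).

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                     # (batch, vocab) raw decoder scores
targets = torch.randint(0, 10, (4,))

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))                  # True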

[ERROR] relu does not work

The following error occurs when executing "train.py".

TypeError: relu() got an unexpected keyword argument 'dim'

I checked the official PyTorch documentation, and relu() does not take a "dim" argument in any version.
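
For illustration, a tiny sketch of the difference: F.softmax takes a dim argument because it normalizes along a dimension, while F.relu is elementwise and accepts no dim, so passing one raises exactly this TypeError.

import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
F.softmax(x, dim=1)    # fine: softmax normalizes along a dimension
F.relu(x)              # fine: relu is elementwise
# F.relu(x, dim=1)     # TypeError: relu() got an unexpected keyword argument 'dim'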

random seed

The demo doesn't seem to fix the random seed, so the log is different on every run.
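
A hedged sketch of fixing the seeds so runs are repeatable (the cudnn flags are needed for full determinism on GPU, at some speed cost):

import random
import numpy as np
import torch

SEED = 1234                                     # any fixed value
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False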
