
universal-transformer-pytorch's Introduction

Universal-Transformer-Pytorch

Simple and self-contained implementation of the Universal Transformer (Dehghani et al., 2018) in PyTorch. Please open an issue if you find bugs, and send a pull request if you want to contribute.

GIF taken from: https://twitter.com/OriolVinyalsML/status/1017523208059260929

Universal Transformer

The basic Transformer model is taken from https://github.com/kolloldas/torchnlp. The following has been implemented so far:

  • Universal Transformer encoder-decoder, with position and timestep embeddings (the recurrence is sketched after this list).
  • Adaptive Computation Time (Graves, 2016), as described in the Universal Transformer paper.
  • Universal Transformer for the bAbI data.
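
As a rough sketch of the core idea (illustrative names only, not the repo's actual API): the Universal Transformer applies one shared transition block repeatedly, re-adding the position signal and a per-step timestep signal before every application.

def ut_encoder_sketch(x, transition_fn, pos_signal, time_signals, num_steps):
    # x: (batch, seq, hidden); pos_signal: (seq, hidden); time_signals: (num_steps, hidden)
    # One shared block applied num_steps times, with coordinate embeddings
    # re-injected at every step, as in Dehghani et al. (2018).
    for t in range(num_steps):
        x = x + pos_signal          # position embedding (same every step)
        x = x + time_signals[t]     # timestep embedding (varies per step)
        x = transition_fn(x)        # shared self-attention + transition block
    return x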

Dependencies

python3
pytorch 0.4
torchtext
argparse

How to run

To run standard Universal Transformer on bAbI run:

python main.py --task 1

To run Adaptive Computation Time:

python main.py --task 1 --act

Results

Results on the bAbI 10k setting; the best result over 10 runs is reported. Values are error rates (%).

Tasks 16, 17, 18, and 19 are very hard to converge on, even on the training set; the problem seems to be the learning-rate scheduling. Moreover, the results on the 1k setting are still very bad, so some hyper-parameters may need tuning.

Task   Uni-Trs   + ACT   Original
----   -------   -----   --------
1      0.0       0.0     0.0
2      0.0       0.2     0.0
3      0.8       2.4     0.4
4      0.0       0.0     0.0
5      0.4       0.1     0.0
6      0.0       0.0     0.0
7      0.4       0.0     0.0
8      0.2       0.1     0.0
9      0.0       0.0     0.0
10     0.0       0.0     0.0
11     0.0       0.0     0.0
12     0.0       0.0     0.0
13     0.0       0.0     0.0
14     0.0       0.0     0.0
15     0.0       0.0     0.0
16     50.5      50.6    0.4
17     13.7      14.1    0.6
18     4.0       6.9     0.0
19     79.2      65.2    2.8
20     0.0       0.0     0.0
----   -------   -----   --------
avg    7.46      6.98    0.21
fail   3         3       0

TODO

  • Visualize ACT on different tasks


universal-transformer-pytorch's Issues

Unable to reproduce results (tested on Task 1 & 2)

Hi,

I ran the experiments on the 10k setting, but my results are far worse than the reported ones.
I didn't change any of the default parameters except setting the tenK param (main.py, line 64) to True. Then I ran python main.py --act --verbose --cuda.

There are no errors and the results from 10 runs are:
Task 1
Noam False ACT True Task: 1 Max: 0.492 Mean: 0.42350000000000004 Std: 0.0808062497582953
Task 2
Noam False ACT True Task: 2 Max: 0.323 Mean: 0.26880000000000004 Std: 0.04480580319556831

I have not tried the other tasks (though at least task 3 seems to behave the same), as something seems to be going wrong generally. The results are the same in a non-CUDA setup, and worse without ACT enabled.

I'm running with the following versions:
python 3.6.8
pytorch 0.4.0 (also tried 0.4.1 and 1.0.0)
torchtext 0.3.1
argparse 1.4.0

Thanks for your help!

"task" argument has no effect

Hi,
currently the --task argument is ignored, due to line 153ff in main.py, so the script always runs all bAbI tasks in a row. A possible fix is sketched below.
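
A plausible fix, as a hypothetical sketch (args.task mirrors the --task flag, and run_babi_task stands in for the existing per-task training call): restrict the loop to the requested task instead of sweeping all twenty.

tasks = [args.task] if args.task is not None else list(range(1, 21))
for task_id in tasks:
    run_babi_task(task_id)  # placeholder for the existing per-task training logic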

Execution without CUDA throws an error

Hi,
when running the script on a machine without CUDA support, I get the following error:

File ".../Universal-Transformer-Pytorch/models/UTransformer.py", line 236, in forward
halting_probability = torch.zeros(inputs.shape[0],inputs.shape[1]).cuda()
RuntimeError: torch.cuda.FloatTensor is not enabled.

I suppose lines 236-242 in UTransformer.py need an additional CUDA check.
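
One device-agnostic way to write those allocations, as a sketch (buffer names beyond halting_probability are illustrative): allocate on whatever device the inputs already live on.

import torch

def init_act_buffers(inputs):
    # Follow the device of the inputs instead of hard-coding .cuda().
    device = inputs.device
    halting_probability = torch.zeros(inputs.shape[0], inputs.shape[1], device=device)
    remainders = torch.zeros_like(halting_probability)  # illustrative companion buffers
    n_updates = torch.zeros_like(halting_probability)
    return halting_probability, remainders, n_updates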

if-statement on projecting embedding to hidden size

I found that at models/UTransformer.py:110 and 194 you have the following code:

self.proj_flag = False
if(embedding_size == hidden_size):
    self.embedding_proj = nn.Linear(embedding_size, hidden_size, bias=False)
    self.proj_flag = True

I'm confused that you project the embedding to hidden_size when embedding_size == hidden_size; what happens when embedding_size != hidden_size? Doing nothing? Wouldn't that lead to a size mismatch?
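
Presumably the intended condition is the negation, i.e. only add a projection when the sizes differ. A sketch of the likely intent (mirroring the snippet above, inside the module's __init__):

# Likely intent: project only when embedding and hidden sizes differ.
self.proj_flag = False
if embedding_size != hidden_size:
    self.embedding_proj = nn.Linear(embedding_size, hidden_size, bias=False)
    self.proj_flag = True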

Question about PE

Hi, I noticed you implemented a function to compute the position embedding; however, I could not find anywhere it is used. Can you please help me understand how you incorporate position information into the model?
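
For context, the Universal Transformer paper adds two sinusoidal "coordinate" signals at every recurrent step: one indexed by position and one by the timestep. A minimal sketch of such a signal (illustrative, not necessarily the repo's exact helper):

import math
import torch

def timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    # Sinusoidal signal of shape (length, channels), tensor2tensor-style;
    # index it by position for the position signal, by step for the timestep signal.
    position = torch.arange(length, dtype=torch.float32)
    num_timescales = channels // 2
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * torch.exp(
        -torch.arange(num_timescales, dtype=torch.float32) * log_increment)
    scaled = position.unsqueeze(1) * inv_timescales.unsqueeze(0)
    return torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=1)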

Questions about the result

Very glad to find this code. I'm a little confused by the output result: is the output acc the accuracy or the error rate? According to the code it seems to be accuracy, but if so it is too small, as I got an output acc of almost 0. Could you please help me figure out the problem?

ReLU in PositionwiseFeedForward

Here i is the index into self.layers, so it is always less than the length of self.layers and the check is always true.

Probably you mean

if i < len(self.layers) - 1

Then there would be no ReLU or Dropout after the last position-wise layer.
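
A sketch of the loop with that corrected condition (illustrative structure; the repo's actual layer construction may differ):

import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForwardSketch(nn.Module):
    def __init__(self, sizes, dropout=0.1):
        super().__init__()
        # sizes = [d_in, d_hidden, ..., d_out]; one Linear per consecutive pair.
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(sizes[:-1], sizes[1:]))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:  # the condition suggested above
                x = self.dropout(F.relu(x))  # no ReLU/Dropout after the last layer
        return x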

Expressing interest in this work

Hi, I found this implementation very interesting.
I would like to understand more about the Universal Transformer, since I think it could allow much smaller LLMs with higher performance.

P.S. I am Italian too.

May I ask you some questions about it?

Probability exceeds threshold at step 2 from the second epoch onwards

Hi,

when I run the model, the first epoch can reach the max step of 24, but from the second or third epoch onwards the probability from p = self.sigma(self.p(state)).squeeze(-1) becomes very close to the threshold and exceeds it at step 2, so my encoder effectively has only 2 layers. Any idea why?
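
For reference, a sketch of the ACT halting logic (per Graves, 2016, and the UT paper; names are illustrative, and remainder bookkeeping is omitted): per-position halting probabilities accumulate each step, and a position stops once its sum crosses 1 - eps. If the halting unit saturates near 1 after some training, the sum crosses the threshold after only a couple of steps, which matches the behavior described above.

import torch

def act_steps_sketch(state, halting_unit, transition_fn, max_steps, eps=0.01):
    # state: (batch, seq, hidden); halting_unit: hidden -> per-position prob in (0, 1).
    batch, seq, _ = state.shape
    halting_prob = torch.zeros(batch, seq, device=state.device)
    step = 0
    while step < max_steps and (halting_prob < 1.0 - eps).any():
        p = halting_unit(state).squeeze(-1)              # the sigma(p(state)) in the issue
        still_running = (halting_prob < 1.0 - eps).float()
        halting_prob = halting_prob + p * still_running  # accumulate only where not halted
        state = transition_fn(state)                     # shared transition block
        step += 1
    return state, step  # few steps taken <=> p was already close to the threshold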
