lucidrains / x-transformers Goto Github PK

View Code? Open in Web Editor NEW

4.3K 52.0 368.0 38.14 MB

A simple but complete full-attention transformer with a set of promising experimental features from various papers

License: MIT License

Python 100.00%

artificial-intelligence deep-learning attention-mechanism transformers

x-transformers's People

Contributors

Stargazers

Watchers

Forkers

zeta1999 assassinsurvivor gheyret roholazandie adrian-spataru mldl lunixbochs wangcongcong123 tubbz-alt kingoflolz manojkesani yangsenwxy forks-learning ml-and-ai-repo gyq716 hadaev8 chaoso adrianmargin silver-birch-wawa dukemilo shunsunsun laszlokiraly sergioarnaud cxz jawaechan dumpmemory yu-shui guillefix sailfish009 proishan11 xinshao-wang kapitsa2811 mbrukman hbcbh1999 monad-one dskov xtrigold pascalnotin superxiang gurkenglas neverix jbcdnr mattmcpartlon jhxu-org lahiruts cfoster0 qingxinhu123 rom1504 ocorcoll java-abhinav07 hadorganization kinkir tmphex vein1990 zhendongwang6 ashishpatel26 booydar stas-sl napoler mathsml doodlebot-ai srv902 seutao atze00 vasiliyeskin techthiyanes barvin04 maxylee ajinkyapuar nakajimakou1 gurvindersingh jeffhsu3 iloveoreo afcarl davidshisui zlapp golubovic rg6969 ncoop57 zxhjames jsyzc2019 k-tanjirou nostalgebraist merouone judepark96 fwka92 jia-creator eforw maerad7 evdcush kumapowerliu xin-zhou-smu vovanphuc galatolofederico neonbjb ukdsvl gdevos010 nikolatesla56 a-antoniades xuanzhangyang

x-transformers's Issues

Reason for doing partial rotary embedding?

In a recent commit only half the head vector is "rotated", could this improve the overall performance? Thanks!
4b395ab

-    self.rotary_pos_emb = RotaryEmbedding(dim_head) if rotary_pos_emb else always(None)
+    rotary_emb_dim = max(default(rotary_emb_dim, dim_head // 2), 32)

Memory Efficiency w.r.t Sequence Length

I am a bit of a noob when it comes to transformers. If I want to encode a batch of N sequences of maximum length L, my understanding is that I do something like this:

from x_transformer import Encoder, TransformerWrapper
seqs = ['aba','cb','abcab']
N = len(seqs)
L = max(len(seq) for seq in seqs)
C = 3
padded_seqs = get_padded_seqs(seqs) # N x L long tensor
mask = get_seq_mask(seqs) # N x L boolean tensor
encoder = TransformerWrapper(num_tokens=C,max_seq_len=L,attn_layers=Encoder())
embeddings = encoder(padded_seqs,mask=mask,return_embeddings=True)

In this transformer implementation, would there be a difference in memory usage if all of the sequences were of length L (i.e. all the mask values were True)?

Add tensorflow 2 version of them package

Hi,
It's a really great package. Could you please kindly consider creating the TF 2 version of this package?

Pay Attention When Required

First, thanks for the great repo!

Here's a recent paper from NVIDIA: https://arxiv.org/pdf/2009.04534v2.pdf
Seems like a similar concept to Sandwich, but faster, simpler, and near identical perplexity.

Edit: Oh, I see you mention it already. Is there a parameter exposed for it already?

This was further corroborated by a paper by Nvidia that reduces the number of attention layers to be 1/3rd of the feedforwards without loss in performance.

AutoCast for mixed precision/fp16 fails?

I have tried to train the model using torch.cuda.amp.autocast() but the training doesn't seems to speeds up or memory usage remains same as with fp32 training.
also model size remain same with or without autocast
Can you help what could be the reason.
i also used huggingface Accelerate :https://github.com/huggingface/accelerate but cant achieve mixed precision.

Simple feature request: transformers for continuous inputs

I think it would be useful to add an option to the TransformerWrapper, or perhaps to make a new Wrapper type, that does not use embedding layers, so that the inputs are real-valued vectors. This would allow to use x-transformers for tasks with continuous inputs. For example here they use transformers in that way, and also most of the times transformers are applied to regression tasks.

Note that in this case, one then just talks about input and output dimension, and not number of tokens.

Perhaps the cleanest way to do this is to make a new Wrapper type that works with continuous vector inputs, and then make TransformerWrapper use this ContinuousTransformerWrapper inside it, with input dimension being the embeding dimension, and output dimension being num_tokens. Hope this makes sense!

Shared Embeddings

Sharing the token_emb between Encoder & Decoder is not by default. Lot of transformers like BART/T5 use a shared encoder/decoder embedding.

model = XTransformer(
    dim = 512,
    enc_num_tokens = 256,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 1024,
    dec_num_tokens = 256,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 1024,
    enc_num_memory_tokens = 0,
    
)

model.decoder.token_emb = model.encoder.token_emb

Would be this enough?

Furthermore the example for Encoder/Decoder in ReadMe doesn't work out of the box, it needs also a value for enc_num_memory_tokens

`rotary_pos_emb = True` causes an exception to be raised when the model is pickled.

import torch
from x_transformers import ContinuousTransformerWrapper, Encoder

model = ContinuousTransformerWrapper(
    max_seq_len=128,
    attn_layers = Encoder(
        dim = 32,
        depth = 2,
        heads = 1,
        rotary_pos_emb = True # This line is the problem.
    )
)

with open("qwe.nnreg",'wb') as f: torch.save(model,f)

Traceback (most recent call last):
  File "c:/Users/Marko/Source/Repos/The Spiral Language/Spiral Compilation Tests/cython_experiments/ui_holdem8 (transformers)/script1.py", line 15, in <module>
    with open("qwe.nnreg",'wb') as f: torch.save(model,f)
  File "C:\Users\Marko\anaconda3\lib\site-packages\torch\serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\Marko\anaconda3\lib\site-packages\torch\serialization.py", line 484, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'always.<locals>.inner'

I'll have to skip using the rotary embedding until this is resolved. Without the highlighted line the pickling works fine.

[Question!] How to Inject Rotary Positional Embeddings in Linear Transformers

Hello Phil,

Do you mind how to inject the rotary positional embeddings into the linear transformers ?

import torch
from torch.nn import Module

from ..attention_registry import AttentionRegistry, Optional, Callable, Int, \
    EventDispatcherInstance
from ..events import EventDispatcher
from ..feature_maps import elu_feature_map


class LinearAttention(Module):
    """Implement unmasked attention using dot product of feature maps in
    O(N D^2) complexity.
    Given the queries, keys and values as Q, K, V instead of computing
        V' = softmax(Q.mm(K.t()), dim=-1).mm(V),
    we make use of a feature map function Φ(.) and perform the following
    computation
        V' = normalize(Φ(Q).mm(Φ(K).t())).mm(V).
    The above can be computed in O(N D^2) complexity where D is the
    dimensionality of Q, K and V and N is the sequence length. Depending on the
    feature map, however, the complexity of the attention might be limited.
    Arguments
    ---------
        feature_map: callable, a callable that applies the feature map to the
                     last dimension of a tensor (default: elu(x)+1)
        eps: float, a small number to ensure the numerical stability of the
             denominator (default: 1e-6)
        event_dispatcher: str or EventDispatcher instance to be used by this
                          module for dispatching events (default: the default
                          global dispatcher)
    """
    def __init__(self, query_dimensions, feature_map=None, eps=1e-6,
                 event_dispatcher=""):
        super(LinearAttention, self).__init__()
        self.feature_map = (
            feature_map(query_dimensions) if feature_map else
            elu_feature_map(query_dimensions)
        )
        self.eps = eps
        self.event_dispatcher = EventDispatcher.get(event_dispatcher)

    def forward(self, queries, keys, values, attn_mask, query_lengths,
                key_lengths):
        # Apply the feature map to the queries and keys
        self.feature_map.new_feature_map(queries.device)
        Q = self.feature_map.forward_queries(queries)
        K = self.feature_map.forward_keys(keys)

        # Apply the key padding mask and make sure that the attn_mask is
        # all_ones
        if not attn_mask.all_ones:
            raise RuntimeError(("LinearAttention does not support arbitrary "
                                "attention masks"))
        K = K * key_lengths.float_matrix[:, :, None, None]

        # Compute the KV matrix, namely the dot product of keys and values so
        # that we never explicitly compute the attention matrix and thus
        # decrease the complexity
        KV = torch.einsum("nshd,nshm->nhmd", K, values)

        # Compute the normalizer
        Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps)

        # Finally compute and return the new values
        V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)

        return V.contiguous()

Thanks!

will this be included?

Hi @lucidrains I just want to see great work and I got much inspiration from this repo,
just wondering, is this helpful?
https://arxiv.org/pdf/2103.15722.pdf

it seems they got pretty good result.

Feature request for another relative positional embedding

This paper claim to have the best relative positional embedding to the moment
https://arxiv.org/abs/2009.13658

Here is an example of implementation
https://github.com/hadaev8/transformers/blob/37712fdd1cb9ed83ebcc888f184296c135f90be4/src/transformers/models/bert/modeling_bert.py#L271

The only thing that should be considered for this lib is the distance of memory tokens.

with `mask` output is `nan`

I am trying to run image captioning example with the mask for caption. Without mask forward pass runs properly but with mask I get the nan. You can reproduce by running following simple example.

import torch
from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder

encoder = ViTransformerWrapper(
    image_size = 256,
    patch_size = 32,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

decoder = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        cross_attend = True
    )
)

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))
mask = torch.ones_like(caption, dtype=torch.bool)
encoded = encoder(img)
decoder(caption, context = encoded, mask = mask)

output:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], grad_fn=<UnsafeViewBackward>)

How to make inference fast (by adding caching of key / values)

How can we implement the below caching technique in the code?

That would be awesome.
What I have tried to speed up the inference in my custom implementations for autoregressive self-attention is caching the output of the self-attention at timestep T and then, in timestep T+1, passing the full keys/values but only passing the last element of the query sequence, then getting the output and concatenating it with the cache, that way each query can pay attention to the full previous sequence but we don't need to compute attention for all the previous queries when we only need the output at T+1
It looks something like this:

But I only achieved a x3 speedup 🤔

I actually needed to perform autoregressive inference in a very large dataset, and it was taking more than 1 day even with the above speedup. I am currently doing some weird custom stuff, keeping the Transformer attention layers but replacing the self-attention layers with LSTMs, which are way faster at generating sequences token by token, and with that I achieve the x10 speedup that I needed.

Originally posted by @pabloppp in #21 (comment)

how to update encoder-decoder model training parameters

Hi, I checked this example in which encoder & decoder are defined inside one model. And from the code I understood model train forward/backword pass.

How will forward/backword pass work for lets say different encoder & different decoder (like you mentioned in Image -> caption example)

Any starting point/example would be appreciated.

sequence length independent generation

Currently generation require passing sequence length to generate sequences of given length but say in tasks such as summary or translation, one doesn't know about the final sequence length. Currently I am trying to generate candidates with passing various lengths as work around. Also is it possible to add support for beam search method for generation in addition to current top_p/top_k methods.

[feature request] leverage sota optimizer/activation function

https://github.com/lessw2020/Ranger21

Feature request for Disentangled Attention from DeBERTa

Paper in case you missed it.
https://arxiv.org/abs/2006.03654

Benchmarking Transformers

(Co-Author of ReZero)
I noticed this repository and it's very cool that there are many transformer variants being implemented. I wonder if there exists a benchmark for all these experimental features. If not, it might be a useful to benchmark all these variants and their performance on a table (efficiency, pre-training performance on The Pile, for example).

There have been review papers out there that perform some benchmarking of Transformer variants, but they get outdated very quickly. PDFs aren't a great format for this type of thing. It'll be highly useful if there's a working Github that contains these benchmarks and people can collaborate on to "add their method".

I'm interested in discussing ideas/collaborating if others are.

Invalid example at TransformerWrapper constructor

In the example you use "return_logits" argument in TransformerWrapper constructor but it doesn't exist.

x-transformers/examples/enwik8_simple/train.py

Line 43 in 652ae8b

return_logits = True

Encoder/decoder return attention maps?

Would be nice for logging.

Feature request for Image Transformer example

Thank you for creating a great framework for various transformer networks.
It would be great to add the image transformer (https://arxiv.org/abs/1802.05751).

Option to pass in memory_mask

Hi Phil,

Can we have the option to pass in memory_mask, just like in Official Pytorch Transformers?

Thanks.

Importing AttentionLayers

I want to have two encoders (not seq2seq) and seems like i cant use default abstractions.
Would be nice to be able to import AttentionLayers class from lib.

Hopfield Nets for memory purpose in x-transformers?

Hi, this x-transformers repo. is having alot of very useful features all at one place, though I was thinking if Modern hopfields may result in an increase in performance? The implementation is given here https://github.com/ml-jku/hopfield-layers
Though I couldn't understand how to use it for memory purposes.
What are your views about it? Are modern hopfields any useful as associative memory nets ? and if so, how should they be implemented? cause just adding them like lookup-layer didn't gave any special performance improvement.

Understanding the key and value transmission from the encoder output to the decoder

Once again, thank you Phil for the amazing work and time you put into your work. I appreciate it! Two questions though:

Do I understand correctly that the keys and values that are passed from the encoder output to every layer of the decoder is nothing else but one single entity/tensor of shape (batch_size, num_tokens, dim_embedding) and it is only inside of the decoder layers that this tensor is then split into a key and a value tensor by means of multiplications with respective learnable tensors? So there really is only one single tensor that is passed to the decoder as opposed to many illustrations, cf.

The split is performed in every decoder layer anew, correct?

I am trying to understand the resemblance/analogy to classical convnets in which after several layers of convolutions and pooling one usually uses the embeddings with reduced spatial dimension but increased feature/channel dimension for some further downstream tasks, cf. VGG architecture.

Is this akin to the encoder part of the Vision Transformer?
the output of the encoder are separate feature vectors for every token/image patch which seems to be different from the resulting embeddings of convnets...

Talking again about ViT, could a decoder be thought of as kind of an upsampling/generative counterpart (cf. GANs) of a downsampling part, i.e. the encoder? I try to relate encoder and decoder and their use cases to the more classical architectures in Deep Learning.

Thank you in advance!

Technical question

@lucidrains Hey bro!

I think I finally got it to work properly and it shows good results with music. I am really enjoying your creation. Very good job. Thanks.

@lucidrains

I do not know where to post questions on your GitHub so I am posting here.

I wanted to ask you if you know what "multiple-embedding" maybe? Have you ever heard about such a thing?

From what I understand, it means that the transformer can accept several tokens at once as input. Ideally, each layer can accept tokens, so if you have 6 layers, you would therefore feed it 6 tokens of input.

It is sorta like distributed training as far as I understand but for one worker.

Does any of it make sense? Can you consider looking into it? I think it would be a great addition to your x-transformer. It would make it fast as hell and probably more capable as the model will be able to make connections between input tokens. I.e. in music, this would allow multiple instruments for example.

And a side question....what is the hype around the reformer? Yes, it is tiny and trains well, but it is not even close to GPT3 afaik, so I am really confused...Is it cuz it's Google or something?

Thanks.

Your time and responses will be very much appreciated.

Alex

Why if not gate_residual: ?

Seems to me it should not have not

Gradient Checkpointing not working(implemented manually on my copy).

gradient checkpointing is not working with x-transformer. I implemented it in the main code, but it constantly shows different kinds of errors.

[Featue request] Expire span : Learning to Forget by Expiring

This is a nice feature that can complement this project:
Video by Yannic Kilcher: https://www.youtube.com/watch?v=2PYLNHqxd5A

Error with last update

  File "/usr/local/lib/python3.6/dist-packages/x_transformers/x_transformers.py", line 820, in __init__
    super().__init__(causal = False, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/x_transformers/x_transformers.py", line 732, in __init__
    residual_fn = Residual(dim)
TypeError: __init__() takes 1 positional argument but 2 were given

onnx model?

onnx conversion error:
triu operator not supported in line (for attention):
if.causal:
mask = torch.zeros((i,j)).triu_(j - i + 1).bool()

Any Help

Learned memory keyword change

I believe an earlier version may have used attn_num_mem_kv as the keyword argument for setting how many persistent memory vectors should be used. It looks like this is now num_mem_kv. The example in the readme has the older form.

Raise error if cross_attend=False but context passed

Just realized i forgot set true lol

Feature request for customizing inner dim of ff module

Hello again!
In this paper they use self attention dim 1024 and ff dim 768
https://arxiv.org/pdf/2010.10499.pdf

Is it possible to support fp16/mixed precision training

Hi @lucidrains

Thanks for this wonderful repository.

I think value rel_pos_num_buckets > rel_pos_max_distance should raise error

Feature request for adding Memformer memory

So i've checked the memformer repository and this one. I think it would be good to add the memory from the memformer to this project also since it seems more general and better than the transformer XL memory.

Scaling in normalization modules

Hi, thank you very much for publishing this awesome repository :) After studying the recent changes, I was confused with the introduction of self.scale = dim ** -0.5 into ScaleNorm and RMSNorm in this commit.

If I understand the code correctly, the modules multiply the normalized variable by dim ** 0.5 (these two lines divide the variable by dim ** -0.5). Since both the queries and the keys are multiplied like this, the attention matrix is effectively multiplied by dim, which goes against the usual practice to multiply it by dim ** -0.5 (as you also do in the code).

I believe the normalized variables shouldn't be multiplied like this, what is the reason behind scaling it? Thank you very much for your time.

Return error on one block transformer TypeError: 'int' object is not iterable

Code to reproduce:

from x_transformers import Encoder
test = Encoder(dim=768, depth=1, heads=12, rotary_pos_emb=True)
test(torch.rand(32, 128, 768))```

Different Inference Resuts?

I have trained a transformer model, and noticed something strange,
For same input, Final decoder output shape varies and few final decoded tokens are different for each inference., Example:
First Infernece:
Output from Decoder torch.Size([1, 178]) |
tensor([[ 87, 267, 11, 417, 319, 333, 290, 286, 383, 280, 418, 353, 336, 404,
286, 290, 292, 542, 365, 364, 493, 445, 52, 53, 560, 505, 40, 41,
354, 400, 291, 319, 408, 269, 51, 32, 268, 11, 291, 319, 408, 269,
277, 505, 32, 23, 24, 268, 11, 271, 98, 292, 418, 286, 290, 301,
560, 291, 319, 408, 269, 41, 32, 889, 268, 11, 271, 98, 292, 418,
286, 290, 301, 334, 653, 291, 319, 408, 269, 52, 268, 11, 291, 280,
354, 280, 313, 269, 277, 268, 11, 291, 286, 325, 280, 334, 319, 418,
280, 610, 55, 522, 450, 74, 85, 88, 374, 820, 578, 780, 269, 51,
52, 53, 268, 11, 326, 505, 40, 41, 269, 277, 268, 11, 935, 748,
269, 51, 32, 268, 11, 748, 269, 39, 32, 23, 24, 268, 11, 328,
576, 733, 326, 748, 269, 41, 32, 277, 906, 268, 11, 328, 576, 733,
17, 740, 595, 421, 283, 569, 287, 748, 269, 52, 34, 277, 906, 268,
11, 417, 292, 325, 301, 505, 576, 733, 269, 2]], device='cuda:0')

Second time infer for same input:
Output from Decoder torch.Size([1, 183]) |
tensor([[ 87, 267, 11, 417, 319, 333, 290, 286, 383, 280, 418, 353, 336, 404,
286, 290, 292, 542, 365, 364, 493, 269, 51, 52, 53, 268, 11, 560,
505, 40, 41, 354, 400, 291, 319, 408, 269, 51, 32, 268, 11, 291,
319, 408, 269, 277, 505, 32, 23, 24, 268, 11, 271, 98, 292, 418,
286, 290, 301, 560, 291, 319, 408, 269, 41, 32, 889, 268, 11, 271,
98, 292, 418, 301, 334, 653, 370, 291, 319, 408, 269, 52, 268, 11,
291, 280, 354, 280, 313, 269, 277, 268, 11, 291, 286, 325, 280, 334,
319, 418, 280, 610, 55, 522, 450, 74, 85, 88, 374, 820, 578, 780,
269, 51, 52, 53, 268, 11, 326, 505, 40, 41, 269, 277, 268, 11,
935, 748, 269, 51, 32, 268, 11, 748, 269, 39, 32, 23, 24, 268,
11, 328, 576, 733, 326, 748, 269, 41, 32, 277, 906, 268, 11, 328,
576, 733, 17, 740, 595, 421, 283, 569, 287, 748, 269, 52, 268, 11,
610, 269, 277, 889, 268, 11, 417, 292, 325, 301, 505, 576, 733, 269,
2]], device='cuda:0')

Is this behaviour normal for decoder?

attn_dropout and ff_dropout implemented?

They're mentioned it in the documentation, however they are not present in the source code.

No return_embeddings in ViTransformerWrapper

For image -> caption example in README

encoder = ViTransformerWrapper(
image_size = 256,
patch_size = 32,
attn_layers = Encoder(
dim = 512,
depth = 6,
heads = 8
)
)

decoder = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
cross_attend = True
)
)

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

encoded = encoder(img, return_embeddings = True)
decoder(caption, context = encoded) # (1, 1024, 20000)

There is no field "return_embeddings"

Rotary Embedding for different q, k sequence length

In the attention layer, only the rotary embedding for the x input (query) is calculated, which would leads to error when key's sequence length does not equal the seqlen of the input x.

x-transformers/x_transformers/x_transformers.py

Line 566 in 4b395ab

rotary_pos_emb = self.rotary_pos_emb(x)

rotary_pos_emb = self.rotary_pos_emb(x)

Transformer-XL recurrence different from how it is presented in the paper

The current Transformer-XL implementation uses attention length equal to the input segment length plus the memory length, while in the paper the attention length is presented as independent from the input length or the memory length. This behavior is unwanted since you can't benefit from the extended receptive field presented in figure 2. https://arxiv.org/pdf/1901.02860.pdf
A solution could be to use an attention mask providing a further parameter to the model that automatically generates the attention mask. A snippet of code of how it could be implemented:

if self.causal:
    i, j = dots.shape[-2:]
    r = torch.arange(i, device = device)
    distance = rearrange(r, 'j -> () () () j') - rearrange(r, 'i -> () () i ()')
    mask = distance > 0
    if self.att_len:
        mask_2 = distance < self.att_len
        mask = torch.logical_and(mask, mask_2)
        del mask_2
    mask = F.pad(mask, (j - i, 0), value = False)
    dots.masked_fill_(mask, mask_value)
    del mask

Any tips for speeding up generation?

Because of the autoregressive nature of Transformers, I know that they are fairly slow when generating new sequences from scratch, but I was wondering if you had any tips or tricks on how to do faster inference or to know if you had plans for maybe adding some of the tricks to avoid full computation, like the ones used by Huggingface https://huggingface.co/blog/accelerated-inference

Thank you very much for your amazing work!

Since version 12.4 the first example in the readme gives an error: KeyError: 'emb_dropout'

import torch
from x_transformers import XTransformer

model = XTransformer(
    dim = 512,
    enc_num_tokens = 256,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 1024,
    dec_num_tokens = 256,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 1024,
    tie_token_emb = True      # tie embeddings of encoder and decoder
)

src = torch.randint(0, 256, (1, 1024))
src_mask = torch.ones_like(src).bool()
tgt = torch.randint(0, 256, (1, 1024))
tgt_mask = torch.ones_like(tgt).bool()

loss = model(src, tgt, src_mask = src_mask, tgt_mask = tgt_mask) # (1, 1024, 512)
loss.backward()

Talking head postprocess seems to be wrong

Output of this line is not used anywhere: https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py#L264

My Experience with X-Transformers

I have run some models in the past weeks. All of them being encoder-decoder transformers.
I am not sure where is the right place to write stuff like this, but I'll write them here for now.

Word of Caution: My particular use case is not NLP. But its a corpus with around 200M Tokens and vocab_size of 1k

Transformers Without Tears
Researchers have shared with me this leads to faster convergence.

This did lead to faster convergence in the beginning, but performance was slightly worse. (Ran 2 Runs)

GLU Variants Improve Transformer

Took longer to converge and wasn't better (Ran 2 Runs)

Rezero Is All You Need

Didn't converged for me and became after a while NaN (Ran 2 Runs)

T5's Simplified Relative Positional Encoding

Converged quicker and was better, even when wrongly configured (used max_distance 128, instead of 512, which is my max_seq_len)
For Seq_len of 512, a bucket_size of 64, was better than default 32. (One Run each)

Talking-Heads Attention

Didn't noticed anything for my usecase ( 1 Run only)

Weights from hugging face bert does not return same result

Here is notebook
https://colab.research.google.com/drive/1Gn8SOLPVbMeFQ6voBd2fgUlaTqz3uVll?usp=sharing

I checked code from both libs and didn't found a difference

Feature request for Customizable ff module

So I want to experiment with different ff modules like convolution or even without it.
One way is adding ff param, another just allowing to import Attention module.

lucidrains / x-transformers Goto Github PK

x-transformers's People

Contributors

Stargazers

Watchers

Forkers

x-transformers's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs