pi-tau / transformer

The Transformer model implemented from scratch using PyTorch. The model uses weight sharing between the embedding layers and the pre-softmax linear layer. Training on the Multi30k machine translation task is shown.

deep-learning multi-head-attention pytorch transformer machine-translation multi30k shared-embedding

transformer's Introduction

TRANSFORMER

The transformer model was first introduced in the paper:

  • "Attention is all you need" by Vaswani et. al., (here)

For a thorough, in-depth discussion of the implementation of the Transformer, you can check out the blog post that I wrote about it.

The transformer is used for modelling sequence-to-sequence tasks (like machine translation), where the input sequence is encoded using an encoder and then the output sequence is produced using a decoder.

TOKEN EMBEDDING

Before feeding the sequence to the transformer we have to pass the tokens through a standard embedding layer. We also need to encode the order of the sequence, since order information is not built-in. Thus, we use a position embedding layer that maps each sequence index to a vector. The word embedding and the position embedding are added together, and dropout is applied to produce the final token embedding.

$$\begin{aligned} TokenEmbed(elem_i): \\ & emb_i = WordEmbed(elem_i) + PosEmbed(i) \\ & x_i = Dropout(emb_i) \end{aligned}$$

"Embedding"

ATTENTION

The query, key and value linear layers ($Q$, $K$, $V$) are used to encode the input. Dot-products between the query and key vectors produce the attention scores, which are used to perform a weighted summation of the value vectors. Dropout is applied to the attention weights directly before the summation.

"Attention"

Each self-attention block has several sets of $Q$, $K$ and $V$ layers. Each set is called an attention head.
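
A minimal multi-head attention sketch, assuming the projections, scaling, and dropout-on-weights described above (parameter names are assumptions, not the repo's actual API):

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Several attention heads computed in parallel from one set of Q, K, V projections."""

    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)   # Q
        self.k_proj = nn.Linear(d_model, d_model)   # K
        self.v_proj = nn.Linear(d_model, d_model)   # V
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, mask=None):
        B, T, _ = queries.shape
        S = keys.shape[1]
        # Project and split into heads: (batch, heads, seq_len, d_head).
        q = self.q_proj(queries).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(keys).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(values).view(B, S, self.n_heads, self.d_head).transpose(1, 2)

        # Scaled dot-product attention scores: (batch, heads, T, S).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # The mask must broadcast against the score tensor; zero entries are masked out.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))   # dropout on the attention weights

        out = attn @ v                                   # weighted sum of the value vectors
        out = out.transpose(1, 2).reshape(B, T, -1)      # merge the heads back together
        return self.out_proj(out)
```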

ENCODER

The encoder consists of $N$ identical blocks applied one after another. Each encoder block has two sub-layers: a self-attention layer followed by a position-wise fully-connected network. The block also incorporates layer normalization before each sub-layer and dropout after each sub-layer. Finally, a residual connection is applied around both the self-attention and the fully-connected layers.

$$\begin{aligned} Encoder(x): \\ & z = x + Dropout(Attn(Norm(x))) \\ & r = z + Dropout(FFCN(Norm(z))) \end{aligned}$$

"Encoder"

DECODER

The decoder is also composed of $N$ identical blocks. The decoder block is very similar to the encoder block, but with two differences:

  1. The decoder uses masked self-attention, meaning that current sequence elements cannot attend to future elements.
  2. In addition to the two sub-layers, the decoder uses a third sub-layer, which performs cross-attention between the decoded sequence and the outputs of the encoder.

$$\begin{aligned} Decoder(x): \\ & z = x + Dropout(MaskAttn(Norm(x))) \\ & c = z + Dropout(CrossAttn(Norm(z))) \\ & r = c + Dropout(FFCN(Norm(c))) \end{aligned}$$

"Decoder"

TRANSFORMER

The full transformer model is constructed by feeding the outputs of the final encoder block into the cross-attention layer of every decoder block. Finally, the outputs of the final decoder block are passed through a linear layer to produce scores over the target vocabulary.

"Transformer"

transformer's People

Contributors

pi-tau, ptashev-cinemo


transformer's Issues

Embedding scaling

Thank you for the excellent blog post and code.

Looking at the embedding scaling logic, it looks like you are first scaling the word embedding weights by 1/scale during initialisation and then you are scaling the word embeddings after lookup by scale. Is this intentional? Don't they cancel each other?
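
For reference, the pattern being described looks roughly like this (a sketch of the logic in the question, not the repo's exact code; `scale` is assumed to be the square root of the model dimension):

```python
import math
import torch
import torch.nn as nn

d_model = 512
scale = math.sqrt(d_model)

embed = nn.Embedding(10_000, d_model)
# 1. The embedding weights are initialised scaled down by 1/scale ...
nn.init.normal_(embed.weight, mean=0.0, std=1 / scale)

# 2. ... and the looked-up embeddings are scaled back up by scale.
tokens = torch.tensor([[1, 2, 3]])
word_emb = embed(tokens) * scale
```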
