The Transformer model implemented from scratch in PyTorch. The model shares weights between the embedding layers and the pre-softmax linear layer. Training on the Multi30k machine translation task is demonstrated.
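For context, the weight sharing mentioned in the description presumably follows the usual tying pattern, something like this minimal sketch (the sizes and variable names here are illustrative, not taken from the repo):

```python
import torch.nn as nn

vocab_size, d_model = 10000, 512  # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)

# Weight tying: the pre-softmax projection reuses the embedding matrix.
# nn.Linear stores its weight as (out_features, in_features) =
# (vocab_size, d_model), the same shape as the embedding table, so the
# two modules can share one parameter tensor.
generator.weight = embedding.weight
```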
Looking at the embedding scaling logic: you first scale the word embedding weights by 1/scale during initialisation, and then you scale the word embeddings by scale after lookup. Is this intentional? Don't the two operations cancel each other out?
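For concreteness, here is a minimal sketch of the pattern I mean (the class and variable names are my own, and I'm assuming scale = sqrt(d_model) as in the original Transformer paper, not quoting your code):

```python
import math
import torch
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Illustrative embedding module with the two scaling steps in question."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Step 1: shrink the weights by 1/sqrt(d_model) at initialisation.
        with torch.no_grad():
            self.embedding.weight.mul_(1.0 / math.sqrt(d_model))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Step 2: scale the looked-up embeddings back up by sqrt(d_model).
        return self.embedding(token_ids) * math.sqrt(self.d_model)
```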