This notebook contains a simple implementation of a Mixture-of-Experts Transformer.
It is closest to a Switch Transformer, resembling it in the following respects:
- It routes each token to a single expert, as Switch Transformer does
- It uses Switch Transformer's auxiliary loss to balance load across the experts
- It initializes each expert's weights at a reduced scale, per that paper's advice
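The first two points can be sketched together: pick the top-1 expert per token, and penalize imbalance with the Switch-style auxiliary loss (number of experts times the dot product of each expert's routed-token fraction and mean router probability). This is a minimal numpy sketch, not the notebook's actual code; the function name `switch_route` is my own.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def switch_route(x, router_weight):
    """Top-1 (Switch-style) routing sketch.

    x: (tokens, d_model); router_weight: (d_model, n_experts).
    Returns each token's chosen expert, the gate value used to scale
    that expert's output, and the load-balancing auxiliary loss.
    """
    probs = softmax(x @ router_weight)            # (tokens, n_experts)
    expert = probs.argmax(axis=-1)                # one expert per token
    gate = probs[np.arange(len(probs)), expert]   # router prob of the chosen expert
    n_experts = probs.shape[-1]
    # f: fraction of tokens routed to each expert; p: mean router probability.
    f = np.bincount(expert, minlength=n_experts) / len(expert)
    p = probs.mean(axis=0)
    aux_loss = n_experts * (f * p).sum()          # minimized by uniform routing
    return expert, gate, aux_loss
```

In training, `aux_loss` would be added to the language-modeling loss with a small coefficient, and each token's expert output would be multiplied by its `gate` so the router receives gradient.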
It also differs from it in many important ways:
- Per-character encoding, rather than systematic tokenization
- Switch Transformer, obviously, works on multiple GPUs -- this does not
- This implementation does causal language modeling, while Switch Transformer trained with a masked language modeling objective
- I use Hard-Alibi position encoding, which I prefer because (1) toy experiments show it generalizing better, and (2) it is drop-dead simple, which made it work better than the alternatives in my first experiments
- No dropout
- Probably tons of stuff that I missed
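For the position encoding, here is a minimal sketch of what I take "Hard-Alibi" to mean: ALiBi's linear distance penalty on attention logits replaced by a hard cutoff, so each head attends only within a fixed backward window (the `window` parameter and the function name are my own, assumed for illustration).

```python
import numpy as np

def hard_alibi_bias(seq_len, window):
    """Additive attention bias for one head (Hard-Alibi sketch).

    Positions within `window` steps back (including the current
    position) get zero bias; everything else, including future
    positions, is masked with -inf. The causal mask is thus built in.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    dist = i - j                      # how far back each key is
    bias = np.where((dist >= 0) & (dist < window), 0.0, -np.inf)
    return bias  # add to attention logits before softmax
```

Different heads would get different window sizes, playing the role that ALiBi's per-head slopes play.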
The point is that it should be a simple-enough MoE implementation that you can just upload some data, train on it, and get lower loss with an MoE than with a dense Transformer.
A place to start learning, at least.