Implementation of a Multi-Head Attention Layer in C++ from scratch
Multi-head attention was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017).
First, scaled dot-product attention was implemented, computing softmax(QKᵀ / √d_k) · V for query, key, and value matrices Q, K, and V.
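A minimal sketch of how this can look in plain C++; the Matrix alias and the matmul, transpose, and softmax_rows helpers are assumptions of this sketch, not necessarily the repository's actual types or API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal row-major matrix type used throughout these sketches.
using Matrix = std::vector<std::vector<float>>;

// Naive matrix product C = A * B (A: n x k, B: k x m).
Matrix matmul(const Matrix& a, const Matrix& b) {
    std::size_t n = a.size(), k = b.size(), m = b[0].size();
    Matrix c(n, std::vector<float>(m, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < m; ++j)
                c[i][j] += a[i][p] * b[p][j];
    return c;
}

Matrix transpose(const Matrix& a) {
    Matrix t(a[0].size(), std::vector<float>(a.size()));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[0].size(); ++j)
            t[j][i] = a[i][j];
    return t;
}

// Row-wise softmax; subtracting the row max avoids overflow in exp().
void softmax_rows(Matrix& a) {
    for (auto& row : a) {
        float mx = *std::max_element(row.begin(), row.end());
        float sum = 0.0f;
        for (float& v : row) { v = std::exp(v - mx); sum += v; }
        for (float& v : row) v /= sum;
    }
}

// Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Matrix scaled_dot_product_attention(const Matrix& q, const Matrix& k,
                                    const Matrix& v) {
    Matrix scores = matmul(q, transpose(k));
    const float scale = 1.0f / std::sqrt(static_cast<float>(k[0].size()));
    for (auto& row : scores)
        for (float& s : row) s *= scale;
    softmax_rows(scores);
    return matmul(scores, v);
}
```

The 1/√d_k scaling keeps the dot products from saturating the softmax, and subtracting the row maximum before exponentiating is the usual trick for numerical stability.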
Building on this attention head, the multi-head attention layer was implemented: the projected queries, keys, and values are split into per-head slices, each head attends over its own slice, and the head outputs are concatenated and passed through a final output projection.
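A sketch of one common way to structure this layer, reusing the Matrix alias, matmul, and scaled_dot_product_attention from the sketch above. The member names (w_q, w_k, w_v, w_o), the column-slicing head split, and the omission of weight initialization are all assumptions of this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reuses Matrix, matmul, and scaled_dot_product_attention from the sketch above.

// Copies columns [begin, end) of a matrix: one head's slice of a projection.
Matrix slice_cols(const Matrix& a, std::size_t begin, std::size_t end) {
    Matrix s(a.size(), std::vector<float>(end - begin));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = begin; j < end; ++j)
            s[i][j - begin] = a[i][j];
    return s;
}

struct MultiHeadAttention {
    std::size_t num_heads, d_model, d_head;
    // Learned projections (each d_model x d_model); training/initialization
    // is out of scope here, so these must be filled in before use.
    Matrix w_q, w_k, w_v, w_o;

    MultiHeadAttention(std::size_t heads, std::size_t model_dim)
        : num_heads(heads), d_model(model_dim), d_head(model_dim / heads) {
        assert(model_dim % heads == 0);
    }

    // Self-attention over x (seq_len x d_model) -> (seq_len x d_model).
    Matrix forward(const Matrix& x) const {
        Matrix q = matmul(x, w_q), k = matmul(x, w_k), v = matmul(x, w_v);
        Matrix concat(x.size(), std::vector<float>(d_model));
        for (std::size_t h = 0; h < num_heads; ++h) {
            std::size_t b = h * d_head, e = b + d_head;
            Matrix head = scaled_dot_product_attention(
                slice_cols(q, b, e), slice_cols(k, b, e), slice_cols(v, b, e));
            // Write this head's output into its column range (concatenation).
            for (std::size_t i = 0; i < head.size(); ++i)
                for (std::size_t j = 0; j < d_head; ++j)
                    concat[i][b + j] = head[i][j];
        }
        return matmul(concat, w_o);  // final output projection
    }
};
```

Storing one d_model × d_model matrix per projection and slicing its output into heads is equivalent to the per-head projection matrices in the paper, just concatenated into a single matrix.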
A classic linear (fully connected) layer, computing y = xW + b, was implemented as well.
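A sketch under the same assumptions, reusing the Matrix alias and matmul from above; how the repository stores weights and applies the bias is not shown, so the layout here is illustrative:

```cpp
#include <cstddef>
#include <vector>

// Reuses the Matrix alias and matmul from the sketches above.

// Fully connected layer: y = x * W + b, applied row by row.
struct Linear {
    Matrix weight;             // (in_features x out_features)
    std::vector<float> bias;   // (out_features)

    Matrix forward(const Matrix& x) const {
        Matrix y = matmul(x, weight);
        for (auto& row : y)
            for (std::size_t j = 0; j < row.size(); ++j)
                row[j] += bias[j];
        return y;
    }
};
```

In a transformer, a layer like this typically provides the Q/K/V and output projections as well as the feed-forward sublayer.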