We use encoder-decoder format with CNN + BiDirectional LSTM. Deep layers are stacked or have residual connections.
This is a partial implementation of the paper by Wang et al. published in 2018 in ACM Transactions on Multimedia Computing, Communications, and Applications.
Note: This is a work in progress. I will explain the details in a separate blog post as well. This directory is public as the work is in progress. kindly offer your valuable suggestions, if possible.