
zzwei1 / timesformer-rolled-attention


This project forked from yiyixuxu/timesformer-rolled-attention


Visualizing the learned space-time attention using Attention Rollout

Python 0.39% Jupyter Notebook 99.61%

timesformer-rolled-attention's Introduction

Visualizing the learned space-time attention

This repository contains an implementation of Attention Rollout for the TimeSformer model.

Attention Rollout was introduced in the paper Quantifying Attention Flow in Transformers. It is a method that uses attention weights to understand how a self-attention network works, and it provides valuable insight into which parts of the input matter most when generating the output.

It assumes that the attention weights determine the proportion of incoming information that can propagate through each layer, so we can use them as an approximation of how information flows between layers. If A is the 2-D attention weight matrix at layer l, A[i,j] represents the attention of token i at layer l to token j from layer l-1. To compute the attention to the input tokens, we recursively multiply the attention weight matrices, starting from the input layer up to layer l.
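The recursive multiplication described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: it assumes head-averaged per-layer attention matrices, and follows the Attention Rollout paper's suggestion of mixing in an identity matrix to account for residual connections.

```python
import numpy as np

def attention_rollout(attentions):
    """Recursively multiply per-layer attention matrices.

    attentions: list of (num_tokens, num_tokens) row-stochastic arrays,
    one per layer (averaged over heads).
    """
    rollout = np.eye(attentions[0].shape[0])
    for A in attentions:
        A = 0.5 * A + 0.5 * np.eye(A.shape[0])  # account for the residual connection
        A = A / A.sum(axis=-1, keepdims=True)   # re-normalize rows to sum to 1
        rollout = A @ rollout                   # propagate attention down to the input
    return rollout
```

Each row of the result approximates how much each token at the top layer attends to each input token, aggregated over all paths through the network.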

Implementing Attention Rollout for TimeSformer

For divided space-time attention, each token has two dimensions; let's denote a token as z(p,t), where p is the spatial dimension and t is the time dimension.

Each encoding block contains a time attention layer and a space attention layer. During time attention, each patch token only attends to patches at the same spatial location; during space attention, each patch only attends to patches from the same frame. If we use T and S to denote the time and space attention weights respectively, T[p,j,q] represents the attention of z(p,j) to z(p,q) from the previous layer during time attention, and S[i,j,p] represents the space attention of z(i,j) to z(p,j) from the time attention layer.

When we combine the space and time attention, each patch token attends to patches at every spatial location in every frame (with the exception of the cls_token, which we discuss later) through a unique path. The attention path from z(i,j) to z(p,q) (where p != 0) is

  • space attention: z(i,j) -> z(p,j)
  • time attention: z(p,j) -> z(p,q)

We can calculate the combined space-time attention W as

W[i,j,p,q] = S[i,j,p] * T[p,j,q]
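This product can be computed in one step with an einsum. The sketch below uses illustrative shapes (P patch locations, F frames) and random row-stochastic weights standing in for real attention matrices; since each path multiplies two normalized distributions, the combined attention of each token over all (p, q) pairs still sums to 1.

```python
import numpy as np

P, F = 4, 3  # illustrative: P spatial locations, F frames

# S[i, j, p]: space attention of z(i,j) to z(p,j)
S = np.random.rand(P, F, P)
S /= S.sum(-1, keepdims=True)

# T[p, j, q]: time attention of z(p,j) to z(p,q)
T = np.random.rand(P, F, F)
T /= T.sum(-1, keepdims=True)

# W[i, j, p, q] = S[i, j, p] * T[p, j, q]
W = np.einsum('ijp,pjq->ijpq', S, T)

# Each token's combined attention over all (p, q) pairs sums to 1.
print(np.allclose(W.reshape(P, F, -1).sum(-1), 1.0))  # True
```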

Note that the classification token does not participate in the time attention layer: it is removed from the input before the time attention layer and added back before the space attention layer. This means it only attends to itself during the time attention computation, which we account for with an identity matrix. Since the classification token does not participate in time attention, the other tokens can only attend to the classification token from their own frame. To address this limitation, the TimeSformer implementation averages the cls_token output across all frames at the end of each space-time attention block, so that it can carry information from other frames. Accordingly, we also average its attention to all input tokens when computing the combined space-time attention.
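The cls_token averaging step can be sketched as follows. The shapes and random weights here are purely illustrative (they are assumptions, not the repository's actual variables): one cls attention row per frame from the space layer, averaged over frames so the result matches how TimeSformer averages the cls_token output itself.

```python
import numpy as np

F, P = 3, 4      # illustrative: frames, patch locations per frame
N = 1 + P * F    # total tokens including the cls_token

# Hypothetical cls_token attention to all input tokens, one row per
# frame, each a normalized distribution from the space attention layer.
cls_attn_per_frame = np.random.rand(F, N)
cls_attn_per_frame /= cls_attn_per_frame.sum(-1, keepdims=True)

# Because the cls_token output is averaged across frames at the end of
# each block, we average its attention over frames when combining.
cls_attn = cls_attn_per_frame.mean(axis=0)
print(np.allclose(cls_attn.sum(), 1.0))  # True
```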

Usage

Here is a notebook demonstrating how to use Attention Rollout to visualize the space-time attention learned by TimeSformer:

A Colab notebook: Visualizing learned space time attention with attention rollout

Visualizing the learned space time attention

This is the example used in the TimeSformer paper to demonstrate that the model can learn to attend to the relevant regions in the video in order to perform complex spatiotemporal reasoning. We can see that the model focuses on the configuration of the hand when it is visible, and on the object only when it is not.

