This project forked from antoine77340/s3d_howto100m

S3D Text-Video model trained on HowTo100M using MIL-NCE

License: Apache License 2.0

s3d_howto100m's Introduction

PyTorch S3D Text-Video trained HowTo100M

This repo contains a PyTorch S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE [1]. If you use this model, we would appreciate it if you could cite [1] and [2] :).

The official TensorFlow Hub version of this model can be found here: https://tfhub.dev/deepmind/mil-nce/s3d/1 with a Colab on how to use it here: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/text_to_video_retrieval_with_s3d_milnce.ipynb

Getting the data

You will first need to download the model weights and the word dictionary.

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

How to use it

The following code explains how to instantiate S3D Text-Video with the pretrained weights and run inference on some examples.

import torch as th
from s3dg import S3D

# Instantiate the model
net = S3D('s3d_dict.npy', 512)

# Load the model weights
net.load_state_dict(th.load('s3d_howto100m.pth'))

# Video input should be of size Batch x 3 x T x H x W and normalized to [0, 1]
video = th.rand(2, 3, 32, 224, 224)

# Evaluation mode
net = net.eval()

# Inference only, so gradient tracking is disabled
with th.no_grad():
    # Video inference
    video_output = net(video)

    # Text inference
    text_output = net.text_module(['open door', 'cut tomato'])

NB: The video network is fully convolutional (with global average pooling in time and space at the end), so it accepts clips of varying length and resolution. However, we recommend using T=32 frames (the same as during training); T=16 frames also works reasonably well. For H and W, we have been using values from 200 to 256.
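As an illustration of the expected input layout, here is a minimal sketch of turning decoded frames into a clip tensor. It assumes the frames arrive as a uint8 tensor of shape T x H x W x 3 in RGB order; the helper name frames_to_clip is hypothetical and not part of this repo. Only the division by 255 is applied, matching the [0, 1] range mentioned above.

```python
import torch as th


def frames_to_clip(frames: th.Tensor) -> th.Tensor:
    """Convert uint8 frames (T, H, W, 3) into a float clip (1, 3, T, H, W) in [0, 1].

    Hypothetical helper: assumes RGB channel order and uint8 pixel values.
    """
    if frames.dtype != th.uint8 or frames.dim() != 4 or frames.shape[-1] != 3:
        raise ValueError("expected uint8 frames of shape (T, H, W, 3)")
    # Move channels first: (T, H, W, 3) -> (3, T, H, W), then scale to [0, 1]
    clip = frames.permute(3, 0, 1, 2).float() / 255.0
    # Add the batch dimension: (1, 3, T, H, W)
    return clip.unsqueeze(0)


# Stand-in for 32 decoded frames at 224 x 224 (random pixels, not a real video)
frames = th.randint(0, 256, (32, 224, 224, 3), dtype=th.uint8)
clip = frames_to_clip(frames)
```

The resulting clip can be passed to net(clip) in place of the random video tensor used above.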

video_output is a dictionary containing two keys:

  • video_embedding: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.
  • mixed_5c: This is the globally average-pooled feature from S3D, of dimension 1024. It should be used for classification on downstream tasks.
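As a rough sketch of how the mixed_5c feature might feed a downstream classifier, the snippet below puts a single linear layer on top of a 1024-dimensional feature. The number of classes and the random stand-in features are assumptions made for illustration; this is not the authors' evaluation protocol.

```python
import torch as th
import torch.nn as nn

num_classes = 10  # hypothetical number of downstream classes

# Linear classifier on top of the 1024-dimensional pooled feature
classifier = nn.Linear(1024, num_classes)

# Stand-in for video_output['mixed_5c'] with a batch of 2 clips
features = th.rand(2, 1024)

# One score per class for each clip
logits = classifier(features)
predicted_class = logits.argmax(dim=1)
```

In practice the classifier would be trained on labeled clips while the S3D features are either kept frozen or fine-tuned.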

text_output is also a dictionary with a single key:

  • text_embedding: This is the text embedding (size 512) from the joint text-video space. The similarity score between a text and a video is the dot product between text_embedding and video_embedding.

Computing all the pairwise video-text similarities:

The similarity scores can be computed with a dot product between the text_embedding and the video_embedding.

video_embedding = video_output['video_embedding']
text_embedding = text_output['text_embedding']
# We compute all the pairwise similarity scores between video and text.
similarity_matrix = th.matmul(text_embedding, video_embedding.t())
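One possible use of the similarity matrix is text-to-video retrieval: for each text query, sort the videos by score. The sketch below uses random stand-in embeddings (2 texts, 5 videos) so that it runs without the model; with the real outputs, text_embedding and video_embedding would come from the dictionaries described above.

```python
import torch as th

# Stand-in embeddings; in practice these come from text_output and video_output
text_embedding = th.randn(2, 512)
video_embedding = th.randn(5, 512)

# Rows index text queries, columns index videos: shape (num_texts, num_videos)
similarity_matrix = th.matmul(text_embedding, video_embedding.t())

# For each text query, video indices ordered from most to least similar
ranking = similarity_matrix.argsort(dim=1, descending=True)

# Index of the best-matching video for each text query
best_video = ranking[:, 0]
```

Transposing similarity_matrix before sorting gives the reverse direction, ranking the text queries for each video.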

References

[1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman, End-to-End Learning of Visual Representations from Uncurated Instructional Videos. https://arxiv.org/abs/1912.06430

[2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. https://arxiv.org/abs/1906.03327

BibTeX:

@inproceedings{miech19howto100m,
   title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
   author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
   booktitle={ICCV},
   year={2019},
}

@inproceedings{miech19endtoend,
   title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
   author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
   booktitle={CVPR},
   year={2020},
}

Acknowledgements

We would like to thank Yana Hasson for the help provided in the non-trivial porting of the original TensorFlow weights to PyTorch.
