
s3d_howto100m's Introduction

PyTorch S3D Text-Video model trained on HowTo100M

This repo contains a PyTorch S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE [1]. If you use this model, we would appreciate it if you could cite [1] and [2] :).

The official TensorFlow Hub version of this model can be found here: https://tfhub.dev/deepmind/mil-nce/s3d/1, with a Colab showing how to use it here: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/text_to_video_retrieval_with_s3d_milnce.ipynb

Getting the data

You will first need to download the model weights and the word dictionary.

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

How to use it?

The following code explains how to instantiate S3D Text-Video with the pretrained weights and run inference on some examples.

import torch as th
from s3dg import S3D

# Instantiate the model
net = S3D('s3d_dict.npy', 512)

# Load the model weights
net.load_state_dict(th.load('s3d_howto100m.pth'))

# Video input should be of size Batch x 3 x T x H x W and normalized to [0, 1] 
video = th.rand(2, 3, 32, 224, 224)

# Evaluation mode
net = net.eval()
 
# Video inference
video_output = net(video)

# Text inference
text_output = net.text_module(['open door', 'cut tomato'])

NB: The video network is fully convolutional (with global average pooling in time and space at the end). However, we recommend using T=32 frames (the same as during training); T=16 frames also works OK. For H and W we have been using values from 200 to 256.
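For reference, here is a minimal sketch of how one might turn decoded frames into this input format. The helper name, the dummy frame source, and the direct square resize are illustrative only (in practice you would typically resize the short side and take a center crop):

import torch as th
import torch.nn.functional as F

def frames_to_clip(frames_uint8, num_frames=32, size=224):
    # frames_uint8: T x H x W x 3 uint8 tensor of decoded RGB frames (illustrative input)
    total = frames_uint8.shape[0]
    # Uniformly sample num_frames frame indices across the clip
    idx = th.linspace(0, total - 1, num_frames).long()
    clip = frames_uint8[idx].float() / 255.0           # scale to [0, 1]
    clip = clip.permute(0, 3, 1, 2)                    # num_frames x 3 x H x W
    clip = F.interpolate(clip, size=(size, size), mode='bilinear', align_corners=False)
    return clip.permute(1, 0, 2, 3).unsqueeze(0)       # 1 x 3 x num_frames x size x size

# Random frames standing in for a decoded video (80 frames of 360x640 RGB)
dummy_frames = th.randint(0, 256, (80, 360, 640, 3), dtype=th.uint8)
my_clip = frames_to_clip(dummy_frames)                 # 1 x 3 x 32 x 224 x 224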

video_output is a dictionary containing two keys:

  • video_embedding: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.
  • mixed_5c: This is the globally average-pooled feature from S3D, of dimension 1024. It should be used for classification on downstream tasks (a small sketch of this is shown after these lists).

text_output is also a dictionary with a single key:

  • text_embedding: It is the text embedding (size 512) from the joint text-video space. To compute the similarity score between text and video, you would compute the dot product between text_embedding and video_embedding.
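As an illustration of how mixed_5c could feed a downstream task, one could attach a small classification head on top of the 1024-d feature. The head below is purely hypothetical and reuses net and video from the snippet above:

import torch as th
import torch.nn as nn

# Hypothetical linear head for a downstream task with 10 classes
classifier = nn.Linear(1024, 10)

with th.no_grad():
    feats = net(video)['mixed_5c']    # B x 1024 pooled S3D features
logits = classifier(feats)            # B x 10 class scores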

Computing all the pairwise video-text similarities:

The similarity scores can be computed with a dot product between the text_embedding and the video_embedding.

video_embedding = video_output['video_embedding']
text_embedding = text_output['text_embedding']
# We compute all the pairwise similarity scores between video and text.
similarity_matrix = th.matmul(text_embedding, video_embedding.t())
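To turn these scores into a retrieval result, one could for instance rank the videos for each text query by decreasing similarity:

# Rank videos for each text query by decreasing similarity
ranked_videos = similarity_matrix.argsort(dim=1, descending=True)
top1_video_per_text = ranked_videos[:, 0]   # index of the best-matching video per query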

References

[1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman, End-to-End Learning of Visual Representations from Uncurated Instructional Videos. https://arxiv.org/abs/1912.06430

[2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. https://arxiv.org/abs/1906.03327

Bibtex:

@inproceedings{miech19howto100m,
   title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
   author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
   booktitle={ICCV},
   year={2019},
}

@inproceedings{miech19endtoend,
   title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
   author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
   booktitle={CVPR},
   year={2020},
}

Acknowledgements

We would like to thank Yana Hasson for the help provided in the non trivial porting of the original Tensorflow weights to PyTorch.

s3d_howto100m's People

Contributors

antoine77340


s3d_howto100m's Issues

Weights link is no longer working

Hi, it seems the link to download the weights is down.

Is there a way to upload the weights to a new link or make the old one work again?

Linking here from the code for our paper, as it relies on the S3D pre-trained weights :') sumedh7/RoboCLIP#4

How to normalize the images?

The images should be normalized to [0, 1], but what about the mean and std? For example, the normalization for PyTorch pre-trained models is: normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).
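For what it's worth, the README above only mentions scaling to [0, 1]; a minimal sketch of just that scaling is shown below (whether an additional mean/std normalization is needed is exactly the question here):

import torch as th

# frames: B x 3 x T x H x W uint8 frames in [0, 255]
frames = th.randint(0, 256, (1, 3, 32, 224, 224), dtype=th.uint8)
video = frames.float() / 255.0   # scaled to [0, 1] as described in the README, no mean/std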

Fine-tune the model

Hi Antoine,

If I want to fine-tune this model, what is the format of the training examples? Do you have sample code for fine-tuning? Thanks!

Best,
Yue

s3d pretrained model on kinetics

Hi, Antoine,

thanks for open-sourcing such great work! The pretrained weights are very useful. Recently, I have been trying to compare your weights pretrained on HowTo100M with weights pretrained on Kinetics. However, I failed to find the latter in your repos or anywhere else. I am wondering whether you have Kinetics-pretrained weights on your side? Thanks!

best,

Can't reproduce results for YouCookII

I took this model

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

and the code from this repository. I took the validation split of YouCookII and tried to reproduce the numbers reported in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos.


It is unclear which protocol you used for testing. In the table below I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would also be great if you could publish the testing script.

What I tried:

  • T is the time in seconds. I split each clip into subclips of length T seconds and compute an embedding for each subclip.
  • pooling: if a clip was split into more than one subclip, the subclip embeddings are aggregated with this pooling (a sketch of this step is shown after the table below).
  • imgsz: the short side of each source video is rescaled to imgsz, preserving the h:w ratio, then a center crop is taken from each frame.
  • normalize: whether or not the sentence embedding and each video embedding were L2-normalized before the dot product.
  • num frames: from each T-second clip, num frames frames are sampled uniformly.
  • num resample: for each clip, sample num resample different sets of frames and compute an embedding for each resample; with pooling, all embeddings are pooled into a single one. LCR means sampling from each clip 3 times: num frames left crops, num frames right crops, num frames center crops.
T   | imgsz | pooling | normalize | num frames | num resample | R@1    | R@5    | R@10   | MedR
250 | 200   | max     | False     | 32         | 1            | 11.478 | 27.610 | 37.453 | 21
250 | 224   | max     | False     | 32         | 1            | 8.774  | 22.044 | 30.975 | 32
250 | 256   | max     | False     | 32         | 1            | 5.912  | 15.503 | 21.038 | 104
1.5 | 200   | max     | False     | 32         | 1            | 8.333  | 23.208 | 31.981 | 31
3.2 | 200   | max     | False     | 32         | 1            | 9.497  | 24.969 | 34.654 | 24
8   | 200   | max     | False     | 32         | 1            | 10.094 | 25.818 | 35.849 | 23
16  | 200   | max     | False     | 32         | 1            | 10.755 | 26.478 | 36.541 | 21
32  | 200   | max     | False     | 32         | 1            | 11.164 | 27.484 | 37.296 | 21
64  | 200   | max     | False     | 32         | 1            | 11.415 | 27.704 | 37.547 | 21
128 | 200   | max     | False     | 32         | 1            | 11.447 | 27.610 | 37.453 | 21
250 | 200   | max     | True      | 32         | 1            | 9.906  | 25.031 | 34.748 | 25
250 | 200   | max     | False     | 32         | 2            | 11.604 | 28.270 | 37.987 | 20
250 | 200   | max     | False     | 32         | 3            | 11.918 | 28.396 | 38.333 | 21
250 | 200   | max     | False     | 32         | 4            | 11.509 | 28.082 | 38.365 | 21
250 | 200   | max     | False     | 32         | LCR          | 11.384 | 27.138 | 37.704 | 22
250 | 200   | mean    | False     | 32         | 4            | 12.075 | 28.805 | 38.459 | 20
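For clarity, here is a minimal sketch of the subclip-and-pool step described in the list above; the helper name and interface are illustrative, not from this repository:

import torch as th

def pool_subclip_embeddings(net, subclips, pooling='max'):
    # subclips: list of 1 x 3 x T x H x W tensors covering one clip
    with th.no_grad():
        embs = th.cat([net(sc)['video_embedding'] for sc in subclips], dim=0)
    if pooling == 'max':
        return embs.max(dim=0).values    # 512-d pooled clip embedding
    return embs.mean(dim=0)              # 'mean' pooling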

Video configuration for best performance under limited computing resource

Hi Antoine,

First of all, great work! The code is extremely friendly to use; thank you for your efforts.

I'm trying to use your model as the first step of my own project to extract good features for both video and language. It would be great if you could advise on some doubts I have.

If I understand correctly, the model performs best on video clips with "FPS = 10, 32 frames (3.2 sec)". Due to my limited computing resources (basically GPU memory), I'd like to downscale this config. What rules should I stick to in this situation? Should I keep the clip length at 3.2 sec, hence use something like "FPS = 2.5, 8 frames (3.2 sec)", or should I keep the FPS at 10, hence use something like "FPS = 10, 8 frames (0.8 sec)"?

Second, to what extent do you recommend fine-tuning the parameters of your pretrained MIL-NCE model? I think it is safe to assume that fine-tuning will always help on downstream tasks, but I have little sense of how much it could help in our case. Maybe you could also advise on this?

Thank you in advance.

Question in Sentence Embedding

I have a question about the Sentence_Embedding model forward implementation.

Why is torch.max applied after the first fully connected layer? Is this better than doing it earlier, i.e., averaging all the word embeddings of a sentence before the FC layers?

Thanks for the clarification
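For reference, here is a minimal sketch of the two orderings being compared; the layer sizes and names are illustrative, not the repository's actual Sentence_Embedding code:

import torch as th
import torch.nn as nn

word_emb = th.rand(2, 16, 300)    # batch x words x word-embedding dim (illustrative sizes)
fc = nn.Linear(300, 512)

# Ordering asked about: FC applied per word, then max over the word dimension
out_max_after_fc = fc(word_emb).max(dim=1).values    # batch x 512

# Alternative raised in the question: average the word embeddings first, then FC
out_avg_before_fc = fc(word_emb.mean(dim=1))         # batch x 512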

Video preprocessing steps?

Hi @antoine77340,

Thanks for sharing this codebase. I'd like to evaluate your pretrained model on several custom videos, but I don't see any code/instructions on how to preprocess the videos for inference. Could you share some insight?

Thanks,

pre-extracted video embedding from joint space

Hi Antoine,

I am impressed by your excellent work which is very helpful to my research!

I would like to know if you have extracted the joint-space features (512-d) for all clips in HowTo100M that I could directly download.

I have already downloaded the S3D features for all clips; is there any way to convert these feature vectors to the joint space?

Thank you very much!

Yue
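For context, since the README above describes video_embedding as a 512-d projection of the 1024-d mixed_5c feature, one possible way to map pre-extracted 1024-d features into the joint space is to reuse the model's final linear layer. The attribute name net.fc below is an assumption about s3dg.py and may differ:

import torch as th

# Pre-extracted 1024-d S3D (mixed_5c) features for N clips
mixed_5c_feats = th.rand(100, 1024)

with th.no_grad():
    # Assumption: the 1024 -> 512 joint-space projection is the model's final
    # linear layer, exposed here as net.fc; check s3dg.py for the actual name.
    joint_feats = net.fc(mixed_5c_feats)   # N x 512 embeddings in the joint space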

finetune results on UCF101

Hi, I am having some trouble reproducing the fine-tuning results on UCF101. I can only get 88.6 while the reported result is 91.3. Could you share your fine-tuning script or procedure/hyper-params? Thanks!

Dramatically accuracy drop with JPG compression

I tested your model on YouCookII with this protocol (4x32 contiguous frames at 10 FPS). I extracted images from the videos in two ways.

  1. ffmpeg -y -i <INPUT.mp4> -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg
  2. ffmpeg -y -i <INPUT.mp4> -qscale:v 2 -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg

The first one compresses the output JPGs (default quality); the second one saves JPGs with the best quality.

Below are example frames for 1, 2 and -qscale:v 31 (poorest quality). Please ignore the H/W ratio; in testing I used the correct H/W ratio.

[example frame images]

The difference between 1 and 2 is small.

source                            | R@1    | R@5    | R@10   | MedR
results in article                | 15.1   | 38     | 51.2   | 10
my retest, ffmpeg best quality    | 15.975 | 38.208 | 50.126 | 10
my retest, ffmpeg default quality | 10.629 | 27.201 | 7.925  | 20

Note: some videos from YouCookII are unavailable today, so I tested only on available videos.

Despite the small visual difference between 1 and 2, the difference in test results is significant. It may be because some of the intersection between YouCookII and HowTo100M wasn't filtered out, and the network memorized some videos from this intersection.

My question is: are you sure that the intersection between YouCookII and HowTo100M was completely removed from the training dataset? Could you post in this thread the YouTube video ids that were used for training (or those that were thrown away)? I want to double-check the intersection.
