
s3d_howto100m's Introduction

PyTorch S3D Text-Video model trained on HowTo100M

This repo contains a PyTorch S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE [1]. If you use this model, we would appreciate it if you could cite [1] and [2] :).

The official TensorFlow Hub version of this model can be found here: https://tfhub.dev/deepmind/mil-nce/s3d/1, with a Colab showing how to use it here: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/text_to_video_retrieval_with_s3d_milnce.ipynb

Getting the data

You will first need to download the model weights and the word dictionary.

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

How to use it?

The following code explains how to instantiate S3D Text-Video with the pretrained weights and run inference on some examples.

import torch as th
from s3dg import S3D

# Instantiate the model
net = S3D('s3d_dict.npy', 512)

# Load the model weights
net.load_state_dict(th.load('s3d_howto100m.pth'))

# Video input should be of size Batch x 3 x T x H x W and normalized to [0, 1] 
video = th.rand(2, 3, 32, 224, 224)

# Evaluation mode
net = net.eval()
 
# Video inference
video_output = net(video)

# Text inference
text_output = net.text_module(['open door', 'cut tomato'])

NB: The video network is fully convolutional (with global average pooling in time and space at the end). However, we recommend using T=32 frames (the same as during training); T=16 frames also works OK. For H and W we have been using values from 200 to 256.
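For reference, here is a minimal sketch of how one might turn decoded frames into this input format. The helper name, the dummy frame source, and the direct square resize are illustrative only (in practice you would typically resize the short side and take a center crop):

import torch as th
import torch.nn.functional as F

def frames_to_clip(frames_uint8, num_frames=32, size=224):
    # frames_uint8: T x H x W x 3 uint8 tensor of decoded RGB frames (illustrative input)
    total = frames_uint8.shape[0]
    # Uniformly sample num_frames frame indices across the clip
    idx = th.linspace(0, total - 1, num_frames).long()
    clip = frames_uint8[idx].float() / 255.0           # scale to [0, 1]
    clip = clip.permute(0, 3, 1, 2)                    # num_frames x 3 x H x W
    clip = F.interpolate(clip, size=(size, size), mode='bilinear', align_corners=False)
    return clip.permute(1, 0, 2, 3).unsqueeze(0)       # 1 x 3 x num_frames x size x size

# Random frames standing in for a decoded video (80 frames of 360x640 RGB)
dummy_frames = th.randint(0, 256, (80, 360, 640, 3), dtype=th.uint8)
my_clip = frames_to_clip(dummy_frames)                 # 1 x 3 x 32 x 224 x 224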

video_output is a dictionary containing two keys:

  • video_embedding: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.
  • mixed_5c: This is the globally average-pooled feature from S3D, of dimension 1024. It should be used for classification on downstream tasks (a small sketch of this is shown after these lists).

text_output is also a dictionary with a single key:

  • text_embedding: It is the text embedding (size 512) from the joint text-video space. To compute the similarity score between text and video, you would compute the dot product between text_embedding and video_embedding.
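As an illustration of how mixed_5c could feed a downstream task, one could attach a small classification head on top of the 1024-d feature. The head below is purely hypothetical and reuses net and video from the snippet above:

import torch as th
import torch.nn as nn

# Hypothetical linear head for a downstream task with 10 classes
classifier = nn.Linear(1024, 10)

with th.no_grad():
    feats = net(video)['mixed_5c']    # B x 1024 pooled S3D features
logits = classifier(feats)            # B x 10 class scores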

Computing all the pairwise video-text similarities:

The similarity scores can be computed with a dot product between the text_embedding and the video_embedding.

video_embedding = video_output['video_embedding']
text_embedding = text_output['text_embedding']
# We compute all the pairwise similarity scores between video and text.
similarity_matrix = th.matmul(text_embedding, video_embedding.t())
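To turn these scores into a retrieval result, one could for instance rank the videos for each text query by decreasing similarity:

# Rank videos for each text query by decreasing similarity
ranked_videos = similarity_matrix.argsort(dim=1, descending=True)
top1_video_per_text = ranked_videos[:, 0]   # index of the best-matching video per query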

References

[1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman, End-to-End Learning of Visual Representations from Uncurated Instructional Videos. https://arxiv.org/abs/1912.06430

[2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. https://arxiv.org/abs/1906.03327

Bibtex:

@inproceedings{miech19howto100m,
   title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
   author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
   booktitle={ICCV},
   year={2019},
}

@inproceedings{miech19endtoend,
   title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
   author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
   booktitle={CVPR},
   year={2020},
}

Acknowledgements

We would like to thank Yana Hasson for the help provided in the non trivial porting of the original Tensorflow weights to PyTorch.

s3d_howto100m's People

Contributors

antoine77340


s3d_howto100m's Issues

Weights link is no longer working

Hi, it seems the link to download the weights is down.

Is there a way to upload the weights to a new link or make the old one work again?

Linking here from the code for our paper, as it relies on the S3D pre-trained weights :') sumedh7/RoboCLIP#4

How to normalize the images?

The images should be normalized to [0, 1], but what about the mean and std? For example, the normalization for PyTorch pre-trained models is: normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).
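For what it's worth, the README above only mentions scaling to [0, 1]; a minimal sketch of just that scaling is shown below (whether an additional mean/std normalization is needed is exactly the question here):

import torch as th

# frames: B x 3 x T x H x W uint8 frames in [0, 255]
frames = th.randint(0, 256, (1, 3, 32, 224, 224), dtype=th.uint8)
video = frames.float() / 255.0   # scaled to [0, 1] as described in the README, no mean/std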

Fine-tune the model

Hi Antoine,

If I want to fine-tune this model, what is the format of the training examples? Do you have sample code for fine-tuning? Thanks!

Best,
Yue

s3d pretrained model on kinetics

Hi, Antoine,

thanks for open-sourcing such great work! The pretrained weights are very useful. Recently, I have been trying to compare your weights pretrained on HowTo100M with weights pretrained on Kinetics. However, I failed to find the latter in your repos or anywhere else. I am wondering whether you have Kinetics-pretrained weights on your side? Thanks!

best,

Can't reproduce results for YouCookII

I took this model

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

and the code from this repository. I took the validation split of YouCookII and tried to reproduce the numbers reported in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos.


It is unclear which protocol you used for testing. In the table below I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would also be great if you could publish the testing script.

What I tried:

  • T is the time in seconds. I split each clip into subclips of length T seconds and compute an embedding for each subclip.
  • pooling: if a clip was split into more than one subclip, the subclip embeddings are aggregated with this pooling (a sketch of this step is shown after the table below).
  • imgsz: the short side of each source video is rescaled to imgsz, preserving the h:w ratio, then a center crop is taken from each frame.
  • normalize: whether or not the sentence embedding and each video embedding were L2-normalized before the dot product.
  • num frames: from each T-second clip, num frames frames are sampled uniformly.
  • num resample: for each clip, sample num resample different sets of frames and compute an embedding for each resample; with pooling, all embeddings are pooled into a single one. LCR means sampling from each clip 3 times: num frames left crops, num frames right crops, num frames center crops.
T   | imgsz | pooling | normalize | num frames | num resample | R@1    | R@5    | R@10   | MedR
250 | 200   | max     | False     | 32         | 1            | 11.478 | 27.610 | 37.453 | 21
250 | 224   | max     | False     | 32         | 1            | 8.774  | 22.044 | 30.975 | 32
250 | 256   | max     | False     | 32         | 1            | 5.912  | 15.503 | 21.038 | 104
1.5 | 200   | max     | False     | 32         | 1            | 8.333  | 23.208 | 31.981 | 31
3.2 | 200   | max     | False     | 32         | 1            | 9.497  | 24.969 | 34.654 | 24
8   | 200   | max     | False     | 32         | 1            | 10.094 | 25.818 | 35.849 | 23
16  | 200   | max     | False     | 32         | 1            | 10.755 | 26.478 | 36.541 | 21
32  | 200   | max     | False     | 32         | 1            | 11.164 | 27.484 | 37.296 | 21
64  | 200   | max     | False     | 32         | 1            | 11.415 | 27.704 | 37.547 | 21
128 | 200   | max     | False     | 32         | 1            | 11.447 | 27.610 | 37.453 | 21
250 | 200   | max     | True      | 32         | 1            | 9.906  | 25.031 | 34.748 | 25
250 | 200   | max     | False     | 32         | 2            | 11.604 | 28.270 | 37.987 | 20
250 | 200   | max     | False     | 32         | 3            | 11.918 | 28.396 | 38.333 | 21
250 | 200   | max     | False     | 32         | 4            | 11.509 | 28.082 | 38.365 | 21
250 | 200   | max     | False     | 32         | LCR          | 11.384 | 27.138 | 37.704 | 22
250 | 200   | mean    | False     | 32         | 4            | 12.075 | 28.805 | 38.459 | 20
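For clarity, here is a minimal sketch of the subclip-and-pool step described in the list above; the helper name and interface are illustrative, not from this repository:

import torch as th

def pool_subclip_embeddings(net, subclips, pooling='max'):
    # subclips: list of 1 x 3 x T x H x W tensors covering one clip
    with th.no_grad():
        embs = th.cat([net(sc)['video_embedding'] for sc in subclips], dim=0)
    if pooling == 'max':
        return embs.max(dim=0).values    # 512-d pooled clip embedding
    return embs.mean(dim=0)              # 'mean' pooling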

Video configuration for best performance under limited computing resource

Hi Antoine,

First of all, great work! The code is extremely friendly to use; thank you for your efforts.

I'm trying to use your model as the first step of my own project to extract good features for both video and language. It would be great if you could advise on some doubts I have.

If I understand correctly, the model performs best on video clips with "FPS = 10, 32 frames (3.2 sec)". Due to my limited computing resources (basically GPU memory), I'd like to downscale this config. What rules should I stick to in this situation? Should I keep the clip length at 3.2 sec, hence use something like "FPS = 2.5, 8 frames (3.2 sec)", or should I keep the FPS at 10, hence use something like "FPS = 10, 8 frames (0.8 sec)"?

Second, to what extent do you recommend fine-tuning the parameters of your pretrained MIL-NCE model? I think it is safe to assume that fine-tuning will always help on downstream tasks, but I have little sense of how much it could help in our case. Maybe you could also advise on this?

Thank you in advance.

Question in Sentence Embedding

I have a question about the Sentence_Embedding model forward implementation.

Why is torch.max applied after the first fully connected layer? Is this better than doing it earlier, i.e., averaging all the word embeddings of a sentence before the FC layers?

Thanks for the clarification
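For reference, here is a minimal sketch of the two orderings being compared; the layer sizes and names are illustrative, not the repository's actual Sentence_Embedding code:

import torch as th
import torch.nn as nn

word_emb = th.rand(2, 16, 300)    # batch x words x word-embedding dim (illustrative sizes)
fc = nn.Linear(300, 512)

# Ordering asked about: FC applied per word, then max over the word dimension
out_max_after_fc = fc(word_emb).max(dim=1).values    # batch x 512

# Alternative raised in the question: average the word embeddings first, then FC
out_avg_before_fc = fc(word_emb.mean(dim=1))         # batch x 512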

Video preprocessing steps?

Hi @antoine77340,

Thanks for sharing this codebase. I'd like to evaluate your pretrained model on several custom videos, but I don't see any code/instructions on how to preprocess the videos for inference. Could you share some insight?

Thanks,

pre-extracted video embedding from joint space

Hi Antoine,

I am impressed by your excellent work which is very helpful to my research!

I would like to know if you have extracted the joint-space features (512-d) for all clips in HowTo100M that I could directly download.

I have already downloaded the S3D features for all clips; is there any way to convert these feature vectors to the joint space?

Thank you very much!

Yue
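For context, since the README above describes video_embedding as a 512-d projection of the 1024-d mixed_5c feature, one possible way to map pre-extracted 1024-d features into the joint space is to reuse the model's final linear layer. The attribute name net.fc below is an assumption about s3dg.py and may differ:

import torch as th

# Pre-extracted 1024-d S3D (mixed_5c) features for N clips
mixed_5c_feats = th.rand(100, 1024)

with th.no_grad():
    # Assumption: the 1024 -> 512 joint-space projection is the model's final
    # linear layer, exposed here as net.fc; check s3dg.py for the actual name.
    joint_feats = net.fc(mixed_5c_feats)   # N x 512 embeddings in the joint space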

finetune results on UCF101

Hi, I am having some trouble reproducing the fine-tuning results on UCF101. I can only get 88.6 while the reported result is 91.3. Could you share your fine-tuning script or procedure/hyper-params? Thanks!

Dramatically accuracy drop with JPG compression

I tested your model on YouCookII with this protocol (4x32 contiguous frames at 10 FPS). I extracted images from the videos in two ways.

  1. ffmpeg -y -i <INPUT.mp4> -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg
  2. ffmpeg -y -i <INPUT.mp4> -qscale:v 2 -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg

The first one compresses the output JPGs (default quality); the second one saves JPGs with the best quality.

Below are example frames for 1, 2 and -qscale:v 31 (poorest quality). Please ignore the H/W ratio; in testing I used the correct H/W ratio.

[example frame images]

The difference between 1 and 2 is small.

source                            | R@1    | R@5    | R@10   | MedR
results in article                | 15.1   | 38     | 51.2   | 10
my retest, ffmpeg best quality    | 15.975 | 38.208 | 50.126 | 10
my retest, ffmpeg default quality | 10.629 | 27.201 | 7.925  | 20

Note: some videos from YouCookII are unavailable today, so I tested only on available videos.

Despite the small visual difference between 1 and 2, the difference in test results is significant. It may be because some of the intersection between YouCookII and HowTo100M wasn't filtered out, and the network memorized some videos from this intersection.

My question is: are you sure that the intersection between YouCookII and HowTo100M was completely removed from the training dataset? Could you post in this thread the YouTube video ids that were used for training (or those that were thrown away)? I want to double-check the intersection.
