
Temporal Alignment Networks for Long-term Video

Tengda Han, Weidi Xie, Andrew Zisserman. CVPR 2022 Oral.

[project page] [PDF] [Arxiv] [Video]

News

  • [23.08.30] 📣 📣 We released WhisperX ASR output and InternVideo & CLIP-L14 visual features for HowTo100M here.
  • [22.09.14] Fixed a bug that affects the ROC-AUC calculation on the HTM-Align dataset. Other metrics are not affected. Details
  • [22.09.14] Fixed a few typos and some incorrect annotations in HTM-Align. This download link is up-to-date.
  • [22.08.04] Released HTM-370K and HTM-1.2M here, the sentencified version of HowTo100M. Thank you for your patience; I'm working on the rest.

TLDR

  • Natural instructional videos (e.g. from YouTube) suffer from a visual-textual misalignment problem, which introduces lots of noise and makes them hard to learn from.
  • Our model learns to predict (a toy sketch follows this list):
    1. whether an ASR sentence is alignable with the video, and
    2. if so, the best-matching video timestamps.
  • Our model is trained without human annotation and can be used to clean up noisy instructional videos (as the output, we release an Auto-Aligned HTM dataset, HTM-AA).
  • In our paper, we show that the auto-aligned HTM dataset improves the quality of the backbone visual representation compared with the original HTM.
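
The following toy sketch illustrates the decision described above; it is my own minimal illustration, not the paper's actual alignment network. It scores each (sentence, clip) pair by cosine similarity, calls a sentence alignable when its best score clears a threshold, and takes the argmax clip as the predicted timestamp:

import torch.nn.functional as F

def align_sentences(sent_emb, clip_emb, clip_times, threshold=0.5):
    """Toy alignment pass over one video (illustrative, not the paper's model).

    sent_emb:   (S, D) sentence embeddings
    clip_emb:   (T, D) per-clip visual embeddings
    clip_times: (T,)   center time in seconds of each clip
    Returns, per sentence: (is it alignable?, predicted timestamp).
    """
    sim = F.normalize(sent_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).T  # (S, T)
    best_score, best_clip = sim.max(dim=1)
    alignable = best_score > threshold  # the threshold value is a made-up stand-in
    return alignable, clip_times[best_clip]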

Datasets (check the project page for details)

  • HTM-Align: A manually annotated 80-video subset for alignment evaluation.
  • HTM-AA: A large-scale video-text paired dataset automatically aligned using our TAN without using any manual annotations.
  • Sentencified HTM: The original HTM dataset, but with the ASR processed into full sentences.

Tool

  • Sentencify-text: A pipeline to pre-process ASR text segments into full sentences (see the sketch below).
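
As a rough illustration of what such a pipeline does (the actual Sentencify-text tool is more involved and restores punctuation with a model), the sketch below joins timed ASR segments and splits on sentence-final punctuation; all names are hypothetical:

import re

def sentencify(segments):
    """Merge timed ASR segments into full sentences (crude heuristic sketch).

    segments: list of (text, start, end) tuples from ASR.
    Assumes the text already carries sentence-final punctuation, which raw
    ASR usually lacks; real pipelines restore it first.
    """
    sentences, buf, buf_start = [], [], None
    for text, start, end in segments:
        if buf_start is None:
            buf_start = start
        buf.append(text.strip())
        if re.search(r"[.!?]$", text.strip()):  # sentence boundary found
            sentences.append((" ".join(buf), buf_start, end))
            buf, buf_start = [], None
    if buf:  # flush a trailing fragment without closing punctuation
        sentences.append((" ".join(buf), buf_start, end))
    return sentences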

Training TAN

Using output of TAN for end-to-end training.
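
A hedged sketch of that recipe, under assumed names: suppose a hypothetical table tan_outputs maps each (video_id, sentence_index) to an alignability score and an aligned timestamp; keep only the pairs TAN marks alignable, retime them, and feed them to the usual backbone training loop.

def build_auto_aligned_pairs(samples, tan_outputs, min_score=0.5):
    """Filter noisy (video, sentence) pairs with TAN outputs (illustrative sketch).

    samples:     iterable of (video_id, sent_idx, sentence)
    tan_outputs: (video_id, sent_idx) -> (alignability, aligned_time); this
                 layout and the 0.5 cut-off are assumptions, not the repo's.
    Yields cleaned, retimed pairs suitable for end-to-end training.
    """
    for video_id, sent_idx, sentence in samples:
        alignability, aligned_time = tan_outputs[(video_id, sent_idx)]
        if alignability >= min_score:  # drop narration TAN deems unalignable
            yield video_id, sentence, aligned_time  # retimed positive pair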

Checkpoints of TAN

Reference

@InProceedings{han2022align,
  title={Temporal Alignment Networks for Long-term Video},
  author={Tengda Han and Weidi Xie and Andrew Zisserman},
  booktitle={CVPR},
  year={2022}
}


temporalalignnet's Issues

About the modelzoo

Hello, thank you for the excellent work. Could you provide the pretrained models in the model zoo?

The README.md says "We aim to release a model zoo for both TAN variants and end-to-end visual representations in July", but I cannot find them.

About pre-training hardware

Hi,

Thank you for your solid work and code releases.
May I ask how many GPUs are used in the basic TAN pre-training? (Sorry if I missed that description in your paper)

Thanks!

About video timestamps in the HTM-AA dataset

Hi, thank you for the fantastic work! I have some questions about the timestamps in the HTM-AA dataset and hope you can help me.

As mentioned in issue #3, the timestamp provided by your model is the center timestamp of the sentence, and the shifted timestamps keep the same duration as the original ASR timestamps, as mentioned in Sec. 3.4.2 of your paper. To build a timestamp in the format [t_start, t_end], I use the following steps:

For the i-th sentence in video $V$:

$t^i = [t^i_{center} - l^i_v / 2,\; t^i_{center} + l^i_v / 2]$

where $l^i_v$ is the timestamp duration queried from HTM-1.2M and $t^i_{center}$ is queried from HTM-AA.
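
A minimal Python sketch of this reconstruction, assuming hypothetical mappings centers (from HTM-AA) and durations (from HTM-1.2M) keyed by (video_id, sentence_index):

def sentence_span(video_id, sent_idx, centers, durations):
    """Rebuild [t_start, t_end] for one sentence from its aligned center.

    centers:   (video_id, sent_idx) -> center timestamp from HTM-AA
    durations: (video_id, sent_idx) -> original ASR duration from HTM-1.2M
    Both layouts are assumptions; the released files use their own format.
    """
    t_center = centers[(video_id, sent_idx)]
    length = durations[(video_id, sent_idx)]
    return t_center - length / 2, t_center + length / 2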

But I found that not all sentences in HTM-AA can be queried from HTM-1.2M. Could you please explain this? (Sorry if I missed some details in your paper.)

negative pairs

Hello, thank you for the excellent work. I have a few concerns about the first training stage with the contrastive loss. I notice that this work draws negative pairs from within the same video, which differs from the conventional approach of using the other samples in the batch. Am I understanding this correctly, and if so, did you analyze the evaluation performance with only the pre-training stage? Thank you in advance!
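
For context, a minimal sketch (my own illustration, not the repository's implementation) of a within-video InfoNCE loss: each sentence's positive is its aligned clip, and the negatives are the remaining clips of the same video rather than other batch samples.

import torch.nn.functional as F

def within_video_infonce(text_emb, video_emb, pos_idx, temperature=0.07):
    """Contrastive loss whose negatives come from the same video (illustrative).

    text_emb:  (S, D) sentence embeddings for one video
    video_emb: (T, D) clip embeddings for the same video
    pos_idx:   (S,)   long tensor, index of the aligned clip per sentence
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.T / temperature  # (S, T) similarities
    return F.cross_entropy(logits, pos_idx)  # non-positive clips act as negatives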

video timestamps

Hey, thank you for the amazing work!

How should we interpret the video timestamps?

There are many cases where the timestamp is off by a lot. For example, in the video oatmgMKmbmA:

  • oatmgMKmbmA,5,"my favorite around the birthday cake": the timestamp is at 5 s, but listening to the video, this sentence only occurs after second 25.
  • oatmgMKmbmA,16,"so the bakery we usually learn on whipped cream and then graduate to buttercream": this happens at second 20 (before the previous row).

Any help regarding this topic will be appreciated. Thanks!
