
Temporal Alignment Networks for Long-term Video

Tengda Han, Weidi Xie, Andrew Zisserman. CVPR 2022 Oral.

[project page] [PDF] [Arxiv] [Video]

News

  • [23.08.30] 📣 📣 We released WhisperX ASR output and InternVideo & CLIP-L14 visual features for HowTo100M here.
  • [22.09.14] Fixed a bug that affects the ROC-AUC calculation on the HTM-Align dataset. Other metrics are not affected. Details
  • [22.09.14] Fixed a few typos and some incorrect annotations in HTM-Align. This download link is up-to-date.
  • [22.08.04] Released HTM-370K and HTM-1.2M here, the sentencified version of HowTo100M. Thank you for your patience; I'm working on the rest.

TLDR

  • Natural instructional videos (e.g. from YouTube) suffer from a visual-textual misalignment problem, which introduces lots of noise and makes them hard to learn from.
  • Our model learns to predict (a toy sketch follows this list):
    1. whether an ASR sentence is alignable with the video, and
    2. if so, the best-matching video timestamps.
  • Our model is trained without human annotation and can be used to clean up noisy instructional videos (as the output, we release an Auto-Aligned HTM dataset, HTM-AA).
  • In our paper, we show that the auto-aligned HTM dataset improves the quality of the backbone visual representation compared with the original HTM.
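
The following toy sketch illustrates the decision described above; it is my own minimal illustration, not the paper's actual alignment network. It scores each (sentence, clip) pair by cosine similarity, calls a sentence alignable when its best score clears a threshold, and takes the argmax clip as the predicted timestamp:

import torch.nn.functional as F

def align_sentences(sent_emb, clip_emb, clip_times, threshold=0.5):
    """Toy alignment pass over one video (illustrative, not the paper's model).

    sent_emb:   (S, D) sentence embeddings
    clip_emb:   (T, D) per-clip visual embeddings
    clip_times: (T,)   center time in seconds of each clip
    Returns, per sentence: (is it alignable?, predicted timestamp).
    """
    sim = F.normalize(sent_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).T  # (S, T)
    best_score, best_clip = sim.max(dim=1)
    alignable = best_score > threshold  # the threshold value is a made-up stand-in
    return alignable, clip_times[best_clip]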

Datasets (check the project page for details)

  • HTM-Align: A manually annotated 80-video subset for alignment evaluation.
  • HTM-AA: A large-scale video-text paired dataset automatically aligned using our TAN without using any manual annotations.
  • Sentencified HTM: The original HTM dataset, but with the ASR processed into full sentences.

Tool

  • Sentencify-text: A pipeline to pre-process ASR text segments into full sentences (see the sketch below).
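
As a rough illustration of what such a pipeline does (the actual Sentencify-text tool is more involved and restores punctuation with a model), the sketch below joins timed ASR segments and splits on sentence-final punctuation; all names are hypothetical:

import re

def sentencify(segments):
    """Merge timed ASR segments into full sentences (crude heuristic sketch).

    segments: list of (text, start, end) tuples from ASR.
    Assumes the text already carries sentence-final punctuation, which raw
    ASR usually lacks; real pipelines restore it first.
    """
    sentences, buf, buf_start = [], [], None
    for text, start, end in segments:
        if buf_start is None:
            buf_start = start
        buf.append(text.strip())
        if re.search(r"[.!?]$", text.strip()):  # sentence boundary found
            sentences.append((" ".join(buf), buf_start, end))
            buf, buf_start = [], None
    if buf:  # flush a trailing fragment without closing punctuation
        sentences.append((" ".join(buf), buf_start, end))
    return sentences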

Training TAN

Using output of TAN for end-to-end training.
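
A hedged sketch of that recipe, under assumed names: suppose a hypothetical table tan_outputs maps each (video_id, sentence_index) to an alignability score and an aligned timestamp; keep only the pairs TAN marks alignable, retime them, and feed them to the usual backbone training loop.

def build_auto_aligned_pairs(samples, tan_outputs, min_score=0.5):
    """Filter noisy (video, sentence) pairs with TAN outputs (illustrative sketch).

    samples:     iterable of (video_id, sent_idx, sentence)
    tan_outputs: (video_id, sent_idx) -> (alignability, aligned_time); this
                 layout and the 0.5 cut-off are assumptions, not the repo's.
    Yields cleaned, retimed pairs suitable for end-to-end training.
    """
    for video_id, sent_idx, sentence in samples:
        alignability, aligned_time = tan_outputs[(video_id, sent_idx)]
        if alignability >= min_score:  # drop narration TAN deems unalignable
            yield video_id, sentence, aligned_time  # retimed positive pair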

Checkpoints of TAN

Reference

@InProceedings{han2022align,
  title={Temporal Alignment Networks for Long-term Video},
  author={Tengda Han and Weidi Xie and Andrew Zisserman},
  booktitle={CVPR},
  year={2022}
}


temporalalignnet's Issues

About the modelzoo

Hello, thank you for the excellent work. Could you provide the pretrained models in the model zoo?

The README.md says "We aim to release a model zoo for both TAN variants and end-to-end visual representations in July", but I cannot find them.

About pre-training hardware

Hi,

Thank you for your solid work and code releases.
May I ask how many GPUs are used in the basic TAN pre-training? (Sorry if I missed that description in your paper)

Thanks!

About video timestamps in the HTM-AA dataset

Hi, thank you for the fantastic work! I have some questions about the timestamps in the HTM-AA dataset and hope you can help me.

As mentioned in issue #3, the timestamp provided by your model is the center timestamp of the sentence, and the shifted timestamps keep the same duration as the original ASR timestamps, as mentioned in Sec. 3.4.2 of your paper. To build a timestamp in the format [t_start, t_end], I use the following steps:

For the i-th sentence in video $V$:

$t^i = [t^i_{center} - l^i_v / 2,\; t^i_{center} + l^i_v / 2]$

where $l^i_v$ is the timestamp duration queried from HTM-1.2M and $t^i_{center}$ is queried from HTM-AA.
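
A minimal Python sketch of this reconstruction, assuming hypothetical mappings centers (from HTM-AA) and durations (from HTM-1.2M) keyed by (video_id, sentence_index):

def sentence_span(video_id, sent_idx, centers, durations):
    """Rebuild [t_start, t_end] for one sentence from its aligned center.

    centers:   (video_id, sent_idx) -> center timestamp from HTM-AA
    durations: (video_id, sent_idx) -> original ASR duration from HTM-1.2M
    Both layouts are assumptions; the released files use their own format.
    """
    t_center = centers[(video_id, sent_idx)]
    length = durations[(video_id, sent_idx)]
    return t_center - length / 2, t_center + length / 2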

But I found that not all sentences in HTM-AA can be queried from HTM-1.2M. Could you please explain this? (Sorry if I missed some details in your paper.)

negative pairs

Hello, thank you for the excellent work. I have a few concerns about the first training stage with the contrastive loss. I notice that this work draws negative pairs from within the same video, which differs from the conventional approach of using the other samples in the batch. Am I understanding this correctly, and if so, did you analyze the evaluation performance with only the pre-training stage? Thank you in advance!
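
For context, a minimal sketch (my own illustration, not the repository's implementation) of a within-video InfoNCE loss: each sentence's positive is its aligned clip, and the negatives are the remaining clips of the same video rather than other batch samples.

import torch.nn.functional as F

def within_video_infonce(text_emb, video_emb, pos_idx, temperature=0.07):
    """Contrastive loss whose negatives come from the same video (illustrative).

    text_emb:  (S, D) sentence embeddings for one video
    video_emb: (T, D) clip embeddings for the same video
    pos_idx:   (S,)   long tensor, index of the aligned clip per sentence
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.T / temperature  # (S, T) similarities
    return F.cross_entropy(logits, pos_idx)  # non-positive clips act as negatives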

video timestamps

Hey, thank you for the amazing work!

How should we interpret the video timestamps?

There are many cases where the timestamp is off by a lot. For example, in the video oatmgMKmbmA:

  • oatmgMKmbmA,5,"my favorite around the birthday cake": the timestamp is at 5 s, but listening to the video, this sentence only occurs after second 25.
  • oatmgMKmbmA,16,"so the bakery we usually learn on whipped cream and then graduate to buttercream": this happens at second 20 (before the previous row).

Any help regarding this topic will be appreciated. Thanks!
