
Comments (6)

gabeur commented on August 11, 2024

This is how we precompute the S3D features:
Each segment is 1 second long with no overlap, and the FPS is kept at 30.
So each segment has 30 frames, the input size is 30x224x224x3, and the output of S3D is averaged to 1x1x1x1024.
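
For concreteness, here is a minimal PyTorch sketch of this segmentation-and-pooling recipe. It is not the authors' code; `s3d_model` is an assumption, standing in for an S3D backbone that returns pre-pooling features of shape (1, 1024, t, h, w).

```python
import torch

FPS = 30       # frames kept at the native 30 fps
SEG_LEN = FPS  # 1-second non-overlapping segments -> 30 frames each

def extract_s3d_features(frames, s3d_model):
    """frames: float tensor of shape (num_frames, 3, 224, 224), preprocessed."""
    feats = []
    for start in range(0, frames.shape[0] - SEG_LEN + 1, SEG_LEN):
        clip = frames[start:start + SEG_LEN]          # (30, 3, 224, 224)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 30, 224, 224)
        with torch.no_grad():
            out = s3d_model(clip)                     # assumed (1, 1024, t, h, w)
        feats.append(out.mean(dim=(2, 3, 4)))         # average-pool to (1, 1024)
    return torch.cat(feats, dim=0)                    # (num_seconds, 1024)
```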

Your results look pretty close. I think it is important to report average results over several experiments before drawing conclusions, because there is significant variation with respect to the random seed.


gabeur commented on August 11, 2024

> In the h5 files you provided under the folder vid_feat_files/mult_h5, the data has keys features.vggish and features.audio. Is there any difference between those two features? Are they both used by the model?

features.audio are the audio features extracted by the authors of CE.
features.vggish are the audio features extracted by us.
We only use the features.vggish audio features for the results reported in the paper.

> Did you use the default way to extract vggish features, the same as mentioned in the CE paper?

For obtaining the features.vggish, we used the same approach as the authors of CE, except that our window size is 1.0 s.
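
The thread does not spell out how a 1.0 s window was realized. The VGGish network takes fixed 96x64 log-mel patches (96 frames x 10 ms = 0.96 s), so a literal 1.0 s window would not fit its input; one plausible interpretation, sketched below purely as an assumption, is to keep the 0.96 s patch but emit one example per second by setting the hop to 1.0 s, using the TensorFlow models/research/audioset/vggish code.

```python
# Hedged sketch, not confirmed by the authors: one 0.96 s log-mel patch
# sampled every 1.0 s, via the audioset vggish helper modules.
import vggish_input
import vggish_params

vggish_params.EXAMPLE_HOP_SECONDS = 1.0  # one log-mel example per second
# window stays at vggish_params.EXAMPLE_WINDOW_SECONDS == 0.96

examples = vggish_input.wavfile_to_examples('audio.wav')  # (num_seconds, 96, 64)
```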


gabeur commented on August 11, 2024

Sorry, we cannot share the feature extraction code.
The checkpoint to extract the S3D features is available here.


lininglouis commented on August 11, 2024

> Sorry, we cannot share the feature extraction code.
> The checkpoint to extract the S3D features is available here.

Cool, that's enough. Thanks for your quick reply.


lininglouis commented on August 11, 2024

Hi Gabeur,
May I ask how you precompute the S3D features?
According to the pentathlon challenge, "Frames are extracted at 10fps and processed in clips of 32 frames with a stride of 25 frames." (pentathlon)
But I don't think you used this approach, because the number of S3D feature vectors (each 1024-dimensional) you compute for each video is similar to the video duration in seconds (for example, a video of 11 seconds has S3D features of shape (11, 1024) in MMT).
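
One quick way to verify this duration-to-feature-count correspondence is to walk one of the provided h5 files and print every dataset's shape. The filename below is a placeholder; the thread only names the folder and keys like features.vggish, so the exact internal layout is an assumption.

```python
import h5py

def show_shapes(name, obj):
    # Print each dataset path and its shape, e.g. an 11 s video -> (11, 1024).
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape)

with h5py.File('vid_feat_files/mult_h5/example.h5', 'r') as f:  # placeholder name
    f.visititems(show_shapes)
```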

I'm wondering how you sample and extract the S3D features. I tried two ways to extract S3D; here are the results.
[table of retrieval results under the two S3D extraction settings]

The S3D I used is from model. (The S3D model you provided earlier is corrupted somehow; I cannot load the pretrained weights, so I switched to this S3D version.)

As you can see, there still remains a gap. It could be a problem with the S3D model I used, or the way I extract the S3D features could differ from yours. Could you give some advice? Thanks!


lininglouis commented on August 11, 2024

> This is how we precompute the S3D features:
> Each segment is 1 second long with no overlap, and the FPS is kept at 30.
> So each segment has 30 frames, the input size is 30x224x224x3, and the output of S3D is averaged to 1x1x1x1024.
>
> Your results look pretty close. I think it is important to report average results over several experiments before drawing conclusions, because there is significant variation with respect to the random seed.

Hi Gabeur, we used the approach you suggested, and the performance of the S3D features is similar now. Thanks a lot!

But we ran into some problems with the audio (vggish) features. There are two questions, and we hope you can help.

  1. In the h5 files you provided under the folder vid_feat_files/mult_h5, the data has keys features.vggish and features.audio. Is there any difference between those two features? Are they both used by the model?

  2. Did you use the default way to extract vggish features, the same as mentioned in the CE paper?


I noticed that, according to the CE paper and the vggish TensorFlow repo, the audio should be parsed into non-overlapping 0.96 s collections of frames. But in MMT's expert_timings.py, the expert timing for vggish has a feat_width of 1.0. It looks like you parse the audio into 1.0 s collections of frames.

Since there is a 0.04 s difference per window, did you resample the data or align the vggish features? If so, may I know how the vggish features were calculated? Please correct me if my understanding is not right.
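
To make the concern concrete, the misalignment implied by this 0.04 s difference accumulates linearly with the feature index:

```python
# With non-overlapping 0.96 s windows, the i-th vggish feature covers
# [0.96*i, 0.96*(i+1)) seconds, whereas a feat_width of 1.0 treats it as
# covering [i, i+1). The offset grows by 0.04 s per feature.
for i in (1, 10, 100):
    print(f"feature {i}: drift = {1.0 * i - 0.96 * i:.2f} s")
# feature 1: drift = 0.04 s
# feature 10: drift = 0.40 s
# feature 100: drift = 4.00 s
```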

Many thanks for your help!

