ASDNet

PyTorch implementation of the paper How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild

Figure 1. Audio-visual active speaker detection pipeline. The task is to determine whether the reference speaker at frame t is speaking or not speaking. The pipeline starts with audio-visual encoding of each speaker in the clip. Second, inter-speaker relation modeling is applied within each frame. Finally, temporal modeling captures long-term relationships in natural conversations. Examples are from the AVA-ActiveSpeaker dataset.
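The three stages above can be sketched as a plain-Python data flow. This is purely illustrative: the function bodies below are stand-in operations (concatenation, averaging), not the neural architectures described in the paper, and all names are hypothetical.

```python
# Illustrative three-stage data flow. Real stages are neural networks
# (audio/video backbones, ISRM, temporal model); these are placeholders.

def av_encoding(audio_feat, face_feat):
    # Stage 1: per-speaker audio-visual embedding
    # (list concatenation as a stand-in for learned fusion).
    return audio_feat + face_feat

def inter_speaker_relation(reference, context):
    # Stage 2: enrich the reference speaker's embedding with the
    # embeddings of the other speakers visible in the same frame.
    return reference + [x for emb in context for x in emb]

def temporal_model(per_frame_embs):
    # Stage 3: aggregate the per-frame embeddings over time
    # (element-wise mean as a stand-in for long-term modeling).
    n = len(per_frame_embs)
    dim = len(per_frame_embs[0])
    return [sum(e[i] for e in per_frame_embs) / n for i in range(dim)]
```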

Requirements

  • To create the conda environment and install the required libraries, run ./scripts/dev_env.sh.

Dataset Preparation

  • Run ./scripts/dowloads.sh to download three utility files, which are necessary to preprocess the AVA-ActiveSpeaker dataset.
  1. Download AVA videos from https://github.com/cvdfoundation/ava-dataset.

  2. Extract the audio tracks from every video in the dataset. Open ./data/extract_audio_tracks.py and, in main, adapt ava_video_dir (the directory with the original AVA videos) and target_audios (an empty directory where the audio tracks will be stored) to your local file system.

  3. Slice the audio tracks by timestamp. Open ./data/slice_audio_tracks.py and, in main, adapt ava_audio_dir (the directory with the audio tracks you extracted in step 2), output_dir (an empty directory where the sliced audio files will be stored) and csv (the utility file you downloaded previously; use the train/val/test file accordingly) to your local file system.

  4. Extract the face crops by timestamp. Open ./data/extract_face_crops_time.py and, in main, adapt ava_video_dir (the directory with the original AVA videos), csv_file (the utility file you downloaded previously; use the train/val/test file accordingly) and output_dir (an empty directory where the face crops will be stored) to your local file system. This process produces about 124 GB of extra data.

The full audio tracks obtained in step 2 will not be used anymore after slicing.
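The audio extraction in step 2 boils down to an ffmpeg call per video. The sketch below shows one plausible way to build such a call; build_ffmpeg_cmd is a hypothetical helper (the actual logic lives in ./data/extract_audio_tracks.py), and the 16 kHz mono WAV settings are an assumption, not confirmed by this README.

```python
import os

def build_ffmpeg_cmd(video_path, target_audio_dir):
    # Derive the audio filename from the video filename and build an
    # ffmpeg command that drops the video stream (-vn) and writes a
    # mono (-ac 1), 16 kHz (-ar 16000) WAV file. Settings are assumed.
    name = os.path.splitext(os.path.basename(video_path))[0]
    out = os.path.join(target_audio_dir, name + ".wav")
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", out]
```

Each command list can then be passed to subprocess.run for every video under ava_video_dir.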

Audio-Visual Encoding (AV_Enc): Training, Feature Extraction, Postprocessing and Results

Training

Audio-visual encoders can be trained with the following command:

python main.py --stage av_enc \
	--audio_backbone sincdsnet \
	--video_backbone resnext101 \
	--video_backbone_pretrained_path /usr/home/kop/ASDNet/weights/kinetics_resnext_101_RGB_16_best.pth \
	--epochs 70 \
	--step_size 30 \
	--av_enc_learning_rate 3e-4 \
	--av_enc_batch_size 24

Feature Extraction

Use --forward to enable feature extraction, and use --resume_path to specify the saved model checkpoint to use for feature extraction.
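For example, feature extraction with a trained checkpoint might look like the following; the checkpoint path is illustrative, only the --stage, --forward and --resume_path flags come from this README:

```shell
python main.py --stage av_enc \
	--forward \
	--resume_path ./checkpoints/av_enc_best.pth
```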

Postprocessing

Use --postprocessing to enable postprocessing, which produces final/AV_Enc.csv and final/gt.csv.

Getting Results

Use the following command to get AV_Enc results:

python get_ava_active_speaker_performance.py -p final/AV_Enc.csv -g final/gt.csv
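The evaluation script reports mAP for the active-speaker class. Conceptually, per-class average precision over score-ranked predictions can be sketched as below; this is a simplified stand-in, not the official AVA evaluation code, and average_precision is a hypothetical helper.

```python
def average_precision(scores, labels):
    # Rank predictions by descending score, then average the precision
    # values at the ranks where a positive (speaking) label occurs.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    ap = 0.0
    total_pos = sum(labels)
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / rank  # precision at this rank
    return ap / total_pos if total_pos else 0.0
```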

Temporal Modeling and Inter-Speaker Relation Modeling (TM_ISRM): Training, Feature Extraction and Postprocessing

Training

TM and ISRM stages can be trained with the following command:

python main.py --stage tm_isrm \
	--epochs 10 \
	--step_size 5 \
	--av_enc_learning_rate 3e-6 \
	--av_enc_batch_size 256

For validation results, there is no need to extract features and apply postprocessing: the training script directly produces mAP results for the active-speaker class. However, the following feature extraction and postprocessing steps are useful for the test set.

Feature Extraction

Use --forward to enable feature extraction, and use --resume_path to specify the saved model checkpoint to use for feature extraction.

Postprocessing

Use --postprocessing to enable postprocessing, which produces final/TM_ISRM.csv and final/gt.csv.

Citation

If you use this code or pre-trained models, please cite the following:

@article{kopuklu2021asdnet,
  title={How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild},
  author={K{\"o}p{\"u}kl{\"u}, Okan and Taseska, Maja and Rigoll, Gerhard},
  journal={arXiv preprint arXiv:2106.03932},
  year={2021}
}

Acknowledgements

We thank Juan Carlos Leon Alcazar for releasing the active-speakers-context codebase, from which we use the dataset preprocessing and data loaders.


