galaxycong / hpmdubbing

[CVPR 2023] Official code for paper: Learning to Dub Movies via Hierarchical Prosody Models.

License: MIT License


hpmdubbing's Introduction

HPMDubbing🎬 - PyTorch Implementation

In this paper, we propose a novel movie dubbing architecture via hierarchical prosody modeling, which bridges visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations, inspired by psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings are used together to generate a mel-spectrogram, which is then converted into speech waveforms by an existing vocoder. Extensive experimental results on the V2C and Chem benchmark datasets demonstrate the favourable performance of the proposed method.
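To make the data flow above concrete, here is a schematic, heavily simplified PyTorch sketch of the described hierarchy (lip features drive duration, facial valence/arousal features drive pitch and energy via attention, scene features act as a global emotion booster, and the fused embeddings condition mel-spectrogram frames). It is a toy illustration of the wiring only, not the actual HPMDubbing model; all layer sizes and names are assumptions.

```python
# Toy sketch of the hierarchical prosody wiring described in the abstract.
# NOT the actual HPMDubbing model; layer sizes and names are illustrative.
import torch
import torch.nn as nn

class HierarchicalProsodySketch(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.duration_from_lip = nn.Linear(d, 1)                   # lip -> speech duration
        self.face_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.pitch_head = nn.Linear(d, 1)                          # face context -> pitch
        self.energy_head = nn.Linear(d, 1)                         # face context -> energy
        self.scene_booster = nn.Linear(d, d)                       # scene -> global emotion
        self.mel_decoder = nn.Linear(d, 80)                        # stand-in mel decoder

    def forward(self, text, lip, face, scene):
        # text/lip/face: [B, T, d]; scene: [B, d]
        duration = self.duration_from_lip(lip)                     # align to lip movement
        ctx, _ = self.face_attn(text, face, face)                  # attend to facial V/A cues
        pitch, energy = self.pitch_head(ctx), self.energy_head(ctx)
        fused = text + ctx + self.scene_booster(scene).unsqueeze(1)
        mel = self.mel_decoder(fused)                              # [B, T, 80] mel frames
        return mel, duration, pitch, energy
```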

Recent Updates

[11/11/2023] Uploading the features of the chemistry lecture (Chem) dataset; the sampling rate is 16000 Hz.

[18/11/2023] Uploading the pre-trained models and adding missing details to ensure inference runs successfully.

[25/11/2023] An explanation about the challenge with the V2C-Animation dataset.

[3/12/2023] We plan to share the cropped mouth and face image regions we extracted, so readers can use them conveniently:

| Dataset | Frames (25 FPS) | Face image | Mouth image |
| --- | --- | --- | --- |
| Chem | BaiduDrive (frame) | GoogleDrive, BaiduDrive (face) | GoogleDrive, BaiduDrive (mouth) |
| V2C 2.0 | Download* | BaiduDrive (Ours) | BaiduDrive (Ours) |

We are sorry that, due to copyright issues, the V2C frames cannot be made public for now.

[10/12/2023] Uploading more details about the data preprocessing and the preprocessing scripts.

Before [8/7/2023]: Release the GRID dataset (extracted features and train/test split lists).

Before [15/7/2023]: Publish the source code and models of StyleDubber.


🌟 Below are the generated results of our method on the Chem dataset:

our_result2.mp4

📝Text: so the reaction quotient is actually just a reaction product, the product of the two ions.

our_result3.mp4

📝Text: now, there's also nitrogen in the flask, but it doesn't matter.

our_result4.mp4

📝Text: so we can make that easy connection between a wave and its length by the color that we see.

our_result1.mp4

📝Text: each gas will exert what's called a partial pressure.


🌟 Below are the generated results of our method on the V2C dataset:

🌟 Here, we also provide demo results from V2C-Net for comparison. These results are reproduced with the official V2C-Net code.

Compare1.mp4

📝Text: hey, are you okay?

🎬Source: TinkerII@Terence

Compare2.mp4

📝Text: i'm fishing!

🎬Source: CloudyII@Flint

Compare3.mp4

📝Text: well, thank you.

🎬Source: Ralph@Vanellope

Compare4.mp4

📝Text: Yes. I'm the baby Jesus.

🎬Source: Bossbaby@BossBaby

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Dataset

1) For V2C

V2C-MovieAnimation is a multi-speaker dataset for animation movie dubbing with identity and emotion annotations. It is collected from 26 Disney cartoon movies and covers 153 diverse characters. Due to copyright, we cannot directly provide the dataset; see the V2C issue.

In this work, we release V2C-MovieAnimation 2.0 to satisfy the requirement of dubbing specified characters. Specifically, we removed redundant character faces from the movie frames (please note that our video frames are sampled at 25 FPS by ffmpeg). You can download our preprocessed features directly via GoogleDrive or BaiduDrive (password: Good).
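If you need to extract frames from your own videos, the minimal sketch below shows 25 FPS frame extraction with ffmpeg called from Python. It assumes ffmpeg is on your PATH; the file and directory names are placeholders, not the repository's layout.

```python
# Minimal sketch: dump video frames as JPEGs at 25 FPS with ffmpeg.
# Assumes ffmpeg is installed; paths are placeholders.
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 25) -> None:
    """Extract frames at a fixed frame rate into out_dir as 00001.jpg, 00002.jpg, ..."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         str(Path(out_dir) / "%05d.jpg")],
        check=True,
    )

if __name__ == "__main__":
    extract_frames("clip_0001.mp4", "frames/clip_0001")
```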

2) For Chem

The Chem dataset is provided by Neural Dubber; it is the single-speaker chemistry lecture dataset from Lip2Wav.

In this work, we provide our features for the Chem dataset; you can download them from BaiduDrive (password: chem) or GoogleDrive. To ensure the speech content is guided by video information such as lip movement, during preprocessing we found some clips that solely contained PowerPoint slides without the presence of an instructor's face, and we removed them. Please note that a 16 kHz vocoder is used to generate waveforms for Chem.
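If your own audio is at a different sampling rate, a small illustrative helper (assuming librosa and soundfile are installed; the file names are placeholders) to resample a clip to 16 kHz so it matches the Chem vocoder setting:

```python
# Illustrative only: resample a wav to 16 kHz to match the Chem vocoder setting.
import librosa
import soundfile as sf

y, sr = librosa.load("chem_clip.wav", sr=16000)   # load and resample to 16 kHz
sf.write("chem_clip_16k.wav", y, 16000)
```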

Data Preparation

For voice preprocessing (mel-spectrograms, pitch, and energy), the Montreal Forced Aligner (MFA) is used to obtain alignments between utterances and phoneme sequences. Alternatively, you can skip the complicated steps below and use our extracted features directly.

Download the official Montreal Forced Aligner (MFA) package and run

./montreal-forced-aligner/bin/mfa_align /data/conggaoxiang/HPMDubbing/V2C_Data/wav16 /data/conggaoxiang/HPMDubbing/lexicon/librispeech-lexicon.txt  english /data/conggaoxiang/HPMDubbing/V2C_Code/example_V2C16/TextGrid -j

Then, run the scripts below to save the .npy files of mel-spectrograms, pitch, and energy for the two datasets, respectively.

python V2C_preprocess.py config/MovieAnimation/preprocess.yaml
python Chem_preprocess.py config/Chem/preprocess.yaml
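For reference, the sketch below illustrates the kind of computation such a step performs for a single utterance (mel-spectrogram, frame-level pitch, frame-level energy), using librosa and pyworld with the Chem-style setting (sr = 16000, hop = 160, win = 640). It is a hedged illustration, not the repository's actual preprocessing code, and the n_fft value is an assumption.

```python
# Minimal sketch (not the repo's scripts): mel-spectrogram, pitch, and energy
# for one utterance at the Chem-style setting (16 kHz, hop 160, win 640).
# Requires: pip install numpy librosa pyworld
import numpy as np
import librosa
import pyworld as pw

SR, HOP, WIN, N_FFT, N_MELS = 16000, 160, 640, 1024, 80

def extract_features(wav_path: str):
    y, _ = librosa.load(wav_path, sr=SR)
    # Log mel-spectrogram, shape [n_mels, T_mel]
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=WIN, n_mels=N_MELS)
    log_mel = np.log(np.clip(mel, 1e-5, None))
    # Frame-level pitch (F0) via WORLD (DIO + StoneMask refinement)
    f0, t = pw.dio(y.astype(np.float64), SR, frame_period=HOP / SR * 1000)
    f0 = pw.stonemask(y.astype(np.float64), f0, t, SR)
    # Frame-level energy: L2 norm of the STFT magnitude per frame
    stft = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN))
    energy = np.linalg.norm(stft, axis=0)
    return log_mel, f0, energy

if __name__ == "__main__":
    mel, pitch, energy = extract_features("example.wav")
    np.save("example_mel.npy", mel)
    np.save("example_pitch.npy", pitch)
    np.save("example_energy.npy", energy)
```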

For hierarchical visual feature preprocessing (lip, face, and scene), we detect and crop faces from the video frames using the $S^3FD$ face detection model. Then, we align the faces to generate 68 landmarks and bounding boxes (./landmarks and ./boxes). Finally, we extract the mouth ROIs from all video clips, following EyeLipCropper. Similarly, you can also skip the complex steps below and directly use the features we extracted.
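For illustration only (not the EyeLipCropper pipeline itself), the sketch below crops a fixed-size mouth ROI from a frame given a 68-point landmark array; points 48-67 are the mouth in the standard 68-point layout. The file paths, the (68, 2) .npy landmark format, and the 96-pixel crop size are assumptions.

```python
# Illustrative mouth-ROI crop from 68-point landmarks (not the repo's pipeline).
import numpy as np
from PIL import Image

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """frame: HxWx3 uint8 image; landmarks: (68, 2) array of (x, y) pixel coords."""
    mouth = landmarks[48:68]                      # mouth landmark subset
    cx, cy = mouth.mean(axis=0).astype(int)       # mouth centre
    half = size // 2
    h, w = frame.shape[:2]
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    roi = frame[y0:y1, x0:x1]
    return np.array(Image.fromarray(roi).resize((size, size)))

if __name__ == "__main__":
    frame = np.array(Image.open("frames/clip_0001/00001.jpg"))
    lms = np.load("landmarks/clip_0001/00001.npy")   # hypothetical (68, 2) array
    Image.fromarray(crop_mouth(frame, lms)).save("mouth_00001.jpg")
```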

We use the pre-trained weights of EmoNet to extract affective display features, and fine-tune the arousal and valence representations (dimension 256) based on the last layer of the EmoNet network.

python V2C_emotion.py -c emonet_8.pth -o /data/conggaoxiang/V2C_feature/example_V2C_framelevel/MovieAnimation/VA_feature -i /data/conggaoxiang/detect_face 
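As a rough, hedged illustration of this step (not the repository's V2C_emotion.py), the sketch below runs an emotion backbone over a clip's cropped faces and stacks one 256-dim vector per frame into a single .npy file. A torchvision ResNet-18 is used purely as a stand-in for the fine-tuned EmoNet, and the directory names are placeholders.

```python
# Hedged sketch: per-frame 256-dim affect features for one clip's cropped faces.
# A ResNet-18 stand-in replaces EmoNet here; swap in the real emonet_8.pth model.
import glob
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

@torch.no_grad()
def extract_va_features(model: torch.nn.Module, face_dir: str) -> np.ndarray:
    """Return a [num_frames, 256] array for one clip's cropped face images."""
    feats = []
    for path in sorted(glob.glob(f"{face_dir}/*.jpg")):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(model(img).squeeze(0).numpy())   # one vector per frame
    return np.stack(feats)

if __name__ == "__main__":
    backbone = models.resnet18(weights=None)                       # stand-in, not EmoNet
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, 256)    # 256-dim VA-style head
    backbone.eval()
    np.save("clip_0001_VA.npy", extract_va_features(backbone, "detect_face/clip_0001"))
```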

The lip features are extracted by resnet18_mstcn_video, which takes the grayscale mouth ROIs of each video as input.

python lip_main.py --modality video --config-path /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/configs/lrw_resnet18_mstcn.json --model-path /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/models/lrw_resnet18_mstcn_video.pth --data-dir /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/MOUTH_processing --annonation-direc /data/conggaoxiang/lip/LRW_dataset/lipread_mp4 --test
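For context, here is a minimal sketch of how grayscale mouth ROIs are typically stacked into the [1, T, H, W] tensor that a video lip-reading backbone consumes. The 88x88 size and directory layout are assumptions; the actual features in this repo come from lip_main.py with lrw_resnet18_mstcn_video.pth.

```python
# Illustration only: stack a clip's grayscale mouth ROIs into a [1, T, H, W] tensor.
import glob
import numpy as np
import torch
from PIL import Image

def load_mouth_clip(mouth_dir: str, size: int = 88) -> torch.Tensor:
    frames = []
    for path in sorted(glob.glob(f"{mouth_dir}/*.jpg")):
        img = Image.open(path).convert("L").resize((size, size))   # grayscale ROI
        frames.append(np.asarray(img, dtype=np.float32) / 255.0)
    clip = torch.from_numpy(np.stack(frames))   # [T, H, W]
    return clip.unsqueeze(0)                    # [1, T, H, W] with batch dimension

if __name__ == "__main__":
    clip = load_mouth_clip("MOUTH_processing/clip_0001")
    print(clip.shape)   # e.g. torch.Size([1, T, 88, 88])
```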

Finally, the scene features are extracted with the I3D model provided by V2C-Net.

python ./emotion_encoder/video_features/emotion_encoder.py

Vocoder

We provide the pre-trained models and implementation details of HPMDubbing_Vocoder. Please download the HPMDubbing vocoder and put it into the vocoder/HiFi_GAN_16/ or vocoder/HiFi_GAN_220/ folder. Before running, remember to check line 63 of model.yaml and change the path to your own.

vocoder:
  model: [HiFi_GAN_16] or [HiFi_GAN_220]
  speaker: "LJSpeech" 
  vocoder_checkpoint_path: [Your path]
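A quick, illustrative sanity check of that edit (assuming PyYAML is installed and the vocoder keys shown above) before starting training or synthesis:

```python
# Illustrative check: confirm the vocoder path in model.yaml points at a real file.
import os
import yaml

with open("config/MovieAnimation/model.yaml", "r") as f:
    cfg = yaml.safe_load(f)

voc = cfg["vocoder"]
print("vocoder model:", voc["model"], "| speaker:", voc["speaker"])
assert os.path.exists(voc["vocoder_checkpoint_path"]), \
    "Update vocoder_checkpoint_path in model.yaml to your local vocoder checkpoint."
```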

Training

For the V2C-MovieAnimation dataset, please run train.py with

python train.py -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml -p2 config/MovieAnimation/preprocess.yaml

For the Chem dataset, please run train.py with

python train.py -p config/Chem/preprocess.yaml -m config/Chem/model.yaml -t config/Chem/train.yaml -p2 config/Chem/preprocess.yaml

Pretrained models

We provide pre-trained models (including network and optimizer parameters) and training log files for the two dubbing datasets, Chem and V2C, to help you run inference.

| Dubbing Dataset | Pre-trained model | Vocoder | Training log |
| --- | --- | --- | --- |
| Chem (chemistry lecture dataset) | Download: GoogleDrive or BaiduDrive (password: q44c) | 16 kHz (More Details) | Download (831c) |
| V2C (V2C-Animation dataset) | Download: GoogleDrive or BaiduDrive (password: dxyv) | 22 kHz (More Details) | Download (phxe) |
Then run inference with

python Synthesis.py --restore_step [Checkpoint] -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml -p2 config/MovieAnimation/preprocess.yaml

Tensorboard

Use

tensorboard --logdir output/log/MovieAnimation --port=[Your port]

or

tensorboard --logdir output/log/Chem --port=[Your port]

to serve TensorBoard on your localhost. The loss curves, MCD curves, synthesized mel-spectrograms, and audio samples are shown.

Some Q&A

Q&A: Why is the Synchronization_coefficient set to 4? Can I change it to another positive integer? It follows the formula $n = \frac{T_{mel}}{T_v} = \frac{sr/hs}{FPS} \in \mathbb{N}^{+}$. For example, in our paper, for the Chem dataset we set sr = 16000 Hz, hs = 160, win = 640, and FPS = 25, so n = 4; for the V2C dataset we set sr = 22050 Hz, hs = 220, win = 880, and FPS = 25, so n = 4.009 (this is the meaning of the approximately-equal sign in the article).
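A quick way to check the coefficient for any setting (plain Python, using the values quoted above):

```python
# Synchronization coefficient n = (sr / hop_size) / FPS; it should be a positive integer.
def sync_coefficient(sr: int, hop_size: int, fps: int) -> float:
    return (sr / hop_size) / fps

print(sync_coefficient(16000, 160, 25))   # 4.0    (Chem setting)
print(sync_coefficient(22050, 220, 25))   # ~4.009 (V2C setting, approximately 4)
print(sync_coefficient(22050, 256, 25))   # ~3.445 (official HiFi-GAN setting, not an integer)
```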

Q&A: Why use different sampling rates for the two datasets? Because we need to keep the same experimental settings as the original papers. Specifically, for the Chem dataset, Neural Dubber used 16000 Hz in their experiments; for the V2C dataset, V2C-Net (Chen et al.) reported results at 22050 Hz. As a next step, we plan to provide a 16 kHz / 24 kHz version of the V2C dataset, or a 22050 Hz / 24 kHz version of the Chem dataset.

Q&A: Why did you provide two specialized vocoders? Can I use the official HiFi-GAN pre-trained model to replace them? In the official HiFi-GAN, sr is 22050 Hz, hop_size is 256, and win_size is 1024; under that setting, n = (22050/256)/25 ≈ 3.45, which is not a positive integer, so we suggest using our vocoders to satisfy the formula above. We have released our pre-trained models (HPM_Chem, HPM_V2C); you can download them.

Acknowledgement

Citation

If our research and this repository are helpful to your work, please cite with:

@inproceedings{cong2023learning,
  title={Learning to Dub Movies via Hierarchical Prosody Models},
  author={Cong, Gaoxiang and Li, Liang and Qi, Yuankai and Zha, Zheng-Jun and Wu, Qi and Wang, Wenyu and Jiang, Bin and Yang, Ming-Hsuan and Huang, Qingming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14687--14697},
  year={2023}
}


hpmdubbing's Issues

Some questions about code details

Hello author, first of all, thank you very much for your previous answers in the issues. I saw that you recently updated the code again, but there are some details of the update that I have not yet understood. I hope I can add you on WeChat to ask about them in detail. My WeChat ID is:
Thank you very much for taking the time to read this, and I look forward to your reply!

About

I tried to download the movies from the link that is provided in V2C. All the movies seem to be protected, and they can't be edited.

GoogleDrive download - pre-trained model & VA_feature

How have you been?
I'm shamelessly creating issues. I started in February, and
now I'm using EmoNet to create the 256-dim feature (this is all I have left).
Of course, I'm half-doubtful that I'm doing this correctly.
What's curious about this part is that emotions are recognized for every cropped face image in a frame, but the other features have one .npy per basename, right?
I don't know how the VA feature (256-dim) should be organized for successive frames under one basename.
Frame/{basename}/00001.jpg or Cropped Face/{basename}/00001.jpg -> 1 VA feature (256-dim)
Do you store the VA features as an array per basename? Or, um, I want to know how you organized it.

So, for this reason...
I need your pretrained model to get some insight; could you please provide it?
Of course, I would really appreciate it if you could explain the VA feature as well.

The provided Google Drive link needs your permission.
In the case of Baidu, I am a foreigner, so it is not easy to use.

I am finding it difficult to perform data preprocessing on my custom video.

Dear Author,

Thank you for your amazing work.

I am really interested in running inference with your model on my custom dataset, but currently I do not understand the role of MFA or how to do inference on a custom dataset. Can you please guide me on how to run inference on a custom video?

Thanks in advance.

I think this code is wrong, please confirm

First, I ran dataset.py to prepare the dataset,
and then I found that line number 377 in dataset.py seems wrong,
because in your code there is no utils/utils.
So I think it should be: from utils.tools import to_device.
Is this correct?

Also, I can't find this location:
open("./config/LJSpeech/preprocess.yaml", "r")
Where is it? And what is it?

Pre-trained model

Excellent work, but
when I use the pre-trained model provided by the authors to synthesize audio, the output is almost entirely electrical noise. Why is that?
The mel-spectrogram completely collapses.
This is the audio synthesized on the val set: image
These are the pre-trained parameters used:
image

Later, I retrained the model myself using the preprocessed features provided by the authors; this is the result synthesized by the checkpoint saved at step 30000.
image
image

The results are still not good, but at least it is not all electrical noise.
