
VSP-LLM (Visual Speech Processing incorporated with LLMs)

This is the PyTorch code for Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing. The code is developed on top of the AV-HuBERT codebase.

Introduction

We propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped into the input latent space of an LLM by employing a self-supervised visual speech model. Motivated by the fact that input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner.
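As a rough illustration of the deduplication idea (not the exact implementation in this repository), consecutive frames assigned to the same visual speech unit can be collapsed into a single averaged embedding; the run length of each collapsed unit is presumably what the cluster_counts files described later record.

```python
import torch

def deduplicate(features: torch.Tensor, units: torch.Tensor):
    """Collapse runs of identical visual speech units.

    features: (T, D) frame-level visual features
    units:    (T,)   cluster index assigned to each frame
    Returns one averaged feature per run and the run lengths.
    """
    assert features.size(0) == units.size(0)
    dedup_feats, counts = [], []
    start = 0
    for t in range(1, units.size(0) + 1):
        if t == units.size(0) or units[t] != units[start]:
            dedup_feats.append(features[start:t].mean(dim=0))
            counts.append(t - start)
            start = t
    return torch.stack(dedup_feats), torch.tensor(counts)

# Example: 6 frames assigned to units [3, 3, 7, 7, 7, 2] -> 3 deduplicated embeddings
feats = torch.randn(6, 1024)
units = torch.tensor([3, 3, 7, 7, 7, 2])
dedup, counts = deduplicate(feats, units)
print(dedup.shape, counts.tolist())  # torch.Size([3, 1024]) [2, 3, 1]
```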

[Figure: vsr-vst overview]

Model checkpoint

You can find the checkpoint of our model here. Move the checkpoint to checkpoints.

Preparation

conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM
pip install -r requirements.txt
cd fairseq
pip install --editable ./
  • Download the AV-HuBERT pre-trained model AV-HuBERT Large (LRS3 + VoxCeleb2) from here.
  • Download LLaMA2-7B from here.

Move the AV-HuBERT pre-trained model checkpoint and the LLaMA2-7B checkpoint to checkpoints.
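A small sanity check that the downloads ended up where the scripts expect them. The file names below are placeholders, not the names used by the repository's scripts; adjust them to whatever you actually downloaded.

```python
from pathlib import Path

# Hypothetical file/directory names -- rename to match your downloads.
required = [
    "checkpoints/large_vox_iter5.pt",   # AV-HuBERT Large (LRS3 + VoxCeleb2)
    "checkpoints/Llama-2-7b-hf",        # LLaMA2-7B checkpoint directory
]
for path in map(Path, required):
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```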

Data preprocessing

Follow Auto-AVSR preparation to preprocess the LRS3 dataset.
Then, follow AV-HuBERT preparation from step 3 to create the manifest of the LRS3 dataset.

Generate visual speech unit and cluster counts file

Follow the steps in clustering to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25Hz for AV-HuBERT features by default.
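The cluster_counts files used below can be derived from these frame-aligned labels by run-length encoding the unit sequence. A hedged sketch (the repository's own tooling may differ), assuming each line of the .km file is one utterance of space-separated cluster ids:

```python
# Sketch: derive per-utterance cluster_counts from a frame-aligned .km file
# by run-length encoding each utterance's unit sequence.
from itertools import groupby

def km_line_to_counts(line: str) -> str:
    units = line.split()
    # One count per run of identical consecutive units.
    return " ".join(str(len(list(group))) for _, group in groupby(units))

with open("train.km") as src, open("train.cluster_counts", "w") as dst:
    for line in src:
        dst.write(km_line_to_counts(line) + "\n")
```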

Dataset layout

.
├── lrs3
│     ├── lrs3_video_seg24s               # Preprocessed video and audio data
│     └── lrs3_text_seg24s                # Preprocessed text data
├── muavic_dataset                        # Mix of VSR data and VST (En-X) data
│     ├── train.tsv                       # List of audio and video paths for training
│     ├── train.wrd                       # List of target labels for training
│     ├── train.cluster_counts            # Cluster counts used to deduplicate speech units for training
│     ├── test.tsv                        # List of audio and video paths for testing
│     ├── test.wrd                        # List of target labels for testing
│     └── test.cluster_counts             # Cluster counts used to deduplicate speech units for testing
└── test_data
      ├── vsr
      │    └── en
      │        ├── test.tsv 
      │        ├── test.wrd  
      │        └── test.cluster_counts           
      └── vst
           └── en
               ├── es
               :   ├── test.tsv
               :   ├── test.wrd 
               :   └── test.cluster_counts
               └── pt
                   ├── test.tsv
                   ├── test.wrd 
                   └── test.cluster_counts

Test data

The test manifests are provided in labels. You need to replace the LRS3 path in the manifest files with the path to your preprocessed LRS3 dataset using the following command:

cd src/dataset
python replace_path.py --lrs3 /path/to/lrs3

The modified test manifests are then saved in dataset.
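If you prefer to do the substitution by hand, the idea behind replace_path.py is presumably a plain prefix replacement over the manifest files. A minimal sketch, not the script itself; the old prefix below is an assumption, so inspect a line of a shipped test.tsv to find the actual one:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--lrs3", required=True, help="path to your preprocessed LRS3")
parser.add_argument("--old-prefix", default="/path/to/lrs3",
                    help="prefix used in the shipped manifests (assumption)")
args = parser.parse_args()

# Rewrite every .tsv manifest under labels/ with the new LRS3 root.
for tsv in Path("labels").rglob("*.tsv"):
    tsv.write_text(tsv.read_text().replace(args.old_prefix, args.lrs3))
    print(f"updated {tsv}")
```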

Training

Open the training script (scripts/train.sh) and replace these variables:

# path to train dataset dir
DATA_PATH=???

# path where output trained models will be located
OUT_PATH=???

Run the training script:

$ bash scripts/train.sh

Decoding

Open the decoding script (scripts/decode.sh) and replace these variables:

# language direction (e.g., 'en' for the VSR task / 'en-es' for the En-to-Es VST task)
LANG=???

# path to the trained model
MODEL_PATH=???

# path where decoding results and scores will be located
OUT_PATH=???

Run the decoding script:

$ bash scripts/decode.sh
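The decoding script writes hypotheses and scores to OUT_PATH. If you want to re-score offline, the usual metrics are WER for VSR and BLEU for VST. A sketch using the third-party jiwer and sacrebleu packages, which is not necessarily how decode.sh itself computes its scores:

```python
# pip install jiwer sacrebleu   (third-party packages; decode.sh may score differently)
import jiwer
import sacrebleu

refs = [line.strip() for line in open("ref.txt")]
hyps = [line.strip() for line in open("hypo.txt")]

# VSR: word error rate
print("WER :", jiwer.wer(refs, hyps))

# VST: corpus BLEU (sacrebleu expects a list of reference lists)
print("BLEU:", sacrebleu.corpus_bleu(hyps, [refs]).score)
```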


vsp-llm's Issues

How do you perform model selection?

Thanks for the great work.

After thoroughly reading your code, I'm wondering how you perform model selection.
Did you average the last several checkpoints, or did you evaluate each checkpoint on the validation set and then average the several best ones?

I did not find any clue in the code. Any reply will be appreciated, thanks!

Where is the demo code?

This is excellent work, but I cannot find the demo code to evaluate the authors' model.
