This project forked from sally-sh/vsp-llm

License: Other

VSP-LLM (Visual Speech Processing incorporated with LLMs)

This is the PyTorch code for "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing". The code builds on the AV-HuBERT codebase.

  • add colab demo

Introduction

We propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), that maximizes context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform multiple tasks, visual speech recognition and translation, with the given instruction controlling the type of task. The input video is mapped to the input latent space of an LLM by a self-supervised visual speech model. Because input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner.
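The deduplication idea can be illustrated with a minimal sketch. It assumes that consecutive frames assigned to the same visual speech unit are merged by averaging their features. The averaging rule, the function name `deduplicate`, and the toy data are illustrative assumptions, not the repository's implementation.

```python
from itertools import groupby


def deduplicate(features, units):
    """Merge consecutive frames that share a speech unit (assumed: by averaging)."""
    if len(features) != len(units):
        raise ValueError("features and units must have one entry per frame")
    reduced = []
    start = 0
    for _, run in groupby(units):
        length = len(list(run))                 # frames in this run of equal units
        chunk = features[start:start + length]  # feature vectors of those frames
        dim = len(chunk[0])
        reduced.append([sum(f[d] for f in chunk) / length for d in range(dim)])
        start += length
    return reduced


# Toy example: 5 frames with 2-dimensional features and one unit ID per frame.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 8.0]]
units = [7, 7, 2, 5, 5]
print(deduplicate(features, units))  # [[2.0, 0.0], [0.0, 2.0], [5.0, 6.0]]
```

In this toy case five frames shrink to three feature vectors, so the LLM would receive a shorter input sequence.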

[Figure: vsr-vst]

Model checkpoint

The checkpoint of our model is available here.

Demo

Try our VSP-LLM demo on Colab.

Preparation

conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM
pip install -r requirements.txt
  • Download the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2) model from here.
  • Download LLaMA2-7B from here.

Data preprocessing

Follow the Auto-AVSR preparation to preprocess the LRS3 dataset.
Then follow the AV-HuBERT preparation from step 3 to create the manifest of the LRS3 dataset.

Generate visual speech units and the cluster counts file

Follow the steps in clustering (pre-train only) to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25Hz for AV-HuBERT features by default.
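As a rough illustration of how frame-aligned labels relate to cluster counts, the sketch below turns one line of a `.km` file into run lengths of consecutive identical cluster IDs. It assumes each `.km` line is a space-separated sequence of cluster IDs, one per frame, and that a cluster counts entry is the list of run lengths. Both the run-length format and the helper names are assumptions for illustration, not taken from the repository.

```python
from itertools import groupby

LABEL_RATE_HZ = 25  # feature frame rate stated above for AV-HuBERT features


def cluster_counts(km_line):
    """Return the length of each run of consecutive identical cluster IDs."""
    ids = km_line.split()
    return [len(list(run)) for _, run in groupby(ids)]


def duration_seconds(km_line, label_rate=LABEL_RATE_HZ):
    """Duration covered by the labels, given one label per frame."""
    return len(km_line.split()) / label_rate


line = "12 12 12 40 40 7 12 12"   # hypothetical labels for 8 frames
print(cluster_counts(line))       # [3, 2, 1, 2]
print(duration_seconds(line))     # 0.32
```

The counts sum to the number of frames, which gives a quick consistency check between a label file and its counts.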

Dataset layout

.
├── lrs3_video_seg24s                     # preprocessed video and audio data
├── lrs3_text_seg24s                      # preprocessed text data
└── lrs3_dataset                          
      ├── train.tsv                       # List of audio and video paths for training
      ├── train.wrd                       # List of target labels for training
      ├── train.cluster_counts            # List of cluster counts used to deduplicate speech units in training
      ├── test.tsv                        # List of audio and video paths for testing
      ├── test.wrd                        # List of target labels for testing
      └── test.cluster_counts             # List of cluster counts used to deduplicate speech units in testing
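Before training, it may help to verify that the dataset directory matches the layout above. The following is a minimal sketch. The helper `missing_files` is hypothetical, and the demo uses a temporary directory rather than a real `lrs3_dataset` folder.

```python
import tempfile
from pathlib import Path

# File names taken from the layout above.
EXPECTED_FILES = [
    f"{split}.{ext}"
    for split in ("train", "test")
    for ext in ("tsv", "wrd", "cluster_counts")
]


def missing_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a temporary directory that contains only two of the six files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.tsv", "train.wrd"):
        (Path(tmp) / name).touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

For a real dataset, calling `missing_files("lrs3_dataset")` should return an empty list when the layout is complete.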

Training

Open the training script (scripts/train.sh) and replace these variables:

# path to downloaded pre-trained avhubert
PRETRAINED_MODEL_PATH=???

# path to train dataset dir
DATA_PATH=???

# path to llama checkpoint
LLM_PATH=???

# path where output trained models will be located
OUT_PATH=???

Run the training script:

$ bash scripts/train.sh
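A forgotten `???` placeholder is an easy mistake to make here. The sketch below scans the text of a script for variables that still hold that value. The helper `unset_variables` and the example paths are hypothetical and are not part of the repository.

```python
import re

# Matches lines of the form NAME=??? (optionally surrounded by whitespace).
PLACEHOLDER = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)=\?\?\?\s*$")


def unset_variables(script_text):
    """Return the names of variables still set to the ??? placeholder."""
    names = []
    for line in script_text.splitlines():
        match = PLACEHOLDER.match(line)
        if match:
            names.append(match.group(1))
    return names


# Hypothetical, partially edited excerpt of a training script.
example = """\
# path to downloaded pre-trained avhubert
PRETRAINED_MODEL_PATH=/models/avhubert_large.pt
DATA_PATH=???
LLM_PATH=???
OUT_PATH=/exp/vsp-llm
"""
print(unset_variables(example))  # ['DATA_PATH', 'LLM_PATH']
```

To check a script on disk, pass its contents, for example `unset_variables(Path("scripts/train.sh").read_text())` after importing `Path` from `pathlib`. The same check applies to `scripts/decode.sh`.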

Decoding

Open the decoding script (scripts/decode.sh) and replace these variables:

# language direction (e.g. "en" or "en-fr")
LANG=???

# path to the trained model
MODEL_PATH=???

# path to test dataset dir
DATA_PATH=???

# path to llama checkpoint
LLM_PATH=???

# path where decoding results and scores will be located
OUT_PATH=???

Run the decoding script:

$ bash scripts/decode.sh
