akashe / multimodal-action-recognition

Code for selecting an action based on multimodal inputs; in this case the inputs are voice and text.

Topics: multimodal-deep-learning, multimodality, multimodal-fusion, multimodal-learning, multimodal-data, multimodal-action-recognition, cross-attention

multimodal-action-recognition's Introduction

Multimodal action recognition:

Task Definition

Given an audio snippet and its transcript, classify the 'action', 'object' and 'position/location' referred to in them.
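
As a purely illustrative sketch of what the three label slots look like (the utterance and label values below are made up; the real label vocabulary comes from the dataset's CSV files):

# Hypothetical example of one sample: the transcript plus the three slots the model predicts.
sample = {
    "transcript": "Turn up the heat in the kitchen",  # made-up utterance
    "action": "increase",                             # illustrative label values only;
    "object": "heat",                                 # the real classes come from the
    "position": "kitchen",                            # train/valid CSV files
}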

Set up virtual environment

To set up a python virtual environment with the required dependencies:

python3 -m venv multimodal_classification
source multimodal_classification/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt --use-feature=2020-resolver
python -m spacy download en

Train & eval

Set the location of the downloaded wav files in 'wavs_location' in config.yaml. You can also set the locations of custom train and eval files in config.yaml.

To train the model, run

python train.py --config config.yaml

where config.yaml contains the locations of the train and valid files, saved_model_location, and other hyperparameters.
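
For orientation, config.yaml might look roughly like the sketch below. Only the keys mentioned in this README (wavs_location, valid_file, saved_model_location, log_path, audio_split_samples and the three layer counts) are taken from the text; the other names and all values are placeholders, so check the repository's config.yaml for the authoritative fields and defaults.

wavs_location: data/wavs/                # folder with the downloaded wav files
train_file: data/train.csv               # assumed key name; the README only says train/eval files are set here
valid_file: data/valid.csv               # evaluation CSV (point this at your own file to eval on it)
saved_model_location: saved_models/      # where checkpoints are written
log_path: experiment_1                   # experiment name; becomes a sub-folder of runs/
audio_split_samples: 1000                # samples per audio token (see the Model section)
audio_representation_layers: 3
text_representation_layers: 2
cross_attention_layers: 2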

To evaluate, run

python eval.py --config config.yaml

During evaluation, the model accepts CSV files whose location is specified in the config file. Single-sentence inference is not supported, because the corresponding audio sample would also be needed.

To evaluate with your own CSV file, copy it into the data folder and update the 'valid_file' parameter in the config.

Tensorboard & logs

The logs are present in 'logs/'.

To visualize with TensorBoard, use the event files in 'runs/'. The sub-folders in 'runs/' are named after the experiment name you set in the config as 'log_path'.

tensorboard --logdir path_to_tensorboard_logs

TensorBoard logs are written under config.log_path, in a sub-folder for the specific mode you are training the model in.

Model

We use a multimodal model here. The model consists of 3 main components:

  1. Audio self-attention: these layers compute self-attention over the audio signal. We take the original audio length and split it into equal parts controlled by the parameter audio_split_samples. So, if the original audio length is 60000 and audio_split_samples = 1000, the audio is divided into 60 tokens (see the sketch after this list).
  2. Text self-attention: these layers compute self-attention over the text representations.
  3. Cross-attention: after obtaining the text and audio representations, we compute cross-attention between them and use the result for prediction.
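
As a rough illustration of how step 1 turns a waveform into tokens, the sketch below chops a raw audio tensor into fixed-length chunks of audio_split_samples samples each. The function name and shapes are hypothetical and not taken from the repository's code.

import torch

def split_audio_into_tokens(waveform: torch.Tensor, audio_split_samples: int) -> torch.Tensor:
    # Turn (batch, audio_len) into (batch, num_tokens, audio_split_samples) so each
    # fixed-length chunk can be treated as one "token" by the audio self-attention layers.
    batch, audio_len = waveform.shape
    num_tokens = audio_len // audio_split_samples                 # e.g. 60000 // 1000 = 60
    waveform = waveform[:, : num_tokens * audio_split_samples]    # drop any trailing remainder
    return waveform.reshape(batch, num_tokens, audio_split_samples)

# A 60000-sample clip with audio_split_samples = 1000 yields 60 audio tokens.
tokens = split_audio_into_tokens(torch.randn(2, 60000), audio_split_samples=1000)
print(tokens.shape)  # torch.Size([2, 60, 1000])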

Each layer applies the following sequence of operations (a code sketch follows the list):

  1. Calculate attention. (Note: in the cross-attention case, the audio representations serve as keys and values, and the text representations serve as queries.)
  2. LayerNorm + residual connection
  3. Pointwise Feedforward.
  4. LayerNorm + residual connection.
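
The snippet below is a minimal PyTorch sketch of one such post-norm block, written for the cross-attention case where the text representations are the queries and the audio representations are the keys and values; the class name, dimensions, and activation choice are illustrative, not the repository's actual modules.

import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    # Illustrative post-norm block: attention -> add & norm -> feed-forward -> add & norm.
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Steps 1-2: attention (text = query, audio = key/value), then residual + LayerNorm.
        attn_out, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + attn_out)
        # Steps 3-4: pointwise feed-forward, then residual + LayerNorm.
        return self.norm2(x + self.ff(x))

# A self-attention layer follows the same pattern with query = key = value.
block = CrossAttentionBlock(d_model=256, n_heads=4, d_ff=512)
out = block(torch.randn(2, 20, 256), torch.randn(2, 60, 256))   # (batch, text_len, d_model)
print(out.shape)  # torch.Size([2, 20, 256])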

Following https://arxiv.org/pdf/2104.11178v1.pdf, we encode the raw audio directly for the transformer and do not use a Mel spectrogram or any other feature extractor.

Results

Results with 3 audio_representation_layers, 2 text_representation_layers, and 2 cross_attention_layers:

We get an average validation F1 of 1.0:

  1. 'action_f1': 1.0 and 'action_accuracy': 100%
  2. 'object_f1': 1.0 and 'object_accuracy': 100%
  3. 'position_f1': 1.0 and 'position_accuracy': 100%

See logs/train_logs.log (line 933); you can also refer to eval_logs.log.

multimodal-action-recognition's People

Contributors

akashe


multimodal-action-recognition's Issues

Do I need position Embedding before CrossAttentionLayer?

Well done! I've also been studying multimodal tasks recently. If I have frame-level video embeddings (bs, len, dim) as well as text embeddings, do they need positional embeddings before the CrossAttentionLayer? Looking forward to your reply.

Error when running train.py

Thanks for your work. While running train.py, I get an error:
OSError: [E050] Can't find model 'en_core_web_sm'.
What could be the reason?

Where to download the wave files?

Hello, I have seen in the README that the downloaded data should be placed according to what the CSV file specifies, but I could not find where to download these wav datasets.

environment issues

Hi, I'd like to ask about the detailed requirements for the environment, such as the torch version.
I'm a rookie still studying this domain, so forgive my basic question. Thanks!
