The multimodalsr from 00mjk

This repository contains most of the code for my thesis 'Design, Implementation and Analysis of a Deep Convolutional-Recurrent Neural Network for Speech Recognition through Audiovisual Sensor Fusion' at the ESAT (Electrical Engineering) Department of KU Leuven (2016-2017).

Author: Matthijs Van keirsbilck
Supervisor: Bert Moons
Promotor: Marian Verhelst

The code and thesis text are bound by the KU Leuven's Student Thesis Copyright Regulations.

The CNN-LSTM networks for lipreading are combined with LSTM networks for audio recognition through an attention mechanism.
These networks achieve state-of-the-art phoneme recognition performance on the publicly available audio-visual dataset TCD-TIMIT. Systems that rely only on audio suffer greatly when audio quality is lowered by noise, as is often the case in real-life situations.
This performance loss can be greatly mitigated by adding visual information.
The CNN-LSTM neural networks achieve 68.46% correctness compared to the 57.85% baseline.
Audio-only neural networks achieve 67.03% compared to 65.47% in the baseline.
Lipreading-audio combination networks achieve 75.70% accuracy for clean audio, and 58.55% for audio with an SNR of 0 dB. The baseline multimodal network achieved 59% and 44% for clean and noisy audio, respectively.
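
To make the "SNR of 0 dB" condition concrete, here is a minimal sketch of how noise can be scaled and added to clean speech to reach a target signal-to-noise ratio. It is not taken from this repository: the function names (mean_power, mix_at_snr, measure_snr_db) and the synthetic tone and white noise are assumptions for illustration only, and plain Python lists are used to keep the sketch dependency-free.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so that the mix has the requested SNR (dB)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power we need.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noisy):
    """Measure the SNR of a mix, given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Hypothetical stand-ins for real recordings: a 440 Hz tone and white noise,
# one second each at an assumed 16 kHz sample rate.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy_0db = mix_at_snr(speech, noise, snr_db=0.0)
noisy_10db = mix_at_snr(speech, noise, snr_db=10.0)
```

At 0 dB the added noise carries the same average power as the speech itself, which is why audio-only recognition degrades so strongly in that condition.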

The networks are implemented using Lasagne.
There is room for improvement of the code; I'll try to improve it if I can find the time.

For downloading, preprocessing, etc. of the dataset, see https://github.com/matthijsvk/TCDTIMITprocessing
For the lipreading networks, see the folder code/lipreading
For the audio speech recognition networks, see code/audioSR
For the combination networks, see code/combinedSR

Thanks to the authors of all the data and software used in this work. An inexhaustive list:

To set up Python, I recommend using Anaconda. You can use the provided environment.yml to install all Python packages (although some aren't used anymore).
For the installation of Theano/Lasagne and CUDA, I recommend following this tutorial.

If you find this thesis or code useful, please cite it using the following BibTeX entry:

@MastersThesis{Vankeirsbilck:Thesis:2017,
    author     =     {Matthijs Van keirsbilck},
    title     =     {{Design, implementation and analysis of a deep convolutional-recurrent neural network for speech recognition through audiovisual sensor fusion}},
    school     =     {KU Leuven},
    address     =     {Belgium},
    year     =     {2017},
    }
