This project forked from matthijsvk/multimodalsr

Multimodal speech recognition using lipreading (with CNNs) and audio (using LSTMs). Sensor fusion is done with an attention network.

License: MIT License

Shell 0.16% JavaScript 0.02% C++ 0.61% Python 14.68% Perl 0.50% C 12.99% MATLAB 0.21% CSS 6.14% TeX 5.52% Cuda 0.09% Makefile 0.24% Jupyter Notebook 58.81% M4 0.03%

multimodalsr's Introduction

This is the repository containing most of the code for my thesis 'Design, Implementation and Analysis of a Deep Convolutional-Recurrent Neural Network for Speech Recognition through Audiovisual Sensor Fusion' at the ESAT (Electrical Engineering) Department of KU Leuven (2016-2017).

Author: Matthijs Van keirsbilck
Supervisor: Bert Moons
Promotor: Marian Verhelst

The code and thesis text are bound by KU Leuven's Student Thesis Copyright Regulations.


The CNN-LSTM networks for lipreading are combined with LSTM networks for audio recognition through an attention mechanism.
These networks achieve state-of-the-art phoneme recognition performance on the publicly available audio-visual dataset TCD-TIMIT. Systems that rely only on audio suffer greatly when audio quality is lowered by noise, as is often the case in real-life situations.
This performance loss can be greatly mitigated by adding visual information.
The CNN-LSTM neural networks achieve 68.46% correctness compared to the 57.85% baseline.
Audio-only neural networks achieve 67.03% compared to 65.47% in the baseline.
Lipreading-audio combination networks achieve 75.70% accuracy for clean audio, and 58.55% for audio with an SNR of 0 dB. The baseline multimodal network achieved 59% and 44% for clean and noisy audio, respectively.
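
To make the "SNR of 0 dB" condition concrete, here is a minimal sketch of mixing noise into a clean signal at a target signal-to-noise ratio. It is an illustration only: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, and white Gaussian noise is an assumption, since this README does not say which noise type or mixing procedure was used in the thesis.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_clean / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    p_clean = sum(x * x for x in clean) / len(clean)
    residual = [y - x for x, y in zip(clean, noisy)]
    p_residual = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(p_clean / p_residual)


# Assumed test signal: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy_0db = mix_at_snr(clean, noise, snr_db=0.0)
noisy_10db = mix_at_snr(clean, noise, snr_db=10.0)
```

At 0 dB the scaled noise carries the same average power as the speech signal, which is why audio-only systems degrade so strongly in that condition.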


The networks are implemented using Lasagne.
There is room for improvement of the code; I'll try to improve it if I can find the time.

For downloading and preprocessing the dataset, see https://github.com/matthijsvk/TCDTIMITprocessing
For the lipreading networks, see the folder code/lipreading
For the audio speech recognition networks, see code/audioSR
For the combination networks, see code/combinedSR
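
As a rough illustration of what attention-based sensor fusion means, the sketch below weights an audio feature vector and a lipreading feature vector with softmax attention and sums them. This is a generic, framework-free example, not the Lasagne implementation in code/combinedSR. The function `attention_fuse`, the linear scoring vectors `w_audio` and `w_lip`, and the feature values are all hypothetical, and the thesis networks may be structured quite differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(audio_feat, lip_feat, w_audio, w_lip):
    """Fuse two same-sized feature vectors with softmax attention weights."""
    if len(audio_feat) != len(lip_feat):
        raise ValueError("feature vectors must have the same length")
    # One scalar relevance score per modality (assumed linear scorers).
    scores = [dot(w_audio, audio_feat), dot(w_lip, lip_feat)]
    alpha_audio, alpha_lip = softmax(scores)
    fused = [alpha_audio * a + alpha_lip * v
             for a, v in zip(audio_feat, lip_feat)]
    return fused, (alpha_audio, alpha_lip)


audio_feat = [0.2, -0.1, 0.4]
lip_feat = [0.5, 0.3, -0.2]

# Zero scoring vectors give equal scores, so both modalities get weight 0.5.
fused, weights = attention_fuse(audio_feat, lip_feat, [0.0] * 3, [0.0] * 3)

# A scorer that favours the lip features shifts the weight towards them.
_, lip_heavy = attention_fuse(audio_feat, lip_feat, [0.0] * 3, [4.0, 4.0, 0.0])
```

In a trained network the scores would be learned, so the fusion can lean on the visual stream when the audio is noisy.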

Thanks to the authors of all the data and software used in this work. A non-exhaustive list:

To set up Python, I recommend using Anaconda. You can use the provided environment.yml to install all Python packages (although some aren't used anymore).
For the installation of Theano/Lasagne and CUDA, I recommend following this tutorial.

If you find this thesis or code useful, please cite it using the following BibTeX entry:

@MastersThesis{Vankeirsbilck:Thesis:2017,
    author     =     {Matthijs Van keirsbilck},
    title     =     {{Design, implementation and analysis of a deep convolutional-recurrent neural network for speech recognition through audiovisual sensor fusion}},
    school     =     {KU Leuven},
    address     =     {Belgium},
    year     =     {2017},
    }
