GithubHelp home page GithubHelp logo

yjingyu / wavenet-demo Goto Github PK

View Code? Open in Web Editor NEW

This project forked from cynthiasuwi/wavenet-demo

0.0 1.0 0.0 22 KB

A TensorFlow implementation for Chinese speech recognition based on DeepMind's WaveNet

License: GNU General Public License v3.0

Python 100.00%

wavenet-demo's Introduction

Speech-to-Text-WaveNet : End-to-end sentence level Chinese speech recognition using DeepMind's WaveNet

A tensorflow implementation for Chinese speech recognition based on DeepMind's WaveNet: A Generative Model for Raw Audio.(Related Paper)

The architecture of Wavenet is very interesting, integrating dilated CNN, residual network, CTC, the gate in LSTM, the 1*1 convolution kernel and other classic structures. The architecture is shown in the following figure:

My practice is based on THCHS-30. Because there is not enough time to optimize the parameter of the model. At present, the recognition accuracy is not high. This repository will continue to update as a record.

  1. CTC is the standard configuration of the end to end speech recognition system nowadays. It solves the one-to-multiply mapping problem of text sequence and output of neural network model, therefore can help construct end to end system.
  2. The research on the 11 convolution kernal was first seen in the 14 year paper Network In Network, and was later carried forward by Google in the Inception model. It can integrate information from multi channels for nonlinear transformation, and can increase or reduce dimensions of channels(all convolution kernal can change dimensions of channels, but the advantage of 11 kernal is that it uses least parameters), thus can jump links in neural network models.
  3. In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers.By stacking ResNet blocks on top of each other, you can form a very deep network.
  4. The classic gate structure in the LSTM model can effectively extract input information and use it in long length (long time) effective information correlation architecture, such as Natural Language Processing.
  5. A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution.

As shown above, the network layer in the model(n_layers) is 4, the size of the convolution kernel(filter_width) is 2, the classical CNN model, receptive field size is 2+4-1=5, using only input data of 5 dimensional information, and the dilated CNN model of receptive field size is the fourth power of 2(16). This is similar to pooling or strided convolutions, but here the output has the same size as the input.

The strategies implemented in the codes includes:

  1. Xavier initialization
  2. Shuffle data for each epoch
  3. Batch normalization
  4. 3 blocks is implemented, each contains 5 dilated convolutional layers.
  5. Adam optimizer

Version

Current version: 0.0.1(Draft)

Prerequisites

  • python 3.5
  • tensorflow 1.0.0
  • librosa 0.5.0

Dataset

THCHS-30

Directories

  • cache: save data feature and word dictionary
  • data: wav files and related labels
  • model: save the models

Network model

  • Data random shuffle per epoch
  • Xavier initialization
  • Adam optimization algorithms
  • Batch Normalization

Train the network

python3 train.py

Test the network

python3 test.py

References

wavenet-demo's People

Contributors

cynthiasuwi avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.