
Transformer-TTS

A TensorFlow implementation in the spirit of "Neural Speech Synthesis with Transformer Network", ported from OpenSeq2Seq.

Model

Centaur is a hand-designed encoder-decoder model based on the Neural Speech Synthesis with Transformer Network and Deep Voice 3 papers.

(Figure: Centaur model architecture)

Encoder

The encoder architecture is simple. It consists of an embedding layer and a few convolutional blocks followed by a linear projection.

Each convolutional block consists of a convolutional layer followed by batch normalization and ReLU, with dropout and a residual connection:

(Figure: Centaur convolutional block)
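
As a rough illustration (not the exact code in this repository), a Keras sketch of such a block might look like the following; the channel count, kernel size, and dropout rate are assumptions for illustration only.

```python
import tensorflow as tf

class ConvBlock(tf.keras.layers.Layer):
    """Conv1D -> BatchNorm -> ReLU -> Dropout, with a residual connection.

    A minimal sketch of the convolutional block described above; all
    hyperparameters here are illustrative, not the repository's values.
    """

    def __init__(self, channels=256, kernel_size=5, dropout_rate=0.1):
        super().__init__()
        self.conv = tf.keras.layers.Conv1D(channels, kernel_size, padding="same")
        self.bn = tf.keras.layers.BatchNormalization()
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training=False):
        y = self.conv(x)
        y = self.bn(y, training=training)
        y = tf.nn.relu(y)
        y = self.dropout(y, training=training)
        return x + y  # residual connection (assumes matching channel counts)
```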

Decoder

The decoder architecture is more complicated. It comprises a pre-net, attention blocks, convolutional blocks, and linear projections.

The pre-net consists of two fully connected layers with ReLU activation and a final linear projection.
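
A minimal sketch of such a pre-net, assuming Keras layers; the layer sizes are hypothetical and only illustrate the structure described above.

```python
import tensorflow as tf

def make_pre_net(hidden_units=256, output_dim=256):
    # Two fully connected ReLU layers followed by a linear projection,
    # per the description above. The layer sizes are assumptions.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(output_dim),  # linear projection, no activation
    ])
```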

The attention block is similar to the transformer block described in the Attention Is All You Need paper, but the self-attention mechanism is replaced with our convolutional block, and a single attention head is used. The aim of Centaur's attention block is to learn proper monotonic encoder-decoder attention. Note also that positional encoding is added to the encoder and pre-net outputs without any trainable weights.
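
The README does not spell out the encoding formula, but a weight-free positional encoding of this kind is commonly the sinusoidal table from Attention Is All You Need; a sketch under that assumption (with an even channel count):

```python
import numpy as np
import tensorflow as tf

def positional_encoding(length, channels):
    # Fixed sinusoidal encoding (no trainable weights), as in
    # "Attention Is All You Need". `channels` is assumed to be even.
    positions = np.arange(length)[:, np.newaxis]                      # [length, 1]
    div_term = np.exp(np.arange(0, channels, 2) * (-np.log(10000.0) / channels))
    table = np.zeros((length, channels), dtype=np.float32)
    table[:, 0::2] = np.sin(positions * div_term)
    table[:, 1::2] = np.cos(positions * div_term)
    return tf.constant(table)                                         # [length, channels]

# Added element-wise to the encoder or pre-net outputs, e.g.:
# x = x + positional_encoding(max_length, d_model)[tf.newaxis, :seq_len, :]
```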

The next few convolutional blocks, followed by two linear projections, predict the mel spectrogram and the stop token. Additional convolutional blocks with a linear projection are used to predict the final magnitude spectrogram, which is fed to the Griffin-Lim algorithm to generate speech.
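
As a rough sketch of this last vocoding step (using librosa rather than the repository's own utilities), the predicted magnitude spectrogram could be inverted like this; the hop length, iteration count, and sample rate are assumptions that must match the feature-extraction setup used for training.

```python
import librosa
import soundfile as sf

# "magnitude" is the predicted linear magnitude spectrogram with shape
# [1 + n_fft // 2, frames]; parameters below are illustrative only.
waveform = librosa.griffinlim(magnitude, n_iter=60, hop_length=256)
sf.write("sample.wav", waveform, samplerate=22050)
```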

Tips and Tricks

One of the most important tasks of the model is to learn smooth monotonic attention. If the alignment is poor, the model can skip or repeat words, which is undesirable. Two tricks help the model achieve this. The first is to use a reduction factor, i.e. to predict multiple frames per time step: the smaller this number, the better the voice quality, but the harder monotonic attention is to learn. In our experiments we generate 2 audio frames per step. The second trick is to force monotonic attention during inference using a fixed-size window.
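
A minimal sketch of the second trick, assuming the decoder tracks the previously attended encoder position; the window sizes and helper name are hypothetical, not taken from this repository.

```python
import tensorflow as tf

def windowed_attention_mask(prev_position, encoder_length, back=1, ahead=3):
    # Forced monotonic attention at inference time: only encoder positions
    # within a fixed window around the previously attended position keep
    # their scores. The window sizes here are assumptions.
    positions = tf.range(encoder_length)
    allowed = tf.logical_and(positions >= prev_position - back,
                             positions <= prev_position + ahead)
    # Positions outside the window get -inf, so the softmax ignores them.
    return tf.where(allowed, 0.0, float("-inf"))

# At each decoder step (illustrative):
# scores = scores + windowed_attention_mask(prev_pos, enc_len)
# attention_weights = tf.nn.softmax(scores)
```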

Audio Samples

Audio samples generated with the Centaur model can be found here: https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis/centaur-samples.html

References

  • Neural Speech Synthesis with Transformer Network
  • Deep Voice 3
  • Attention Is All You Need
  • OpenSeq2Seq (NVIDIA)

Donating

If you found this project useful, consider buying me a coffee.

Buy Me A Coffee

