mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

License: Mozilla Public License 2.0

deep-learning text-to-speech python pytorch tacotron tts speaker-encoder dataset-analysis tacotron2 tensorflow2

tts's Introduction

TTS: Text-to-Speech for all.

TTS is a library for advanced Text-to-Speech generation. It is built on the latest research and designed to achieve the best trade-off among ease of training, speed and quality. TTS comes with pretrained models and tools for measuring dataset quality, and it is already used in 20+ languages for products and research projects.


📢 English Voice Samples and SoundCloud playlist

👨‍🍳 TTS training recipes

📄 Text-to-Speech paper collection

💬 Where to ask questions

Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly, so that more people can benefit from it.

Type Platforms
🚨 Bug Reports GitHub Issue Tracker
❔ FAQ TTS/Wiki
🎁 Feature Requests & Ideas GitHub Issue Tracker
👩‍💻 Usage Questions Discourse Forum
🗯 General Discussion Discourse Forum and Matrix Channel

🔗 Links and Resources

Type Links
💾 Installation TTS/README.md
👩🏾‍🏫 Tutorials and Examples TTS/Wiki
🚀 Released Models TTS/Wiki
💻 Docker Image Repository by @synesthesiam
🖥️ Demo Server TTS/server
🤖 Running TTS on Terminal TTS/README.md
✨ How to contribute TTS/README.md

🥇 TTS Performance

"Mozilla*" and "Judy*" are our models. Details...

Features

  • High-performance deep learning models for Text2Speech tasks.
    • Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
    • Speaker Encoder to compute speaker embeddings efficiently.
    • Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN).
  • Fast and efficient model training.
  • Detailed training logs on console and TensorBoard.
  • Support for multi-speaker TTS.
  • Efficient multi-GPU training.
  • Ability to convert PyTorch models to TensorFlow 2.0 and TFLite for inference.
  • Released models in PyTorch, TensorFlow and TFLite.
  • Tools to curate Text2Speech datasets under dataset_analysis.
  • Demo server for model testing.
  • Notebooks for extensive model benchmarking.
  • Modular (but not too much) code base enabling easy testing of new ideas.

Implemented Models

Text-to-Spectrogram

Attention Methods

  • Guided Attention: paper
  • Forward Backward Decoding: paper
  • Graves Attention: paper
  • Double Decoder Consistency: blog

Speaker Encoder

Vocoders

You can also help us implement more models. Some TTS-related work can be found here.

Install TTS

TTS supports Python >= 3.6, < 3.9.

If you are only interested in synthesizing speech with the released TTS models, installing from PyPI is the easiest option.

pip install TTS

If you plan to code or train models, clone TTS and install it locally.

git clone https://github.com/mozilla/TTS
cd TTS
pip install -e .
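Optionally, as a quick sanity check (assuming the install above succeeded), confirm that the package can be imported:

python -c "import TTS; print('TTS import OK')"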

Directory Structure

|- notebooks/       (Jupyter Notebooks for model evaluation, parameter selection and data analysis.)
|- utils/           (common utilities.)
|- TTS
    |- bin/             (folder for all the executables.)
      |- train*.py                  (train your target model.)
      |- distribute.py              (train your TTS model using Multiple GPUs.)
      |- compute_statistics.py      (compute dataset statistics for normalization.)
      |- convert*.py                (convert target torch model to TF.)
    |- tts/             (text to speech models)
        |- layers/          (model layer definitions)
        |- models/          (model definitions)
        |- tf/              (Tensorflow 2 utilities and model implementations)
        |- utils/           (model specific utilities.)
    |- speaker_encoder/ (Speaker Encoder models.)
        |- (same)
    |- vocoder/         (Vocoder models.)
        |- (same)

Sample Model Output

Below you can see the Tacotron model state after 16K iterations with batch size 32 on the LJSpeech dataset.

"Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase the grey matter in the parts of the brain responsible for emotional regulation and learning."

Audio examples: soundcloud

example_output

Datasets and Data-Loading

TTS provides a generic data loader that is easy to use with your custom dataset. You just need to write a simple function to format the dataset. Check datasets/preprocess.py for examples. After that, you need to set the dataset fields in config.json.
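As a rough illustration, a formatter for a pipe-separated metadata file might look like the sketch below. The function name, file layout and speaker name are assumptions; follow the existing functions in datasets/preprocess.py for the exact item format your version expects.

import os

def my_dataset(root_path, meta_file):
    """Parse a pipe-separated metadata file into [text, wav_path, speaker_name] items."""
    items = []
    speaker_name = "my_speaker"  # assumed single-speaker dataset
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items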

Some of the public datasets to which we have successfully applied TTS:

Example: Synthesizing Speech on Terminal Using the Released Models.

After the installation, TTS provides a CLI interface for synthesizing speech using pretrained models. You can either use your own model or the released models under the TTS project.

Listing released TTS models.

tts --list_models

Run a tts and a vocoder model from the released model list. (Simply copy and paste the full model names from the list as arguments for the command below.)

tts --text "Text for TTS" \
    --model_name "<type>/<language>/<dataset>/<model_name>" \
    --vocoder_name "<type>/<language>/<dataset>/<model_name>" \
    --out_path folder/to/save/output/

Run your own TTS model (Using Griffin-Lim Vocoder)

tts --text "Text for TTS" \
    --model_path path/to/model.pth.tar \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav

Run your own TTS and Vocoder models

tts --text "Text for TTS" \
    --model_path path/to/model.pth.tar \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav \
    --vocoder_path path/to/vocoder.pth.tar \
    --vocoder_config_path path/to/vocoder_config.json

Note: You can use ./TTS/bin/synthesize.py if you prefer running tts from the TTS project folder.

Example: Training and Fine-tuning LJ-Speech Dataset

Here you can find a Colab notebook for a hands-on example, training LJSpeech. Or you can manually follow the guideline below.

To start, split metadata.csv into train and validation subsets, metadata_train.csv and metadata_val.csv respectively. Note that for text-to-speech, validation performance might be misleading, since the loss value does not directly measure voice quality to the human ear and it also does not measure the attention module's performance. Therefore, running the model with new sentences and listening to the results is the best way to go.

shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv

To train a new model, you need to define your own config.json with the model details, training configuration and more (check the examples). Then call the corresponding train script.

For instance, in order to train a tacotron or tacotron2 model on LJSpeech dataset, follow these steps.

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json

To fine-tune a model, use --restore_path.

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar

To continue an old training run, use --continue_path.

python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/

For multi-GPU training, call distribute.py. It runs any provided train script in a multi-GPU setting.

CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json

Each run creates a new output folder containing the used config.json, model checkpoints and TensorBoard logs.

In case of an error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is removed.

You can also use TensorBoard by pointing its --logdir argument to the experiment folder.
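For example (the run folder path is a placeholder):

tensorboard --logdir /path/to/your/run_folder/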

Contribution Guidelines

This repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details, please read the Mozilla Community Participation Guidelines.

  1. Create a new branch.
  2. Implement your changes.
  3. (if applicable) Add Google Style docstrings.
  4. (if applicable) Implement a test case under the tests folder.
  5. (Optional but preferred) Run the tests.
./run_tests.sh
  6. Run the linter.
pip install pylint cardboardlint
cardboardlinter --refspec master
  7. Send a PR to the dev branch and explain what the change is about.
  8. Let us discuss until we make it perfect :).
  9. We merge it to the dev branch once things look good.

Feel free to ping us at any step you need help using our communication channels.

Collaborative Experimentation Guide

If you would like to use TTS to try a new idea and share your experiments with the community, we urge you to follow the guidelines below for better collaboration. (If you have an idea for improving this process, let us know.)

  • Create a new branch.
  • Open an issue pointing your branch.
  • Explain your idea and experiment.
  • Share your results regularly. (TensorBoard log files, audio results, visuals etc.)

Major TODOs

Acknowledgement

tts's People

Contributors

anand-371, bajibabu, dependabot[bot], edresson, erogol, fatihkiralioglu, forcecore, geneing, gerazov, jyegerlehner, lexkoro, lstolcman, luhavis, m-toman, maxbachmann, mic92, mittimithai, mweinelt, nmstoker, repodiac, reuben, richardburleigh, sanjaesc, thllwg, thorstenmueller, tomzx, tset-tset-tset, twerkmeister, weberjulian, yweweler


tts's Issues

Checkpoint Sharing

Hi guys,
Thank you for your work! This is very nice. I was wondering if you could share your trained model so I can play a bit with it without having to train my own which I assume takes a long time. Also would you be kind enough to indicate how long it took you on what kind of hardware?

Thanks!

Missing keys in State Dictionary

When running both with and without CUDA, using either pretrained model, I get the following error:


RuntimeError: Error(s) in loading state_dict for Tacotron:
	Missing key(s) in state_dict: "decoder.stopnet.1.weight", "decoder.stopnet.1.bias". 

The state dict has keys for decoder.prenet but not decoder.stopnet. Is there a workaround to this other than training my own model from scratch?
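One possible workaround (an assumption, and only safe if the missing keys are genuinely unused by your config) is to load the state dict non-strictly, which ignores missing and unexpected keys. Here `model` is assumed to be your instantiated Tacotron:

import torch

checkpoint = torch.load("path/to/checkpoint.pth.tar", map_location="cpu")
# many TTS checkpoints store the weights under a "model" key; adjust if yours differs
model.load_state_dict(checkpoint["model"], strict=False)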

404 Readme link

You might hear a sample here.

The link goes to a non-existent page.

Make a high-quality public domain training set using Mozilla DeepSpeech and LibriVox (idea/enhancement)

As I understand it, the difference between Google's model and the pretrained model available here is the quality and size of the training set.

Would it be possible to take a high-quality, long LibriVox recording and use Mozilla's STT model to pinpoint the timing of each spoken word (we already have the ground-truth text from LibriVox, so it's only a matter of timing it)?

We could get some tens of hours of single-speaker recordings this way.

Does it make sense? How easy is this to accomplish? I could have a go if it's not hard; I haven't messed with DeepSpeech yet, and haven't looked at how the dataset is encoded, so I don't know how hard or important it is.

Testing

I want to use a pretrained network, so I used the notebook under the notebooks folder,
but:
ModuleNotFoundError: No module named 'torchviz'
I used conda and pip to install 'torchviz', but:
"Could not find a version that satisfies the requirement torchviz (from versions: )
No matching distribution found for torchviz"

Solution:

pip install git+https://github.com/szagoruyko/pytorchviz

any experience of unstable tacotron?

First, thank you for this amazing codebase and your hard work!
I really love it.

I ran into some bad cases like skipped words, non-stopping synthesis and repeated words. Any insight or experiments on this topic?

why use attention smoothing?

I saw this:
alignment = torch.sigmoid(alignment) / torch.sigmoid(alignment).sum(dim=1).unsqueeze(1)

Are there any experiments on this? Does it improve anything?
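For reference, here is a minimal side-by-side of the usual softmax normalization and the sigmoid-based smoothing shown above (illustrative only; `alignment` is assumed to be a [batch, encoder_steps] score tensor):

import torch

def softmax_attention(alignment):
    # standard attention normalization
    return torch.softmax(alignment, dim=1)

def smoothed_attention(alignment):
    # sigmoid "smoothing": each score is squashed independently, then renormalized to sum to 1
    sig = torch.sigmoid(alignment)
    return sig / sig.sum(dim=1, keepdim=True)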

CPU training compatible

Some of us have to train on CPU. I found that this code does not work on CPU:

 # forward pass
        mel_output, linear_output, alignments, stop_tokens = torch.nn.parallel.data_parallel(
            model, (text_input, mel_input, mask))

It is only this code. How can I change it to make it run on CPU?
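A minimal sketch of a possible workaround (not the repository's official fix), assuming the surrounding variables and the torch import from train.py: torch.nn.parallel.data_parallel requires CUDA, so call the model directly when no GPU is available.

# illustrative fallback
if torch.cuda.is_available():
    mel_output, linear_output, alignments, stop_tokens = torch.nn.parallel.data_parallel(
        model, (text_input, mel_input, mask))
else:
    mel_output, linear_output, alignments, stop_tokens = model(text_input, mel_input, mask)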

prenet dropout

I was using another repo previously, and now I am switching to Mozilla TTS.

In my experience, the dropout in the decoder prenet also has to be used at inference; without dropout at inference the quality is bad (Tacotron 2), which is hard to understand.

Do you have similar experience, and why do you think that is?
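For context, here is a minimal sketch of a prenet that keeps dropout active at inference time, which is the behavior described above (an illustration of the Tacotron 2 recipe, not the exact module in this repository):

import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes=(256, 256), dropout=0.5):
        super().__init__()
        dims = (in_dim,) + tuple(sizes)
        self.layers = nn.ModuleList(nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:]))
        self.dropout = dropout

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout on even in eval mode, as described in the Tacotron 2 paper
            x = F.dropout(F.relu(linear(x)), p=self.dropout, training=True)
        return x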

Tacotron2 + WaveRNN experiments

Tacotron2: https://arxiv.org/pdf/1712.05884.pdf
WaveRNN: https://github.com/erogol/WaveRNN forked from https://github.com/fatchord/WaveRNN

The idea is to add Tacotron2 as another alternative if it turns out to be more useful than the current model.

  • Code the boilerplate Tacotron2 architecture.
  • Train Tacotron2 and compare results (baseline).
  • Train the current TTS model at a size comparable to T2. (The current TTS model has 7M parameters and Tacotron2 has 28M.)
  • Add TTS-specific architectural changes to T2 and compare with the baseline.
  • Train WaveRNN as a vocoder on generated spectrograms.
  • Train a better stopnet. The stopnet sometimes misses the prediction, which leads to unstable outputs. Maybe it is better to use an RNN as in the previous TTS version.
  • Release the LJSpeech Tacotron 2 model. (soon)
  • Release the LJSpeech WaveRNN model. (https://github.com/erogol/WaveRNN)

Best result so far: https://soundcloud.com/user-565970875/ljspeech-logistic-wavernn

Some findings:

  • Adding an entropy loss on the attention weights seems to improve cases where the alignment is hard to learn. It forces the network to learn sparser, noise-free alignment weights.
entropy = torch.distributions.Categorical(probs=alignments).entropy()
entropy_loss = (entropy / np.log(alignments.shape[1])).mean()
loss += 1e-4 * entropy_loss

Here is the alignment with entropy loss. However, if you keep the loss weight high, then it degrades the model's generalization for new words.

  • Replacing the prenet with a BatchNorm version enhances performance quite a lot.
  • A network with a BN prenet has a harder time learning the attention. It looks like the network needs a level of noise on the autoregressive connection to relate the encoder output to the network output. Otherwise, in teacher-forcing mode, the network does not need the encoder output, since it finds the previous predicted frame enough to generate the next frame.
  • Forward attention seems more robust to longer sequences and faster to align. (https://arxiv.org/abs/1807.06736)

Min DB and Ref DB

Neither the Tacotron 2 nor the Tacotron paper mentions anything about decibel normalization. Can you help me understand why this is necessary?

Relevant config:

 "min_level_db": -100,
  "ref_level_db": 20,

Tacotron: Trying r < 5

Expecting better fidelity with r=2, which is also the setting used by the original paper.

Our previous runs used r=5 for the benefit of faster training.

synthesizer wav length

Hi,
While testing with the LJSpeech model I found that the maximum wav file length generated by the synthesizer is 12 seconds. Is this limited by the training dataset, or is there an option to change it?
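A likely explanation (an assumption based on typical Tacotron decoders, worth verifying against your checkout): the decoder stops after a fixed number of steps (max_decoder_steps) rather than a dataset-derived limit, so the output length is capped regardless of the input text. A back-of-the-envelope check with assumed values:

# illustrative arithmetic only; the constants are assumptions, not read from this repo's config
MAX_DECODER_STEPS = 500   # assumed decoder step cap
R = 2                     # assumed reduction factor (frames per decoder step)
HOP_LENGTH = 256          # assumed STFT hop in samples
SAMPLE_RATE = 22050       # assumed audio sample rate

max_seconds = MAX_DECODER_STEPS * R * HOP_LENGTH / SAMPLE_RATE
print(f"max synthesized length ~ {max_seconds:.1f} s")  # ~11.6 s, close to the observed 12 s

Raising the step cap (or the equivalent setting in your version) should allow longer outputs.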

No module named 'TTS'

Python version: 3.6.6
requirements.txt installation successful.
python3.6 setup.py develop successfully completed.

But the TTS module is not importable.

>>> from TTS.models.tacotron import Tacotron
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'TTS'

Are there any paths to be set? I am not using virtualenv.

Tried the same with miniconda as well, but no luck.

Error with 'initial_lr' parameter

Environment:

Python 3.6
PyTorch 0.4.1
Cuda 9.1

I encountered the following error while trying to train a model from LJSpeech following the steps in README.md:

Traceback (most recent call last):
  File "train.py", line 493, in <module>
    main(args)
  File "train.py", line 433, in main
    scheduler = AnnealLR(optimizer, warmup_steps=c.warmup_steps, last_epoch=args.restore_step)
  File "/workspace/TTS/utils/generic_utils.py", line 148, in __init__
    super(AnnealLR, self).__init__(optimizer, last_epoch)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 20, in __init__
    "in param_groups[{}] when resuming an optimizer".format(i))
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

After digging around a bit, it looks like the problem is with the 'last_epoch=args.restore_step' argument to AnnealLR() call. This argument is set in train.py to zero when not using a checkpoint on line 425:

args.restore_step = 0

However, the lr_scheduler.py module expects "-1" for the initial epoch. I changed zero to -1 in line 425

args.restore_step = -1

and the training from scratch seems to be working now.

Inference time metrics?

Hello, I am very interested in this project, I am looking for a pytorch implementation of Tacotron/Tacotron2/WaveNet and may wish to contribute. Do you have any metrics on forward-pass time for inference on new text? I am looking to export a PyTorch model into Caffe2 and run it on a mobile platform.

Where to find metadata_val.csv

I downloaded the dataset from https://keithito.com/LJ-Speech-Dataset/ by clicking on the Download button. Assuming that would be enough, I ran python train.py --config_path config.json after modifying the config file for my own machine.

First it complained about the missing metadata_train.csv and then about the missing metadata_val.csv.

In the readme, there is no mention of whether I need to run anything else for preprocessing, so maybe I am missing something.

To try to fix it, I copied metadata.csv into metadata_train.csv and metadata_val.csv, gave it a run, and got the following error:

 > Git Hash: 186a81c
 > Experiment folder: /Users/manish/Work/TTS/experiments/July-11-2018_04:51PM-best-model-186a81c
 > Reading LJSpeech from - /Users/manish/Downloads/LJSpeech-1.1/wavs
 | > Number of instances : 13100
 | > Max length sequence 187
 | > Min length sequence 5
 | > Avg length sequence 98.34648854961831
 | > 0 instances are ignored by min_seq_len (0)
 > Reading LJSpeech from - /Users/manish/Downloads/LJSpeech-1.1/wavs
 | > Number of instances : 13100
 | > Max length sequence 187
 | > Min length sequence 5
 | > Avg length sequence 98.34648854961831
 | > 0 instances are ignored by min_seq_len (0)
 | > Number of characters : 149

 > Starting a new training
 | > Model has 7385090 parameters
 | > Epoch 0/1000
 ! Run is removed from /Users/manish/Work/TTS/experiments/July-11-2018_04:51PM-best-model-186a81c
Traceback (most recent call last):
  File "train.py", line 434, in <module>
    main(args)
  File "train.py", line 424, in main
    model, criterion, criterion_st, train_loader, optimizer, optimizer_st, epoch)
  File "train.py", line 112, in train
    model.forward(text_input, mel_input)
  File "/Users/manish/Work/TTS/models/tacotron.py", line 31, in forward
    encoder_outputs, mel_specs)
  File "/Users/manish/miniconda3/envs/tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/manish/Work/TTS/layers/tacotron.py", line 242, in forward
    memory = memory.view(B, memory.size(1) // self.r, -1)
RuntimeError: invalid argument 2: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Call .contiguous() before .view(). at /Users/soumith/minicondabuild3/conda-bld/pytorch_1524590658547/work/aten/src/TH/generic/THTensor.cpp:280

Notebook Sample Generation results in a RuntimeError

I have begun training TTS on the en_UK corpus released by M-AILABS. However, I doubt that the behaviour I experienced is related to the latter.

Essentially, I followed the notebooks given (and placed them into a Python script, generate.py) and both have resulted in the following error:

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got CUDAIntTensor instead (while checking arguments for embedding)

It seems to trace back to either the model or the create_speech function.

Are the notebooks just outdated or is this unusual behaviour?
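A possible workaround (an assumption based on the error message, not a confirmed fix): the embedding layer expects LongTensor indices, so cast the character-id tensor before the forward pass.

# `chars_var` is a hypothetical name for whatever tensor feeds the embedding layer
chars_var = chars_var.long()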

NameError: name 'mode' is not defined

Hi,
When I am running the benchmark with checkpoint_272976.pth.tar I am getting this error:

 | > Number of characters : 149
Traceback (most recent call last):
  File "ttsbechmark.py", line 46, in <module>
    model = Tacotron(CONFIG.embedding_size, CONFIG.num_freq, CONFIG.num_mels, CONFIG.r)
  File "/media/hamza/Local Disk/Projects/Untitled Folder/TTS/models/tacotron.py", line 20, in __init__
    self.decoder = Decoder(256, mel_dim, r)
  File "/media/hamza/Local Disk/Projects/Untitled Folder/TTS/layers/tacotron.py", line 203, in __init__
    self.mode = mode
NameError: name 'mode' is not defined

Thanks

KeyError: ((1, 1, 1000), '|u1') && tensorboardX error

Traceback (most recent call last):
  File "/home/jackie/anaconda3/envs/tts/lib/python3.6/site-packages/PIL/Image.py", line 2460, in fromarray
    mode, rawmode = _fromarray_typemap[typekey]
KeyError: ((1, 1, 1000), '|u1')

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "train.py", line 477, in <module>
    main(args)
  File "train.py", line 468, in main
    val_loss = evaluate(model, criterion, criterion_st, val_loader, current_step)
  File "train.py", line 310, in evaluate
    tb.add_image('ValVisual/Reconstruction', const_spec, current_step)
  File "/home/jackie/anaconda3/envs/tts/lib/python3.6/site-packages/tensorboardX/writer.py", line 412, in add_image
    self.file_writer.add_summary(image(tag, img_tensor), global_step, walltime)
  File "/home/jackie/anaconda3/envs/tts/lib/python3.6/site-packages/tensorboardX/summary.py", line 205, in image
    image = make_image(tensor, rescale=rescale)
  File "/home/jackie/anaconda3/envs/tts/lib/python3.6/site-packages/tensorboardX/summary.py", line 243, in make_image
    image = Image.fromarray(tensor)
  File "/home/jackie/anaconda3/envs/tts/lib/python3.6/site-packages/PIL/Image.py", line 2463, in fromarray
    raise TypeError("Cannot handle this data type")
TypeError: Cannot handle this data type

Thanks in advance! I guess it is because of a different tensorboardX version?

librosa.util.exceptions.ParameterError: Target size (38) must be at least input size (1100)

After running the server (python3 server/server.py -c server/conf.json),
I tried to synthesize text through the web browser.

But the message "librosa.util.exceptions.ParameterError: Target size (38) must be at least input size (1100)" occurred.

Below is the full message.

File "tts/utils/audio.py", line 110, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "tts/utils/audio.py", line 124, in _stft
y=y, n_fft=self.n_fft, hop_length=self.hop_length, win_length=self.win_length)
File "tts/lib/python3.6/site-packages/librosa-0.5.1-py3.6.egg/librosa/core/spectrum.py", line 152, in stft
fft_window = util.pad_center(fft_window, n_fft)
File "tts/lib/python3.6/site-packages/librosa-0.5.1-py3.6.egg/librosa/util/utils.py", line 287, in pad_center
'at least input size ({:d})').format(size, n))
librosa.util.exceptions.ParameterError: Target size (38) must be at least input size (1100)

Where is the newest model

I downloaded the 272976-iteration model, and running the synthesis notebook I got this error:

RuntimeError: Error(s) in loading state_dict for Tacotron:
	Missing key(s) in state_dict: "encoder.cbhg.cbhg.conv1d_banks.0.conv1d.weight", "encoder.cbhg.cbhg.conv1d_banks.0.bn.weight", "encoder.cbhg.cbhg.conv1d_banks.0.bn.bias", "encoder.cbhg.cbhg.conv1d_banks.0.bn.running_mean", "encoder.cbhg.cbhg.conv1d_banks.0.bn.running_var", "encoder.cbhg.cbhg.conv1d_banks.1

Tacotron: Using randomized teacher-forcing

Assuming that using only teacher forcing would lead to overfitting, it might be better to select randomly between the real network output and the ground truth at each decoder iteration.
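A minimal sketch of the idea (scheduled sampling / randomized teacher forcing); the ratio and the function below are illustrative, not part of the repository:

import torch

TEACHER_FORCING_RATIO = 0.5  # assumed hyperparameter

def next_decoder_input(ground_truth_frame, predicted_frame):
    # randomly feed back either the ground truth or the model's own previous prediction
    if torch.rand(1).item() < TEACHER_FORCING_RATIO:
        return ground_truth_frame
    return predicted_frame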

Tacotron or Tacotron2?

Hi,

Curious question: this repository is named "tacotron"; therefore, are you implementing tacotron or tacotron-2? From my understanding, it's easier to implement the tacotron-2 architecture and it is of higher quality!

Thanks!
