
tensorflow-wavenet's Introduction

A TensorFlow implementation of DeepMind's WaveNet paper

This is a TensorFlow implementation of the WaveNet generative neural network architecture for audio generation.

The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation (see the DeepMind blog post and paper for details).

The network models the conditional probability of the next sample in the audio waveform, given all previous samples and possibly additional parameters.

After an audio preprocessing step, the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).
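A minimal NumPy sketch of this step, assuming the paper's mu-law companding and the default of 256 quantization channels (the repository's own code may differ in details):

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # Quantize a float waveform in [-1, 1] to integers in [0, quantization_channels - 1].
    mu = quantization_channels - 1
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

def one_hot(quantized, quantization_channels=256):
    # (num_samples,) integer amplitudes -> (num_samples, num_channels) one-hot tensor.
    encoded = np.zeros((len(quantized), quantization_channels), dtype=np.float32)
    encoded[np.arange(len(quantized)), quantized] = 1.0
    return encoded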

A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.

The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes), which only accesses the current and past audio samples.
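For intuition, a single dilated causal convolution can be sketched in NumPy as follows (illustrative only, not the TensorFlow code in model.py); output[t] depends only on x[t], x[t - dilation], and so on, never on future samples:

import numpy as np

def causal_dilated_conv(x, weights, dilation):
    # x: (num_samples, in_channels); weights: (filter_width, in_channels, out_channels).
    filter_width = weights.shape[0]
    # Left-pad so that output[t] never depends on inputs later than t (causality).
    pad = dilation * (filter_width - 1)
    x_padded = np.pad(x, ((pad, 0), (0, 0)))
    out = np.zeros((x.shape[0], weights.shape[2]))
    for k in range(filter_width):
        # Tap k looks back k * dilation samples.
        out += x_padded[pad - k * dilation : pad - k * dilation + x.shape[0]] @ weights[k]
    return out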

The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.

The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
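In other words, the network is trained to predict the next quantized sample. A hedged NumPy sketch of that objective, assuming logits of shape (num_samples, num_channels) and the integer targets produced by the encoding sketched above:

import numpy as np

def next_sample_cross_entropy(logits, targets):
    # Compare the prediction at timestep t with the input at timestep t + 1.
    logits, targets = logits[:-1], targets[1:]
    # Numerically stable log-softmax.
    log_probs = logits - logits.max(axis=1, keepdims=True)
    log_probs -= np.log(np.exp(log_probs).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()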

In this repository, the network implementation can be found in model.py.

Requirements

TensorFlow needs to be installed before running the training script. Code is tested on TensorFlow version 1.0.1 for Python 2.7 and Python 3.5.

In addition, librosa must be installed for reading and writing audio.

To install the required python packages, run

pip install -r requirements.txt

For GPU support, use

pip install -r requirements_gpu.txt

Training the network

You can use any corpus containing .wav files. We've mainly used the VCTK corpus (around 10.4GB, Alternative host) so far.

To train the network, execute

python train.py --data_dir=corpus

where corpus is a directory containing .wav files. The script will recursively collect all .wav files in the directory.
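The recursive search behaves roughly like the following sketch (the actual logic lives in the audio reader):

import fnmatch
import os

def find_wav_files(directory):
    # Recursively collect all .wav files below `directory`.
    files = []
    for root, _, filenames in os.walk(directory):
        for filename in fnmatch.filter(filenames, '*.wav'):
            files.append(os.path.join(root, filename))
    return files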

You can see documentation on each of the training settings by running

python train.py --help

You can find the configuration of the model parameters in wavenet_params.json. These need to stay the same between training and generation.
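For reference, the file is ordinary JSON and can be inspected or reused like this (a trivial sketch; the key names shown are the ones that appear in the default file):

import json

with open('wavenet_params.json') as f:
    wavenet_params = json.load(f)

# The same dictionary must describe the model at generation time,
# otherwise the saved weights will not match the rebuilt graph.
print(wavenet_params['dilations'])
print(wavenet_params['quantization_channels'])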

Global Conditioning

Global conditioning refers to modifying the model such that the id of a set of mutually exclusive categories is specified during training and generation of .wav files. In the case of the VCTK corpus, this id is the integer id of the speaker, of which there are over a hundred. This allows (indeed requires) a speaker id to be specified at generation time to select which of the speakers the model should mimic. For more details, see the paper or the source code.

Training with Global Conditioning

The instructions above for training refer to training without global conditioning. To train with global conditioning, specify command-line arguments as follows:

python train.py --data_dir=corpus --gc_channels=32

The --gc_channels argument does two things:

  • It tells the train.py script that it should build a model that includes global conditioning.
  • It specifies the size of the embedding vector that is looked up based on the id of the speaker.

The global conditioning logic in train.py and audio_reader.py is currently "hard-wired" to the VCTK corpus, in that it expects to be able to determine the speaker id from the file-naming pattern used in VCTK, but it can easily be modified.
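As an illustration only (the exact regular expression and variable names in audio_reader.py may differ), extracting the speaker id from a VCTK-style file name such as p280_123.wav and looking up a conditioning embedding could look like:

import re
import numpy as np

def speaker_id_from_filename(filename):
    # VCTK files are named like p280_123.wav; return 280, or None if there is no match.
    match = re.search(r'p(\d+)_\d+\.wav$', filename)
    return int(match.group(1)) if match else None

# Hypothetical embedding table: one gc_channels-sized vector per possible speaker id.
gc_channels, gc_cardinality = 32, 377
embedding_table = np.random.randn(gc_cardinality, gc_channels).astype(np.float32)
speaker_embedding = embedding_table[speaker_id_from_filename('p280_123.wav')]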

Generating audio

Example output generated by @jyegerlehner based on speaker 280 from the VCTK corpus.

You can use the generate.py script to generate audio using a previously trained model.

Generating without Global Conditioning

Run

python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

where logdir/train/2017-02-13T16-45-34/model.ckpt-80000 needs to be a path to a previously saved model (without extension). The --samples parameter specifies how many audio samples you would like to generate (16000 corresponds to 1 second by default).

The generated waveform can be played back using TensorBoard, or stored as a .wav file by using the --wav_out_path parameter:

python generate.py --wav_out_path=generated.wav --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Passing --save_every in addition to --wav_out_path will save the in-progress wav file every n samples.

python generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Fast generation is enabled by default. It uses the implementation from the Fast Wavenet repository. You can follow the link for an explanation of how it works. This reduces the time needed to generate samples to a few minutes.

To disable fast generation:

python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000 --fast_generation=false

Generating with Global Conditioning

Generate from a model incorporating global conditioning as follows:

python generate.py --samples 16000  --wav_out_path speaker311.wav --gc_channels=32 --gc_cardinality=377 --gc_id=311 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Where:

--gc_channels=32 specifies that the embedding vector has size 32, and must match the value specified when training.

--gc_cardinality=377 is required because 376 is the largest id of a speaker in the VCTK corpus. If some other corpus is used, this number should match what is automatically determined and printed out by the train.py script at training time.

--gc_id=311 specifies the id of the speaker (speaker 311) for which a sample is to be generated.

Running tests

Install the test requirements

pip install -r requirements_test.txt

Run the test suite

./ci/test.sh

Missing features

Currently there is no local conditioning on extra information which would allow context stacks or controlling what speech is generated.

Related projects

tensorflow-wavenet's People

Contributors

akademi4eg, bdelmas, belevtsoff, chanil1218, code-terminator, devinplatt, fehiepsi, fjolnir-dvorak, genekogan, huyouare, ibab, isnbh0, jfsantos, jrao1, jyegerlehner, koz4k, lemonzi, macsj200, maxhodak, mchinen, mecab, mortont, multivac61, nakosung, pineking, pkhorrami4, prajitr, robinsloan, tomlepaine, zectbynmo

tensorflow-wavenet's Issues

add regularization, dropout and batch norm?

Has anybody got a loss lower than ~2? I've tried a couple of configurations (the default, and 3 and 4 stacks of 10 dilation layers), but the loss does not get lower, suggesting the network is not learning anymore.

Also, here is what happened after ~30k steps:
(training loss plot)
I believe this is the same problem as reported in #30. Here is what happens with the weights:
(weights plot)

Now running the same network with l2 norm regularization added.

And one more note: training just stops after 44256 steps (this has already happened twice) without any warnings or errors, despite num_steps=50000.

Fast generation

I'm trying to understand the differences between the implementation in wavenet.py in this repository and the implementation in @tomlepaine's fast-wavenet. I think I understand the insight in fast-wavenet but are there drawbacks associated with it that mean this repository wouldn't want to just adopt it as the main implementation? Why keep separate "fast" and (presumably) "slow" models?

Initialization of variables

Are the magic numbers for weight initialization chosen for particular reasons (e.g. stddev=0.3)? Would we prefer something like tf.contrib.layers.xavier_initializer?

Singing

It could be really interesting to train on solo singing tracks and Lilypond. But it would probably be too hard to collect such a dataset while handling copyright issues.

Number of seconds per step

Please, I have a simple question.

I'm training WaveNet with the VCTK corpus dataset on a CPU-only machine (Intel i5), running at about 20 sec/step. At this rate, I can reach about 4000 steps per day, or 30,000 per week. My question is: how many steps do we need to get a loss of about 2? And another question: can anybody tell me what seconds-per-step rate a good GPU can reach? By the way, what rate are each of you getting?

Regards,
Samu.

Note. I'm using the default params:
{
  "filter_width": 2,
  "sample_rate": 16000,
  "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                1, 2, 4, 8, 16, 32, 64, 128, 256],
  "residual_channels": 32,
  "dilation_channels": 16,
  "quantization_channels": 256,
  "skip_channels": 256,
  "use_biases": false
}

Training error in main.py

Getting the following error when I try to train the network - any idea what this is?

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 12.00GiB
Free memory: 11.53GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "main.py", line 129, in <module>
    main()
  File "main.py", line 83, in main
    loss = net.loss(audio_batch)
  File "/home/seth/Development/tensorflow-wavenet/wavenet.py", line 97, in loss
    raw_output = self._create_network(encoded)
  File "/home/seth/Development/tensorflow-wavenet/wavenet.py", line 67, in _create_network
    dilation=dilation)
  File "/home/seth/Development/tensorflow-wavenet/wavenet.py", line 23, in _create_dilation_layer
    name="conv_f")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 168, in atrous_conv2d
    in_height = int(value_shape[1])
TypeError: __int__ returned non-int (type NoneType)

Bug with _causal_dilated_conv?

I test _causal_dilated_conv in wavenet.py with the following toy example:
batch_size: 1
height: 1
width: 20
in_channel: 1
out_channel: 1
dilation: 4

The “value” tensor is set with shape [1, 1, 20, 1], and the values are from 1 to 20. The “filter” is set with shape [1, 2, 1, 1], and both values are 1.

The "out" tensor is:
[[[[ 3.], [ 5.], [ 7.], [ 9.], [ 13.], [ 15.], [ 17.], [ 19.], [ 23.], [ 25.], [ 27.], [ 29.], [ 33.], [ 35.], [ 37.], [ 39.]]]]

I suppose the correct "out" should be:
[[[[ 6.], [ 8.], [ 10.], [ 12.], [ 14.], [ 16.], [ 18.], [ 20.], [ 22.], [ 24.], [ 26.], [ 28.], [ 30.], [ 32.], [ 34.], [ 36.]]]]

In this example, the "reshaped" tensor is:
[[[[ 1.], [ 2.], [ 3.], [ 4.], [ 5.]]],
[[[ 6.], [ 7.], [ 8.], [ 9.], [ 10.]]],
[[[ 11.],[ 12.],[ 13.],[ 14.],[ 15.]]],
[[[ 16.],[ 17.],[ 18.],[ 19.],[ 20.]]]]

I guess it should be
[[[[ 1.], [ 5.], [ 9.], [ 13.], [ 17.]]],
[[[ 2.], [ 6.], [ 10.], [ 14.], [ 18.]]],
[[[ 3.], [ 7.], [ 11.], [ 15.], [ 19.]]],
[[[ 4.], [ 8.], [ 12.], [ 16.], [ 20.]]]]
and then the result should be correct.

Bootstrapping the generation with existing audio

Currently, we start the generation with a randomly picked waveform sample.
I wonder what kind of effect that has, considering that the dilated convolutions won't be able to reach backwards beyond the beginning of the generated sample.
Maybe we should start off with one of the audio recordings.

OOM on GTX 1080

Hi Igor, I'm getting OOM on a GTX 1080. I reduced the sample size to one directory (175 files) and am still getting this error. Do you have any ideas how to fit the tensors in memory on 8 GB cards?

I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
Limit:                  7690878976
InUse:                  7325787648
MaxInUse:               7465885184
NumAllocs:                    9793
MaxAllocSize:           2779725056

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****_********************_*****************************************************************xxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 928.75MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:940] Resource exhausted: OOM when allocating tensor with shape[256,256,1,3715]
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Traceback (most recent call last):
  File "train.py", line 151, in <module>
    main()
  File "train.py", line 136, in main
    summary, loss_value, _ = sess.run([summaries, loss, optim])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[256,256,1,3715]
     [[Node: gradients/dilated_stack/layer4/conv_f_grad/Conv2DBackpropInput = Conv2DBackpropInput[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/dilated_stack/layer4/conv_f_grad/Shape, dilated_stack/layer4/Variable/read, gradients/dilated_stack/layer4/conv_f/BatchToSpace_grad/SpaceToBatch)]]
Caused by op u'gradients/dilated_stack/layer4/conv_f_grad/Conv2DBackpropInput', defined at:

Thank you in advance

Error training multiple epochs

Training on a truncated sample of the VCTK corpus causes it to hang with no error if it hits the end of the corpus before the number of steps is complete. It looks like this is due to iterate_through_vctk not resetting once it hits the end. It also looks like this is happening in #65 on the full corpus, since it's stalling at 44256 steps, which is close to the number of samples in the VCTK corpus (109 speakers x ~400 samples/speaker).

I've tested this with an expanded file list (to iterate through multiple epochs) and it no longer stops at the end of the corpus.

I suggest we implement a tf.train.string_input_producer to iterate through epochs. That way we can also shuffle input, as has been mentioned in the discussion in #47.
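A hedged sketch of that proposal using the TensorFlow 1.x queue API (the file names here are placeholders; in practice they would come from the recursive .wav search):

import tensorflow as tf

filenames = ['corpus/p225_001.wav', 'corpus/p225_002.wav']  # placeholder list
# The producer cycles through the list indefinitely, reshuffling on every pass.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
filename = filename_queue.dequeue()

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(filename))  # a randomly chosen file name
    coord.request_stop()
    coord.join(threads)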

Excessive memory consumption

The network currently runs into out of memory issues at a low number of layers.
This seems to be a problem with TensorFlow's atrous_conv2d operation.
If I set the dilation factor to 1, which means atrous_conv2d simply calls conv2d, I can easily run with tens of layers.
It could just be the additional batch_to_space and space_to_batch operations, in which case I can write a single C++ op for atrous_conv2d.

Generated silence

Hi,

I followed all the instructions in the README for training a model (on the default dataset). Then I used the generation script to generate a few seconds of sound, again according to the instructions, but unfortunately I got silence.

Any idea what I might have got wrong?

GPU OOM

Running train.py on Nvidia GRID K2, 1536 cores, 3.5 GB memory.

I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K2
major: 3 minor: 0 memoryClockRate (GHz) 0.745
pciBusID 0000:00:04.0
Total memory: 3.50GiB
Free memory: 3.45GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K2, pci bus id: 0000:00:04.0)
Trying to restore saved checkpoints from ./logdir/train/2016-09-20T19-23-04 ... No checkpoint found.
step 0 - loss = 7.461, (13.180 sec/step)
Storing checkpoint to ./logdir/train/2016-09-20T19-23-04 ... Done.

Then the BFC allocator runs out of memory:

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **************************************************************************************__**********xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 194.79MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:940] Resource exhausted: OOM when allocating tensor with shape[199462,256]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3463 get requests, put_count=3233 evicted_count=1000 eviction_rate=0.30931 and unsatisfied allocation rate=0.38406

Do you know if there is a setting I can specify to work around GPU memory limitations?

Switch to YAML

Should we use YAML instead of JSON for the parameters?

👍 : More human-friendly, more compact.
👍 : We can add comments explaining the meaning of each parameter.
👎 : Not native, we would need to add pyYAML to the dependencies.

Feeding raw audio waveform into the first layer

We've discussed the fact that one-hot encoding the input to the network is kind of weird, and that it would be more natural to use the waveform as a single-channel floating point tensor instead.
Does anyone have experience with running our implementation in this way?
Should we switch to this method?

CLI Flags, JSON configuration files, and default values

The current setup with some parameters specified as CLI flags and others in the JSON file is not optimal. There are two competing needs:

  • We would like to group related parameters in a JSON file so that it's easy to share model configurations with the community. Attaching the JSON file to a saved checkpoint makes it trivial to replicate experiments and test networks that others with more resources have trained, thus helping democratize the platform.
  • Changing individual parameters during development should be very, very easy.

We should find a way to accommodate both. I propose:

  • All parameters are defined as flags.
  • A list of JSON files is read sequentially and merged, and finally the supplied flags are merged into the combined JSON dictionary for manual overrides (see the sketch below).
  • We would have a JSON file with model parameters, a JSON file with training parameters (learning rate, etc.), and users could add another one with their specific overrides.

It still doesn't feel optimal, though. Any suggestions?
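One possible shape of that merge step, as a sketch (the file names and flag handling are illustrative, not an agreed design):

import json

def load_params(json_paths, flag_overrides):
    # Merge the JSON files left to right, then apply only the flags the user actually set.
    params = {}
    for path in json_paths:
        with open(path) as f:
            params.update(json.load(f))
    params.update({key: value for key, value in flag_overrides.items() if value is not None})
    return params

params = load_params(['wavenet_params.json', 'training_params.json'],
                     {'learning_rate': 1e-4, 'sample_rate': None})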

Can't generate with biases

There's no way to turn off fast generation that I can find, and a model using biases needs to have it switched off in order to generate. The --fast_generation command line arg is always True no matter what you do.

Cannot create output from generated model - tries loading wrong variables

I am trying to run generate.py using python generate.py --samples 16000 ./model.ckpt-3999

However, the loading process fails with the following error:

Caused by op u'save/restore_slice', defined at:
  File "generate.py", line 174, in <module>
    main()
  File "generate.py", line 113, in main
    saver = tf.train.Saver(variables_to_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 986, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1015, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 620, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 357, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 270, in restore_op
    preferred_shard=preferred_shard))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 204, in _restore_slice
    preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 359, in _restore_slice
    preferred_shard=preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Tensor name "wavenet/causal_layer/Variable" not found in checkpoint files ./model.ckpt-3999
     [[Node: save/restore_slice = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/restore_slice/tensor_name, save/restore_slice/shape_and_slice)]]
     [[Node: save/restore_slice_48/_117 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_36_save/restore_slice_48", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

The tensors in my checkpoint file have names that look deliberate, for example, //wavenet/causal_layer/filter/Adam

However, some println debugging shows that the saver in generate.py is trying to load very generic variable names like this:

wavenet/causal_layer/Variable
wavenet/dilated_stack/layer0/Variable
wavenet/dilated_stack/layer0/Variable_1
wavenet/dilated_stack/layer0/Variable_2
wavenet/dilated_stack/layer0/skip
wavenet/dilated_stack/layer1/Variable

I'm using tensorflow master compiled against Cuda 8.0 RC.

Not sure if this is an issue with my setup or a bug. Any help is greatly appreciated.

Generated samples are non-negative

For my case, all the generated audio samples seem to be positive integer values.

I used the following architecture for the network. (wavenet_params.json)
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],

I wonder if this only happens to me... and I want to know how to fix it. :)

Improve GPU performance

The performance of the network on GPUs seems to be lagging behind the CPU performance.
I suspect that this is because the 2D convolution isn't designed to work efficiently if the height of the input is 1.
It shouldn't be too difficult to write some custom code to perform an efficient 1D convolution.
For example, an FFT could be used for this.
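For reference, an FFT-based linear convolution can be sketched in NumPy as below (the textbook approach, not code from this repository); for a filter_width of 2 the direct method is probably still faster, so this would need benchmarking:

import numpy as np

def fft_conv1d(signal, kernel):
    # Linear convolution of two 1-D arrays via the FFT, O(n log n).
    n = len(signal) + len(kernel) - 1
    n_fft = 1 << (n - 1).bit_length()  # round up to a power of two
    spectrum = np.fft.rfft(signal, n_fft) * np.fft.rfft(kernel, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n]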

Global condition and Local conditioning

In the white paper, they mention conditioning on a particular speaker as an input that is conditioned globally, and the TTS component as a locally conditioned, up-sampled (deconvolution) input. For the latter, they also mention that they tried just repeating the values, but found that this worked less well than doing the deconvolutions.

Is there effort underway to implement either of these? Practically speaking, implementing the local conditioning would allow us to begin to have this implementation speak recognizable words.

Training not converged

I trained the network for more steps and found that the loss increases at around 9k steps, as follows:
(loss plot)
Since the VCTK corpus has about 44k wave files and the batch size is 1, the loss increases during the first epoch.

Rewrite the input pipeline using a proper Python audio library

Unfortunately, there's currently no easy way to disable the verbose logging output from ffmpeg that results from loading the .wav files into TensorFlow.
(I've recompiled TensorFlow to disable the output while developing the network).

I can use a different library to decode the .wav files, but that would add an extra dependency (and some extra code).

AudioReader.stop_threads

There are two issues with this method:

  1. The definition is incorrect: it lacks a self argument. It should be: def stop_threads(self):
  2. The implementation is problematic: the Python thread object doesn't have a stop() method, and there's no official way to stop a thread in Python.

It seems to me we should remove this method altogether...
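If a stop mechanism is still wanted, a common pattern is a threading.Event that the worker loop checks; this is a generic sketch, not the existing AudioReader code:

import threading
import time

class Worker(object):
    def __init__(self):
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        while not self._stop_event.is_set():
            time.sleep(0.1)  # placeholder for the audio-producing work

    def start(self):
        self._thread.start()

    def stop_threads(self):
        # Signal the loop to exit and wait for the thread to finish.
        self._stop_event.set()
        self._thread.join()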

Generating good audio samples

Let's discuss strategies for producing audio samples.
When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.

Some ideas I've had to improve on this:

  • We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset. We could also try overfitting the dataset a little, which should result in the network reproducing pieces of the train dataset.
  • Remove silence from the recordings. Many of the recordings have periods of recording noise before and after the speakers. It might be worth removing these with librosa.

Better silence trimming

The current implementation computes the short-time RMSE amplitude, applies a given threshold on its value (the units are raw amplitude), and trims the audio from the beginning until a frame above the threshold is found, and from the last frame above the threshold until the end. If the result is empty, that file is discarded.

At some point, we might want to implement a better algorithm, such as using a threshold relative to the maximum amplitude in the example or applying a smoothing filter. The algorithm should also be configurable.
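A sketch of the relative-threshold idea (the frame length and the 2% threshold are illustrative assumptions, not the current defaults):

import numpy as np

def trim_silence_relative(audio, relative_threshold=0.02, frame_length=2048):
    # Trim leading/trailing frames whose RMS energy is below a fraction of the peak RMS.
    if len(audio) < frame_length:
        return audio
    frames = np.array([audio[i:i + frame_length]
                       for i in range(0, len(audio) - frame_length + 1, frame_length)])
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    loud = np.nonzero(rms > relative_threshold * rms.max())[0]
    if loud.size == 0:
        return audio[0:0]  # the whole clip is silence
    return audio[loud[0] * frame_length:(loud[-1] + 1) * frame_length]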

Testing the network on music datasets

I've started to play around with the MagnaTagATune dataset.
There's a small change that needs to be made to the code when training on this dataset:
Because it uses mp3 instead of wav, the pattern in wavenet/audio_reader.py needs to be adjusted.
It would be nice to write a MagnaReader class that inherits from the AudioReader (or contains one), and that's able to filter the content by genre using the provided metadata.

check prerequisites

Is it necessary to write a script that checks prerequisites, in order to avoid import errors such as "ImportError: No module named librosa"?

Storing audio checkpoints when generating

It's pretty annoying to have to wait for the generation of the entire audio waveform before being able to inspect the output.
It would make sense to store the current output periodically.

SyntaxError: non-keyword arg after keyword arg

When I try to run train.py, even with the --help flag, I receive the error:
File "train.py", line 40 parser.add_argument('--num_steps', type=int, default-NUM_STEPS, SyntaxError: non-keyword arg after keyword arg
and am thus unable to train the network.
All dependencies (tensorflow, FFmpeg, etc.) are installed.

Calculation of the loss

In each step, the calculation of the loss is based on the cross-entropy of the input value and the predicted value. Each data point in the predicted value is predicted based on its receptive field in the input value. During prediction, the input value is padded on the left with many 0s to implement the causal convolution. Thus, the first predicted data points are actually predicted based on those padding 0s, and may not match the input values even when the model has been trained for many steps. Is this the reason why the loss drops to around 2 and cannot drop lower? Maybe it would be more reasonable to calculate the loss not from the first data point, but from the N-th point, where N is the receptive field.
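A minimal sketch of that suggestion (an assumption about how one might implement it, not the repository's current loss); the receptive-field formula here ignores the extra width of the initial causal layer:

import numpy as np

def receptive_field(filter_width, dilations):
    # Number of past samples that influence a single output, ignoring the initial causal layer.
    return (filter_width - 1) * sum(dilations) + 1

def masked_mean_loss(per_timestep_loss, filter_width, dilations):
    # Average the per-timestep cross-entropy only where the receptive field contains real samples.
    start = receptive_field(filter_width, dilations)
    if len(per_timestep_loss) <= start:
        return np.mean(per_timestep_loss)  # clip shorter than the receptive field
    return np.mean(per_timestep_loss[start:])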

Channel size

Here you say that you will quantize the channels to 256 possible amplitude values (as mentioned on page 3 of the original paper). When you run the quantization in the preprocessing step, you cast the data from the mu-law companding transformation to a tf.int32, which can take on 4294967296 different values. Am I mistaken, or should this be cast to a tf.int8 instead?

Can't generate samples from checkpoint file

When I try to run generate.py per the readme, I get this:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
Restoring model from model.ckpt-250
Traceback (most recent call last):
  File "generate.py", line 86, in <module>
    main()
  File "generate.py", line 66, in main
    feed_dict={samples: window})
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 710, in run
    run_metadata_ptr)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 908, in _run
    feed_dict_string, options, run_metadata)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 958, in _do_run
    target_list, options, run_metadata)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 978, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Output dimensions must be positive
         [[Node: wavenet/dilated_stack/layer1/conv_filter/BatchToSpace = BatchToSpace[T=DT_FLOAT, block_size=2, _device="/job:localhost/replica:0/task:0/gpu:0"](wavenet/dilated_stack/layer1/conv_filter, wavenet/dilated_stack/layer1/conv_filter/BatchToSpace/crops)]]
Caused by op u'wavenet/dilated_stack/layer1/conv_filter/BatchToSpace', defined at:
  File "generate.py", line 86, in <module>
    main()
  File "generate.py", line 51, in main
    next_sample = net.predict_proba(samples)
  File "/home/ubuntu/jupyter_base/project/tensorflow-wavenet/wavenet.py", line 154, in predict_proba
    raw_output = self._create_network(encoded)
  File "/home/ubuntu/jupyter_base/project/tensorflow-wavenet/wavenet.py", line 112, in _create_network
    self.dilation_channels)
  File "/home/ubuntu/jupyter_base/project/tensorflow-wavenet/wavenet.py", line 51, in _create_dilation_layer
    name="conv_filter")
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 228, in atrous_conv2d
    block_size=rate)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 308, in batch_to_space
    block_size=block_size, name=name)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2317, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/ubuntu/jupyter_base/venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1239, in __init__
    self._traceback = _extract_stack()

What should output wave file sound like?

From my model trained for 1999 steps (which might be too few steps to sound normal),
the output sounds just like noise.

It would be better to provide a well-trained example output for understanding the desired output.

Bug with trimming

I trained the network with the default hyper-parameters:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/jesse/tensorflow_workspace/tensorflow-wavenet/audio_reader.py", line 87, in thread_main
    audio = trim_silence(audio[:, 0])
  File "/home/jesse/tensorflow_workspace/tensorflow-wavenet/audio_reader.py", line 47, in trim_silence
    return audio[indices[0]:indices[-1]]
IndexError: index 0 is out of bounds for axis 0 with size 0

The error occurs at the last line of the trim_silence method:

return audio[indices[0]:indices[-1]]

It seems that some audio consists of only silence and is blank after trimming.
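A hedged sketch of a guard for that case, using a simplified amplitude-based trim rather than the repository's frame-based version, just to show where the check would go:

import numpy as np

def trim_silence(audio, threshold=0.01):
    # Drop leading and trailing samples whose absolute amplitude is below the threshold.
    indices = np.nonzero(np.abs(audio) > threshold)[0]
    # An entirely silent clip yields no indices; return an empty array so the
    # caller can skip the file instead of hitting the IndexError above.
    if indices.size == 0:
        return audio[0:0]
    return audio[indices[0]:indices[-1] + 1]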

more fine-grained handling of audio corpora

This is related to #104.

I'm thinking ahead, perhaps to when this repo supports conditioning on labels. Since most interesting audio sets may not be as neatly organized as VCTK, you can potentially trim your audio and create labels via analysis.

Even without labels, it's useful to be able to train on a subset of a folder, or even on subsets of individual audio files. One workflow I am developing is to isolate a segment of interest, then find audio chunks which are similar to it (using mel bands or other DSP stats useful in music information retrieval). It can be a bit heavy-handed, but it can help weed out noise or other undesirable material.

It might be beyond the scope of what this repo is aiming for, in which case I'll just keep developing it separately. But one useful feature I'd propose to start with is simply letting you specify your training set in a text file rather than just a directory: maybe a JSON file with paths to audio files and a list of subsegments to pull out. Some supporting scripts could be used to generate it, e.g. a function which takes an audio file and a time interval as input, and produces a JSON of T seconds of audio segments from a directory which are most similar to the input.

RuntimeError: ('Coordinator stopped with threads still running: %s', 'Thread-4 Thread-6 Thread-8')

Traceback (most recent call last):
  File "train.py", line 171, in <module>
    main()
  File "train.py", line 167, in main
    coord.join(threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 325, in join
    " ".join(stragglers))
RuntimeError: ('Coordinator stopped with threads still running: %s', 'Thread-4 Thread-6 Thread-8')

I tried the solution in http://stackoverflow.com/questions/36210162/tensorflow-stopping-threads-via-coordinator-seems-not-to-work, but it did not work.
