
Audio Super Resolution Using Neural Networks

This repository implements the audio super-resolution model proposed in:

S. Birnbaum, V. Kuleshov, Z. Enam, P. W. Koh, and S. Ermon. Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations. NeurIPS 2019
V. Kuleshov, Z. Enam, and S. Ermon. Audio Super Resolution Using Neural Networks. ICLR 2017 (Workshop track)

Installation

Requirements

The model is implemented in Python 3.7.10 and uses several additional libraries.

  • tensorflow==2.4.1
  • keras==2.4.0
  • numpy==1.19.5
  • scipy==1.6.0
  • librosa==0.8.3
  • h5py==2.10.0
  • matplotlib==3.3.4

A full list of the packages in our environment is given in requirements.txt.

Setup

To install this package, clone the git repo and create the conda environment:

git clone https://github.com/kuleshov/audio-super-res.git;
cd audio-super-res;
conda env create -f environment.yaml
conda activate audio-super-res

Running the model

Contents

The repository is structured as follows.

  • ./src: model source code
  • ./data: code to download the model data

Retrieving data

The ./data subfolder contains code for preparing the VCTK speech dataset. Make sure you have enough disk space and bandwidth (the dataset is over 18G, uncompressed). You need to type:

cd ./data/vctk;
make;

Next, you must prepare the dataset for training: you will need to create pairs of high- and low-resolution sound patches (typically about 0.5 s in length). We have included a script called prep_vctk.py for this purpose; its usage is as follows.

usage: prep_vctk.py [-h] [--file-list FILE_LIST] [--in-dir IN_DIR] [--out OUT]
                    [--scale SCALE] [--dimension DIMENSION] [--stride STRIDE]
                    [--interpolate] [--low-pass] [--batch-size BATCH_SIZE]
                    [--sr SR] [--sam SAM]

optional arguments:
  -h, --help            show this help message and exit
  --file-list FILE_LIST
                        list of input wav files to process
  --in-dir IN_DIR       folder where input files are located
  --out OUT             path to output h5 archive
  --scale SCALE         scaling factor
  --dimension DIMENSION
                        dimension of patches (use -1 for no patching)
  --stride STRIDE       stride when extracting patches
  --interpolate         interpolate low-res patches with cubic splines
  --low-pass            apply low-pass filter when generating low-res patches
  --batch-size BATCH_SIZE
                        produce a number of patches that is a multiple of the
                        batch size
  --sr SR               audio sampling rate
  --sam SAM             subsampling factor for the data (only applicable to
                        multispeaker data)
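
To make the --dimension and --stride options concrete, here is a minimal sketch (not the script's actual code) of how overlapping patches can be extracted from a waveform:

import numpy as np

def extract_patches(x, dim=8192, stride=4096):
    # Slide a window of `dim` samples over the signal with hop `stride`;
    # trailing samples that do not fill a full window are dropped.
    starts = range(0, len(x) - dim + 1, stride)
    return np.stack([x[i:i + dim] for i in starts])

# A 5-second signal at 16 kHz yields 18 overlapping patches of ~0.5 s each.
patches = extract_patches(np.random.randn(5 * 16000))
print(patches.shape)  # (18, 8192)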

The output of the data preparation step is two .h5 archives containing, respectively, the training and validation pairs of high/low-resolution sound patches. You can also generate these by running make in the corresponding directory, e.g.

cd ./speaker1;
make;

This will use a set of default parameters.

To generate the files needed for the training example below, run the following from the speaker1 directory:

python ../prep_vctk.py \
  --file-list  speaker1-train-files.txt \
  --in-dir ../VCTK-Corpus/wav48/p225 \
  --out vctk-speaker1-train.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass

python ../prep_vctk.py \
  --file-list speaker1-val-files.txt \
  --in-dir ../VCTK-Corpus/wav48/p225 \
  --out vctk-speaker1-val.4.16000.8192.4096.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass

python ../prep_vctk.py \
  --file-list  speaker1-train-files.txt \
  --in-dir ../VCTK-Corpus/wav48/p225 \
  --out vctk-speaker1-train.4.16000.-1.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 4096 \
  --interpolate \
  --low-pass

python ../prep_vctk.py \
  --file-list speaker1-val-files.txt \
  --in-dir ../VCTK-Corpus/wav48/p225 \
  --out vctk-speaker1-val.4.16000.-1.4096.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 4096 \
  --interpolate \
  --low-pass
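
After these commands finish, you can sanity-check the resulting archives with h5py; the 'data' (low-resolution) and 'label' (high-resolution) keys below match what run.py reports when it loads these files:

import h5py

# Inspect one of the generated archives.
with h5py.File('vctk-speaker1-train.4.16000.8192.4096.h5', 'r') as f:
    print(list(f.keys()))    # ['data', 'label']
    print(f['data'].shape)   # (n_patches, 8192, 1): low-resolution patches
    print(f['label'].shape)  # (n_patches, 8192, 1): high-resolution patches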

Audio super resolution tasks

We have included code to prepare two datasets.

  • The single-speaker dataset consists only of VCTK speaker #1; it is relatively quick to train a model (a few hours).
  • The multi-speaker dataset uses the last 8 VCTK speakers for evaluation, and the rest for training; it takes several days to train the model, and several hours to prepare the data.

We suggest starting with the single-speaker dataset.

Training the model

Running the model is handled by the src/run.py script.

usage: run.py train [-h] --train TRAIN --val VAL [-e EPOCHS]
                    [--batch-size BATCH_SIZE] [--logname LOGNAME]
                    [--layers LAYERS] [--alg ALG] [--lr LR] [--model MODEL] 
                    [--r R] [--piano PIANO] [--grocery GROCERY]

optional arguments:
  -h, --help            show this help message and exit
  --train TRAIN         path to h5 archive of training patches
  --val VAL             path to h5 archive of validation set patches
  -e EPOCHS, --epochs EPOCHS
                        number of epochs to train
  --batch-size BATCH_SIZE
                        training batch size
  --logname LOGNAME     folder where logs will be stored
  --layers LAYERS       number of layers in each of the D and U halves of the
                        network
  --alg ALG             optimization algorithm
  --lr LR               learning rate
  --model MODEL         model to use for training (audiounet, audiotfilm, dnn,
                        or spline); defaults to audiounet
  --r R                 upscaling ratio of the data; make sure that the
                        appropriate data files have been generated (note: to
                        generate data with different scaling ratios, change the
                        SCA parameter in the makefile)
  --piano PIANO         false by default; set to true to train on piano data
  --grocery GROCERY     false by default; set to true to train on grocery
                        imputation data
  --speaker SPEAKER     number of speakers being trained on (single or multi);
                        defaults to single
  --pool_size POOL_SIZE
                        size of the pooling window
  --strides STRIDES     size of the pooling strides
  --full FULL           false by default; whether to calculate the "full" SNR
                        after each epoch. The "full" SNR is the SNR across the
                        non-patched data file, rather than the average SNR over
                        all patches, which is calculated by default

Note: to generate the data needed for the grocery imputation experiment, download train.csv.7z from https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data into the data/grocery/grocery directory, unzip the csv, and run prep_grocery.py from the data/grocery directory.

For example, to run the model on data prepared for the single speaker dataset, you would type:

python run.py train \
  --train ../data/vctk/speaker1/vctk-speaker1-train.4.16000.8192.4096.h5 \
  --val ../data/vctk/speaker1/vctk-speaker1-val.4.16000.8192.4096.h5 \
  -e 120 \
  --batch-size 64 \
  --lr 3e-4 \
  --logname singlespeaker \
  --model audiotfilm \
  --r 4 \
  --layers 4 \
  --piano false \
  --pool_size 2 \
  --strides 2 \
  --full true

The above run will store checkpoints in ./singlespeaker.lr0.000300.1.g4.b64.

Note on the models: audiotfilm is the best model.

Pre-Trained Model

See the below link for a pre-trained single-speaker model. This model was trained with the following parameters:

python run.py train \
  --train ../data/vctk/speaker1/vctk-speaker1-train.4.16000.8192.4096.h5 \
  --val ../data/vctk/speaker1/vctk-speaker1-val.4.16000.8192.4096.h5 \
  -e 2   --batch-size 16  --lr 3e-4   --logname singlespeaker \
  --model audiotfilm   --r 4   --layers 4   --piano false \
  --pool_size 2   --strides 2

https://drive.google.com/file/d/1pqIaxtZpt9GRc-Yp1zCzVoSbQFSLnERF/view?usp=sharing

To use the model, unzip the file in the src directory and run eval with the logname corresponding to the checkpoint file.

Testing the model

The run.py command may also be used to evaluate the model on new audio samples.

usage: run.py eval [-h] --logname LOGNAME [--out-label OUT_LABEL]
                   [--wav-file-list WAV_FILE_LIST] [--r R] [--sr SR]

optional arguments:
  -h, --help            show this help message and exit
  --logname LOGNAME     path to training checkpoint
  --out-label OUT_LABEL
                        append label to output samples
  --wav-file-list WAV_FILE_LIST
                        list of audio files for evaluation
  --r R                 upscaling factor
  --sr SR               high-res sampling rate

In the above example, we would type:

python run.py eval \
  --logname ./singlespeaker.lr0.000300.1.g4.b64/model.ckpt-20101 \
  --out-label singlespeaker-out \
  --wav-file-list ../data/vctk/speaker1/speaker1-val-files.txt \
  --r 4 \
  --pool_size 2 \
  --strides 2 \
  --model audiotfilm

This will look at each file specified via the --wav-file-list argument (these must be high-resolution samples), and create for each file f.wav three audio samples:

  • f.singlespeaker-out.hr.wav: the high resolution version
  • f.singlespeaker-out.lr.wav: the low resolution version processed by the model
  • f.singlespeaker-out.pr.wav: the super-resolved version

These will be found in the same folder as f.wav. Because of how our model is defined, the number of samples in the input must be a multiple of 2**downscaling_layers; if that's not the case, we will clip the input file (potentially shortening it by a fraction of a second).
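
As an illustration only, here is a minimal sketch of trimming an input signal to satisfy such a length constraint (the 2**(layers+1) block size mirrors the truncation in the predict() code quoted in the issues further below):

import numpy as np

def trim_for_model(x, layers=4):
    # Drop trailing samples so the length is a multiple of the block size.
    block = 2 ** (layers + 1)
    return x[:len(x) - (len(x) % block)]

x = np.random.randn(22050)
print(len(trim_for_model(x)))  # 22048: two trailing samples are clipped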

Disclaimer: We recently upgraded the versions of many of the packages, including Keras and Tensorflow. The example workflow for training and predicting should work, but the codebase has not been fully tested. Please create an issue if you run into any errors.

Keras Layer

keras_layer.py implements the TFiLM layer as a custom Keras layer. The code below illustrates how to use this custom layer.

import numpy as np
from keras.layers import Input, Dense, Flatten
from keras.models import Model
import tensorflow as tf

### Insert definition of TFiLM layer here. ####

# Toy data: 2 examples, each a sequence of 100 timesteps with 50 features.
x = np.random.random((2, 100, 50))
y = np.zeros((2,))

inputs = Input(shape=(100, 50))
l = TFiLM(2)(inputs)
l = Flatten()(l)
outputs = Dense(1, activation='sigmoid')(l)


# This creates a model that includes
# the Input layer, a TFiLM layer, and a dense layer
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
model.fit(x, y, epochs=10)  # starts training
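
Once compiled and trained, the model containing the TFiLM layer behaves like any other Keras model; for example, inference on a new batch of the same shape:

preds = model.predict(np.random.random((4, 100, 50)))
print(preds.shape)  # (4, 1)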

Remarks

We would like to emphasize a few points.

  • Machine learning algorithms are only as good as their training data. If you want to apply our method to your personal recordings, you will most likely need to collect additional labeled examples.
  • You will need a very large model to fit large and diverse datasets (such as the Million Song Dataset).
  • Interestingly, super-resolution works better on aliased input (no low-pass filter). This is not reflected well in objective benchmarks, but is noticeable when listening to the samples. For applications like compression (where you control the low-res signal), this may be important.
  • More generally, the model is very sensitive to how low-resolution samples are generated. Even the type of low-pass filter (Butterworth, Chebyshev) will affect performance; a sketch of one such pipeline is given below.
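
For illustration, here is a minimal sketch (not the repository's exact preprocessing) of one such low-resolution pipeline, using a Butterworth low-pass filter followed by subsampling with scipy:

import numpy as np
from scipy import signal

def make_lowres(x, r=4, order=8):
    # Low-pass at the new Nyquist rate with a zero-phase Butterworth filter,
    # then keep every r-th sample.
    b, a = signal.butter(order, 1.0 / r, btype='low')
    x_lp = signal.filtfilt(b, a, x)
    return x_lp[::r]

x_hr = np.random.randn(16000)   # 1 s of audio at 16 kHz
x_lr = make_lowres(x_hr, r=4)   # 4000 samples, i.e. 4 kHz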

Extensions

The same architecture can be used on many time series tasks outside the audio domain. We have successfully used it to impute functional genomics data and denoise EEG recordings.

Feedback

Send feedback to Sawyer Birnbaum.

audio-super-res's People

Contributors

jimmsta, kuleshov, lootwig, sawyerb, wmramadan

audio-super-res's Issues

makefile error in speaker1

mv vctk-speaker1-val.4.16000.-1.4096.h5.tmp vctk-speaker1-val.4.16000.-1.4096.h5
process_begin: CreateProcess(NULL, mv vctk-speaker1-val.4.16000.-1.4096.h5.tmp vctk-speaker1-val.4.16000.-1.4096.h5, ...) failed.
make (e=2): The system cannot find the file specified.
make[1]: *** [vctk-speaker1-val.4.16000.-1.4096.h5] Error 2
make[1]: Leaving directory `C:/Users/DJI12/OneDrive/Desktop/audio-super-res/data/vctk/speaker1'
make: *** [patches] Error 2

no clue why it cannot find the file.

Error at audiofilm.py: ValueError: Unexpectedly found an instance of type `<class 'tensorflow.python.framework.ops.Tensor'>`. Expected a symbolic tensor instance.

Hi,
I'm getting the following error, and have no idea how to proceed.
Any help would be much appreciated.
Thanks!

training...
List of arrays in input file: ['data', 'label']
Shape of X: (3328, 8192, 1)
Shape of Y: (3328, 8192, 1)
List of arrays in input file: ['data', 'label']
Shape of X: (384, 8192, 1)
Shape of Y: (384, 8192, 1)
audiotfilm
building model...
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/keras/engine/base_layer.py", line 279, in assert_input_compatibility
    K.is_keras_tensor(x)
  File "/usr/lib/python3/dist-packages/keras/backend/theano_backend.py", line 221, in is_keras_tensor
    str(type(x)) + '`. '
ValueError: Unexpectedly found an instance of type `<class 'tensorflow.python.framework.ops.Tensor'>`. Expected a symbolic tensor instance.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/run.py", line 185, in <module>
    main()
  File "src/run.py", line 182, in main
    args.func(args)
  File "src/run.py", line 128, in train
    model = get_model(args, n_dim, r, from_ckpt=False, train=True)
  File "src/run.py", line 166, in get_model
    strides=args.strides, opt_params=opt_params, log_prefix=args.logname)
  File "/cs/labs/adiyoss/moshemandel/audio-super-res-2/src/models/audiotfilm.py", line 31, in __init__
    opt_params=opt_params, log_prefix=log_prefix)
  File "/cs/labs/adiyoss/moshemandel/audio-super-res-2/src/models/model.py", line 48, in __init__
    self.predictions = self.create_model(n_dim, r)
  File "/cs/labs/adiyoss/moshemandel/audio-super-res-2/src/models/audiotfilm.py", line 89, in create_model
    activation=None, padding='same', kernel_initializer=Orthogonal()))(x)
  File "/usr/lib/python3/dist-packages/keras/engine/base_layer.py", line 414, in __call__
    self.assert_input_compatibility(inputs)
  File "/usr/lib/python3/dist-packages/keras/engine/base_layer.py", line 285, in assert_input_compatibility
    str(inputs) + '. All inputs to the layer '
ValueError: Layer conv1d_1 was called with an input that isn't a symbolic tensor. Received type: <class 'tensorflow.python.framework.ops.Tensor'>. Full input: [<tf.Tensor 'X:0' shape=(None, None, 1) dtype=float32>]. All inputs to the layer should be tensors.

Where can the output be found?

I am using the pretrained model, and i run the command:

python run.py eval \
  --logname ./singlespeaker.lr0.000300.1.g4.b64/model.ckpt-20101 \
  --out-label singlespeaker-out \
  --wav-file-list ../data/vctk/speaker1/speaker1-val-files.txt \
  --r 4 \
  --pool_size 2 \
  --strides 2 \
  --model audiotfilm

and get the following stdout:

2021-12-15 06:02:59.919846: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: ca
nnot open shared object file: No such file or directory
2021-12-15 06:02:59.919934: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Using TensorFlow backend.
audiotfilm
2021-12-15 06:03:03.419824: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-15 06:03:03.419980: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open 
shared object file: No such file or directory
2021-12-15 06:03:03.420006: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-12-15 06:03:03.420041: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (final-project): /proc/driver/
nvidia/version does not exist
2021-12-15 06:03:03.420379: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use t
he following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-15 06:03:03.421121: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-15 06:03:04.609327: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-12-15 06:03:04.888640: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299995000 Hz
p225_355.wav
p225_356.wav
p225_357.wav
p225_358.wav
p225_359.wav
p225_363.wav
p225_365.wav
p225_366.wav
etc..

But when it's finished running, I don't see any output files to analyze. In addition, how do I get the LSD and SNR scores?

Can I improve audio of a speech entirely - not interested in training

Hello. I have full lectures on my YouTube channel.

E.g., a lecture: https://youtu.be/XLsrsCCdSnU

I want to improve the audio quality.

What script do I need to run to improve an entire speech without any trimming? I need to keep it synced with the video.

I can extract the audio as WAV with ffmpeg.

Is there only one pre-trained model? If not, which pre-trained model is best?

I can set up and install the dependencies.

But before spending time, I want to know whether it is possible to improve long recordings and what inference command is necessary.

Thank you so much.

ValueError: Shapes (4, 128, 128) and () are incompatible

Hi,

I know that the authors of this project have moved on, but I am still curious whether anyone has run into a similar issue regarding the shapes for the LSTM network.

I am only attempting to train the single-speaker.

First, I had to fork the project and fix the Makefile so that it pulls the VCTK-Corpus dataset from a different source (http://www.udialogue.org/download/VCTK-Corpus.tar.gz).

Second, I had to downgrade joblib to 0.11 due to a bug in the most recent Python2.7-compatible version.

This allowed me to download, make, and perform the downsampling on the dataset using the following parameters (the default parameters in the Makefile would simply not work for me and produce empty .h5 files):

sca = 4
sr = 16000

tr_dim = 8192
tr_str = 4096

va_dim = 8192
va_str = 4096

Now, whenever I am trying to train using the following (ignore the $output_name variables):

!python run.py train 
  --train ../data/vctk/speaker1/$train_output_name
  --val ../data/vctk/speaker1/$val_output_name -e 120 
  --batch-size 64 
  --lr 3e-4 
  --logname singlespeaker 
  --model audiotfilm 
  --r 4 
  --layers 4 
  --piano false
  --pool_size 8 
  --strides 8 
  --full true

I receive this error:

/content/gdrive/My Drive/audio-super-res/src
Using TensorFlow backend.
List of arrays in input file: [u'data', u'label']
Shape of X: (3328, 8192, 1)
Shape of Y: (3328, 8192, 1)
List of arrays in input file: [u'data', u'label']
Shape of X: (384, 8192, 1)
Shape of Y: (384, 8192, 1)
audiotfilm
building model...
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
  File "run.py", line 172, in <module>
    main()
  File "run.py", line 169, in main
    args.func(args)
  File "run.py", line 119, in train
    model = get_model(args, n_dim, r, from_ckpt=False, train=True)
  File "run.py", line 155, in get_model
    strides=args.strides, opt_params=opt_params, log_prefix=args.logname)  
  File "/content/gdrive/My Drive/audio-super-res/src/models/audiotfilm.py", line 29, in __init__
    opt_params=opt_params, log_prefix=log_prefix)
  File "/content/gdrive/My Drive/audio-super-res/src/models/model.py", line 48, in __init__
    self.predictions = self.create_model(n_dim, r)
  File "/content/gdrive/My Drive/audio-super-res/src/models/audiotfilm.py", line 95, in create_model
    x_norm = _make_normalizer(x, nf, nb)
  File "/content/gdrive/My Drive/audio-super-res/src/models/audiotfilm.py", line 60, in _make_normalizer
    x_rnn = LSTM(output_dim = n_filters, return_sequences = True)(x_in_down)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 546, in __call__
    self.build(input_shapes[0])
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 784, in build
    self.W = K.concatenate([self.W_i, self.W_f, self.W_c, self.W_o])
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 1427, in concatenate
    return tf.concat(axis, [to_dense(x) for x in tensors])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1254, in concat
    tensor_shape.scalar())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.py", line 1023, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (4, 128, 128) and () are incompatible

Anyone see something like this before? Thanks!

parameters which gave best results on multi speaker

Hi, can you please share some training / data preparation parameters which gave the best results? Most of them are not given in the paper or are unclear.
For example: dimension (input vector size), stride step, batch size, number of layers, learning rate, number of epochs in training.

Thanks in advance

MULTI-SPEAKER HELP

Hi,
I want to know some details about the multi-speaker training. Specifically, what changes are there in the architecture/data preparation for the single- vs. multi-speaker dataset? Are you merging multiple speakers into one audio clip or not? Please explain the steps for training on a multi-speaker custom dataset.

makefile error

When I run
PATH\audio-super-res-master\data\vctk> make;
make VCTK-Corpus
make[1]: Entering directory 'PATH/audio-super-res-master/data/vctk'
wget http://www.udialogue.org/download/VCTK-Corpus.tar.gz
process_begin: CreateProcess(NULL, wget http://www.udialogue.org/download/VCTK-Corpus.tar.gz, ...) failed.
make (e=2): The system cannot find the file specified.
make[1]: *** [Makefile:13: VCTK-Corpus.tar.gz] Error 2
make[1]: Leaving directory 'PATH/audio-super-res-master/data/vctk'
make: *** [Makefile:7: all] Error 2

Is this a problem on my side? I don't know, but I think it is with the dataset.

make file does not download data

Calling the make file just waits for an HTTP response... is the link down?
Opening the link in a browser also loads until it times out.

Can someone provide a link?

wget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
--2019-10-25 17:40:26-- http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
Resolving homepages.inf.ed.ac.uk (homepages.inf.ed.ac.uk)... 129.215.32.113
Connecting to homepages.inf.ed.ac.uk (homepages.inf.ed.ac.uk)|129.215.32.113|:80... connected.
HTTP request sent, awaiting response...

estimated release date?

If this technique is real, then it's amazing. I'm curious how long it's going to take before you feel comfortable releasing the tech publicly so that people can input their LQ audio files and get something better out of them...

Unclear instructions with pretrained model

So I've downloaded the pretrained model (thank you) and was just reading the instructions here;

https://github.com/kuleshov/audio-super-res#running-the-model

"Running the model" as follows;

Contents

The repository is structured as follows.

    ./src: model source code
    ./data: code to download the model data

Retrieving data

The ./data subfolder contains code for preparing the VCTK speech dataset. Make sure you have enough disk space and bandwidth (the dataset is over 18G, uncompressed). You need to type:

cd ./data/vctk;
make;

Next, you must prepare the dataset for training:

I'm a little confused, as I shouldn't need to train now that I have the pretrained model?

Cheers!

Resize-convolution instead of subpixel convolution

Hi,

Thanks for this implementation.

Regarding the sub-pixel convolution approach for up-sampling the signal: do you think this was better than nearest-neighbor resize followed by convolution, as argued by Odena here? He says that the latter is better.

Thanks in advance

Reproducing results for multispeaker model

Hi!

I am trying to reproduce the results of the paper.

For data preparation I use the following commands:

python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.8192.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.8192.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64


python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.-1.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.-1.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64

For training I use the following command:

python3 run.py train --train ../data/vctk/multispeaker/vctk-multispeaker-train.4.16000.8192.8192.h5 --val ../data/vctk/multispeaker/vctk-multispeaker-val.4.16000.8192.8192.h5.tmp -e 50 --batch-size 64 --lr 3e-4 --logname multispeaker --model audiotfilm --r 4 --layers 4 --piano false --speaker multi --pool_size 2 --strides 2 --full true

After training I use the following command to infer the model:

python run.py eval --logname ./model.ckpt-53351 --out-label mul-out --wav-file-list ./test_files.txt --r 4 --pool_size 2 --strides 2 --model audiotfilm --speaker multi

I got poor performance from your model; the spectrograms look like this (predicted above, ground truth below):

[spectrogram comparison image]

I doubt that this behavior is expected. Could you please give me a hint as to where I may have gone wrong, or better, provide the model checkpoint for the multi-speaker model?

Thank you in advance!

Why is some data cut from x_sp in predict?

In audiounet.py

def predict(self, X):
    assert len(X) == 1
    x_sp = spline_up(X, self.r) 
    x_sp = x_sp[:len(x_sp) - (len(x_sp) % (2**(self.layers+1)))] # what does this line mean?
    X = x_sp.reshape((1,len(x_sp),1))
    feed_dict = self.load_batch((X,X), train=False)
    return self.sess.run(self.predictions, feed_dict=feed_dict)

Could you please tell me why x_sp is cut like this?
Does it mean the length of X will be changed as it passes through the whole neural network?

Installation is hard for the average user.

Hi!
Thank you for this repository. I have a suggestion. Most users are looking for something executable and usable out of the box. For example, this repository https://github.com/xinntao/Real-ESRGAN, which is a wonderful image up-scaling AI, has pre-trained builds for all platforms.

What I said above might take a bit of time, but until then it would be much nicer to use TensorFlow Docker (https://www.tensorflow.org/install/docker) to save a lot of time getting this AI to work locally.

Regards,
Alixsep

error when building model

The message is as below:
building model...
D-Block: (?, ?, 128)
D-Block: (?, ?, 256)
D-Block: (?, ?, 512)
D-Block: (?, ?, 512)
Traceback (most recent call last):
File "run.py", line 111, in
main()
File "run.py", line 108, in main
args.func(args)
File "run.py", line 73, in train
model = get_model(args, n_dim, r, from_ckpt=False, train=True)
File "run.py", line 102, in get_model
opt_params=opt_params, log_prefix=args.logname)
File "/home/audio-super-res-master/src/models/audiounet.py", line 27, in init
opt_params=opt_params, log_prefix=log_prefix)
File "/home/audio-super-res-master/src/models/model.py", line 47, in init
self.predictions = self.create_model(n_dim, r)
File "/home/audio-super-res-master/src/models/audiounet.py", line 80, in create_model
x = merge([x, l_in], mode='concat', concat_axis=-1)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.1-py2.7.egg/keras/engine/topology.py", line 1690, in merge
return merge_layer(inputs)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.1-py2.7.egg/keras/engine/topology.py", line 1463, in call
return self.call(inputs, mask)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.1-py2.7.egg/keras/engine/topology.py", line 1394, in call
return K.concatenate(inputs, axis=self.concat_axis)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.1-py2.7.egg/keras/backend/tensorflow_backend.py", line 1427, in concatenate
return tf.concat(axis, [to_dense(x) for x in tensors])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1096, in concat
dtype=dtypes.int32).get_shape().assert_is_compatible_with(
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 383, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 303, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).name))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

The code is run on Ubuntu 14.04 with TF 1.4.0.
How can I correct this error? Thanks.

Upload a pre trained model?

Can a pre-trained model be uploaded to the repo? Some of us don't have enough computation power to train such a massive model.

Pooling window and pooling strides

Do you use the pooling window and pooling strides in your code? They are provided as arguments. They are both supposed to be 8, but I have trouble finding where you use them.

OOM when trained on GeForce GTX 1080

@kuleshov Thanks for sharing! When I try to train the model with the VCTK corpus (1 speaker), I get the OOM problem below:

`2018-05-29 06:19:04.735604: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 11 Chunks of size 134217728 totalling 1.38GiB
2018-05-29 06:19:04.735621: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 21 Chunks of size 268435456 totalling 5.25GiB
2018-05-29 06:19:04.735642: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 391118848 totalling 373.00MiB
2018-05-29 06:19:04.735659: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 469762048 totalling 448.00MiB
2018-05-29 06:19:04.735677: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 536870912 totalling 512.00MiB
2018-05-29 06:19:04.735693: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 9.63GiB
2018-05-29 06:19:04.735715: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 10968950375
InUse: 10343534592
MaxInUse: 10766956032
NumAllocs: 267
MaxAllocSize: 4026531840

2018-05-29 06:19:04.735786: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***********************************************************x__
2018-05-29 06:19:04.735827: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[128,2048,512]
Traceback (most recent call last):
File "run.py", line 111, in
main()
File "run.py", line 108, in main
args.func(args)
File "run.py", line 76, in train
model.fit(X_train, Y_train, X_val, Y_val, n_epoch=args.epochs)
File "audio-super-res/src/models/model.py", line 217, in fit
tr_objective = self.train(feed_dict)
File "audio-super-res/src/models/model.py", line 257, in train
_, loss = self.sess.run([self.train_op, self.loss], feed_dict=feed_dict)
File "anaconda3/envs/audio-super-res/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "anaconda3/envs/audio-super-res/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "anaconda3/envs/audio-super-res/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "anaconda3/envs/audio-super-res/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,2048,512]
[[Node: generator/upsc_conv1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/upsc_conv1/subpixel/transpose_1, generator/downsc_conv1/sub, g
enerator/upsc_conv1/concat/axis)]]`

Train command is:
python -u run.py train --train ../data/vctk/speaker1/vctk-speaker1-train.4.16000.8192.4096.h5 --val ../data/vctk/speaker1/vctk-speaker1-val.4.16000.8192.4096.h5 -e 120 --batch-size 128 --lr 3e-4 --logname singlespeaker

And the dimensions of the train and test data are:
List of arrays in input file: [u'data', u'label']
Shape of X: (3328, 8192, 1)
Shape of Y: (3328, 8192, 1)
List of arrays in input file: [u'data', u'label']
Shape of X: (384, 8192, 1)
Shape of Y: (384, 8192, 1)

The environment configuration is:
tensorflow-gpu==0.12.1
keras==1.2.1
numpy==1.12.0
scipy==0.18.1
librosa==0.4.3
h5py==2.6.0

The data is not very large, so why does the OOM problem occur? I don't understand. Do you know?

Memory is not enough

There is not enough memory when preprocessing the multispeaker dataset and when training; it takes up too much memory. Is there any solution? My computer does not have that much memory.

LSTM kernel not compatible with cuDNN and GPU utilization is low

I'm currently trying to train the multi-speaker/audiotfilm setup.

I'm on CUDA 11 on a box with 4 Pascal GPUs.
When I launch the training process, I see messages like this for all the layers:

WARNING:tensorflow:Layer lstm_5 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU

Is that expected/normal? The model is training though.

Also, I notice that the only GPU really in use (the other available GPUs just have a bit of memory allocated, but nothing really happens there) isn't used at full capacity. I see mostly 50% usage. I suspect that this is because of the way the data is loaded: we're constantly waiting for the data to be loaded onto the GPU.

Is it possible to optimize that part?

How to flip phase if designing ASR in frequency domain instead of time domain?

Hi Professor

I am thinking of re-implementing an ASR model with a very similar structure, but in the frequency domain. Basically, I will first apply an STFT to the signal and split the spectrogram into a log-power map + phase map. During ASR, I only predict the log-power map and expand the phase map from LR to HR. However, I am not very clear on how to expand the phase map.

In one of the papers I read, the author says to flip and set a negative sign. For example, if the ASR ratio is 4, you just flip three times. Say [1,2,3,4,5] is the phase map of a signal; in x2 mode, the HR phase map should be [1,2,3,4,5,-4,-3,-2,-1]; in x3 mode, the map should be [1,2,3,4,5,-4,-3,-2,-1,1,2,3,4]; and in x4 mode, it becomes [1,2,3,4,5,-4,-3,-2,-1,1,2,3,4,-4,-3,-2,-1].

To verify, I load a 16 kHz audio file, apply an STFT, split power and phase, slice to 1/4, and flip three times to reconstruct the phase map. Finally, through an ISTFT I get the output audio, but it does not sound right; the output has a slight phase shift.

My question is: what is the correct way to expand the phase map when performing ASR?

I would very much appreciate it if you could answer my question.

Thanks a lot

Mike

Incompatible shapes of X(data) and Y(label)

Hi Volodymyr ,

Thanks for sharing your nice work. I have some problems when preparing the data. I set the parameters like this: --stride=4096, --dimension=8192. But after finishing the data preparation, I found that the data (lr_patches) dimension and the label (hr_patches) dimension don't match. They are (1600, 4096, 1) and (1600, 8192, 1); as you can see, the second dimension is not the same.
This results in a problem when calculating the loss while training the model; the error is shown below:

File "/home/disk1/xyzhou/workspace/audio-super-res-py2/audio-super-res/src/models/model.py", line 72, in create_train_op
self.loss = self.create_objective(X, Y, opt_params)
File "/home/disk1/xyzhou/workspace/audio-super-res-py2/audio-super-res/src/models/model.py", line 102, in create_objective
sqrt_l2_loss = tf.sqrt(tf.reduce_mean((P-Y)**2 + 1e-6, axis=[1,2]))

InvalidArgumentError (see above for traceback): Incompatible shapes: [64,4096,1] vs. [64,8192,1];

Sorry to bother you. Please let me know if you have any suggestions. Thank you.

Best regards,
xinyong

Error regarding 'dim_ordering'

Hi Volodymyr ,

Thanks for sharing your nice work. I have successfully run through the preparation of the data; however, I hit a bug when trying to run the training job with the following command.

python run.py train --train ../data/vctk/speaker1/vctk-speaker1-train.4.16000.8192.4096.h5 --val ../data/vctk/speaker1/vctk-speaker1-val.4.16000.8192.4096.h5 -e 120 --batch-size 2 --lr 3e-4 --logname singlespeaker

The error that I got is shown as follows:

audiounet.py", line 111, in orthogonal_init
return orthogonal(shape, name=name, dim_ordering=dim_ordering)
TypeError: init() got an unexpected keyword argument 'dim_ordering'

I have searched extensively and tried to upgrade Keras to the latest version, 2.2.0. However, the error still shows up. Could you please let me know if you have a solution to this problem? Thank you for your help!

Best regards,
Alice

Building Speaker 1 Set

python ./data/vctk/prep_vctk.py gives me
a type error saying
TypeError: expected string or buffer

A bug in the original paper?

The original paper includes an incorrect statement in the Architectural analysis section.

incorrect (original):

the green-ish line display ...
the yellow curve ....
the green curve ....

correct:

the red line display ...
the green curve ....
the blue curve ....

SNR decrease for speaker1

I trained the model for 4 kHz to 8 kHz. The LSD decreases similarly to the paper, but the SNR does not increase; it decreases.
Can you give me some suggestions?
python run.py train \
  --train ../data/vctk/speaker1/vctk-speaker1-train.2.16000.8192.4096.h5 \
  --val ../data/vctk/speaker1/vctk-speaker1-val.2.16000.8192.4096.h5 \
  --epochs 500 \
  --batch-size 64 \
  --lr 3e-4 \
  --logname singlespeaker \
  --model audiounet \
  --r 2 \
  --layers 4 \
  --piano false \
  --pool_size 2 \
  --strides 2 \
  --full false

Batch normalization is commented out - mistake?

Hi, thanks for submitting this paper, it's super cool.

I came across the lines regarding applying batch normalization, which are commented out (like audiounet.py:65 - # x = BatchNormalization(mode=2)(x)). At the same time, in the paper you mention batch normalization as part of the network architecture.
Is the comment-out a mistake?

Thanks

Colab?

Will this work with colab? Has anyone made a notebook?

OSError: File ./singlespeaker.lr0.000300.1.g4.b16/model.ckpt-209.data-00000-of-00001.meta does not exist.

Hi,

I followed README.md to create the trained model for a single speaker. The final terminal output of the training was this:

/root/audio-super-res/src/models/model.py:65: FutureWarning: Pass n_fft=2048 as keyword args. From version 0.10 passing these as positional arguments will result in an error
  S = librosa.stft(x, 2048)
List of arrays in input file: ['data', 'label']
Shape of X: (3328, 8192, 1)
Shape of Y: (3328, 8192, 1)
List of arrays in input file: ['data', 'label']
Shape of X: (384, 8192, 1)
Shape of Y: (384, 8192, 1)
audiotfilm
building model...
D-Block:  (None, None, 128)
D-Block:  (None, None, 256)
D-Block:  (None, None, 512)
D-Block:  (None, None, 512)
U-Block:  (None, None, 1024)
U-Block:  (None, None, 1024)
U-Block:  (None, None, 512)
U-Block:  (None, None, 256)
creating train_op with params: {'alg': 'adam', 'lr': 0.0003, 'b1': 0.9, 'b2': 0.999, 'batch_size': 16, 'layers': 4}
Parameters: 68221186
Epoch 1 of 2 took 5730.696s (208 minibatches)
  training l2_loss/segsnr/LSD:      0.007389 15.040772   4.680083
  validation l2_loss/segsnr/LSD:    0.006574 16.088799   4.565674
Epoch 2 of 2 took 12187.555s (208 minibatches)
  training l2_loss/segsnr/LSD:      0.007181 15.214238   4.497891
  validation l2_loss/segsnr/LSD:    0.006336 16.305356   4.398260

These are the trained result files.

Then, I ran the following command to test the trained model:

$ python3 run.py eval --logname ./singlespeaker.lr0.000300.1.g4.b16/model.ckpt-209.data-00000-of-00001 --out-label singlespeaker-out --wav-file-list ../data/vctk/speaker1/speaker1-val-files.txt --r 4 --pool_size 2 --strides 2 --model audiotfilm

Then, I get this error:

OSError: File ./singlespeaker.lr0.000300.1.g4.b16/model.ckpt-209.data-00000-of-00001.meta does not exist.

I tried to forcibly change the filename model.ckpt-209.meta to model.ckpt-209.data-00000-of-00001.meta. Then I get this error1 and error2:

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./singlespeaker.lr0.000300.1.g4.b16/model.ckpt-209.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

I wonder what could be the reason for this bug. Thanks.

Convolution layer in upsampling block

Hi,

I read this paper and it is pretty good. I am trying to implement it, and I came across the following questions, which I am not able to answer from the paper.

  1. Why is there a convolution layer in the upsampling block? Is the subpixel upsampling layer alone not enough?
  2. What are the parameters of the convolution layers in both the upsampling and downsampling blocks, e.g. the number of initial filters in the downsampling block and the number of filters in the upsampling block convolutions?
  3. What filter size did you use in the convolution layers?
  4. What is the B (bottleneck) block? Just a pass-through from the downsampling to the upsampling block?

[diagram: upsampling / downsampling blocks]

Thanks. Moreover, please let me know if you are planning to release this code to the public. It would be really helpful.

  • Achal

question of results in samples folder

  1. Regarding the file names *.pr.wav and *.sp.wav: which is the result of the paper's super-resolution network, and what is the other?

  2. Sp1.3.4.sp.wav seems to have some blind areas in the spectrum view at some frequencies. Why? Nearly all *.sp.wav files have this phenomenon.

Thanks.

Code parameters not matching with paper

Hi Kuleshov,

I was trying to reproduce the results from the paper. But a few parameters, like the filter size and the number of filters, are not in the paper, so I am using the parameters from your code.

There are a few things not matching up with the paper:

  1. Audio length: 6000 in paper but 8192 in code
  2. Filter sizes and strides

Can I reproduce the same or better results than the paper using this code?

About the metrics SNR

As far as I know, the SNR metric in the paper is defined as in the following picture:
[image: SNR definition from the paper]

But in your code, snr is defined as:
"snr = 20 * tf.log(sqrn_l2_norm / sqrt_l2_loss + 1e-8) / tf.log(10.)"

About this, I have two questions.

  1. In the paper, there is 10 * log(), but in your formula it is 20 * log().
  2. In the paper, there is nothing about " / tf.log(10.)", but you have it.

I hope I can get your help; I will be grateful.

Update to keras 2.0 and Tensorflow 1.2

I am using the latest Keras and TensorFlow. I am getting many errors and many deprecation warnings. I can send a PR with the latest API changes for Keras and TensorFlow if you are interested!

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [128 32 512] and doesn't match input 0 with shape [128 16 512]

Hi!
I recently encountered a problem as follows:
List of arrays in input file: ['data', 'label']
Shape of X: (66688, 8192, 1)
Shape of Y: (66688, 8192, 1)
List of arrays in input file: ['data', 'label']
Shape of X: (5248, 8192, 1)
Shape of Y: (5248, 8192, 1)
audiotfilm
building model...
D-Block: (None, None, 128)
D-Block: (None, None, 256)
D-Block: (None, None, 512)
D-Block: (None, None, 512)
U-Block: (None, None, 1024)
U-Block: (None, None, 1024)
U-Block: (None, None, 512)
U-Block: (None, None, 256)
creating train_op with params: {'alg': 'adam', 'lr': 0.001, 'b1': 0.9, 'b2': 0.999, 'batch_size': 128, 'layers': 4}
WARNING:tensorflow:From /root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/util/tf_should_use.py:247: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
Parameters: 68221186
Traceback (most recent call last):
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [128 32 512] and doesn't match input 0 with shape [128 16 512].
[[{{node gradients/generator/upsc_conv3/concatenate/concat_grad/ConcatOffset}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 176, in
main()
File "run.py", line 173, in main
args.func(args)
File "run.py", line 125, in train
model.fit(X_train, Y_train, X_val, Y_val, n_epoch=args.epochs, r=args.r, speaker=args.speaker, grocery=args.grocery, piano=args.piano, calc_full_snr = full)
File "/root/xiaojian/two/audio-super-res-master/src/models/model.py", line 249, in fit
tr_objective = self.train(feed_dict)
File "/root/xiaojian/two/audio-super-res-master/src/models/model.py", line 324, in train
[self.train_op, self.loss], feed_dict=feed_dict)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
run_metadata_ptr)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
run_metadata)
File "/root/anaconda3/envs/xiaojian/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [128 32 512] and doesn't match input 0 with shape [128 16 512].
[[node gradients/generator/upsc_conv3/concatenate/concat_grad/ConcatOffset (defined at /root/xiaojian/two/audio-super-res-master/src/models/model.py:151) ]]

Is this problem caused by the environment configuration, or is it something else?

