anicolson / deepxi

Deep Xi: A deep learning approach to a priori SNR estimation implemented in TensorFlow 2/Keras. For speech enhancement and robust ASR.

License: Mozilla Public License 2.0

Python 30.30% Shell 9.56% MATLAB 60.15%
resnet tensorflow speech-enhancement robust-asr deepxi a-priori-snr-estimator mmse minimum-mean-square-error mmse-lsa residual-networks

deepxi's Introduction

Deep Xi: A Deep Learning Approach to A Priori SNR Estimation for speech enhancement.

News

New journal paper:

  • On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation [link] [.pdf]

New trained model:

  • A trained MHANet is available in the model directory.

New journal paper:

  • Masked Multi-Head Self-Attention for Causal Speech Enhancement [link] [.pdf]

New journal paper:

  • Spectral distortion level resulting in a just-noticeable difference between an a priori signal-to-noise ratio estimate and its instantaneous case [link] [.pdf]

New conference paper:

  • Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement (INTERSPEECH 2021) [link]


Introduction

Deep Xi is implemented in TensorFlow 2/Keras and can be used for speech enhancement, noise estimation, mask estimation, and as a front-end for robust ASR. Deep Xi (where the Greek letter 'xi' or ξ is pronounced /zaɪ/ and is the symbol used in the literature for the a priori SNR) is a deep learning approach to a priori SNR estimation that was proposed in [1]. Some of its use cases include:

  • Minimum mean-square error (MMSE) approaches to speech enhancement.
  • MMSE-based noise PSD estimators, as in DeepMMSE [2].
  • Ideal binary mask (IBM) estimation for missing feature approaches.
  • Ideal ratio mask (IRM) estimation for source separation.
  • Front-end for robust ASR.

How does Deep Xi work?

A training example is shown in Figure 2. A deep neural network (DNN) within the Deep Xi framework is fed the noisy-speech short-time magnitude spectrum as input. The training target of the DNN is a mapped version of the instantaneous a priori SNR (i.e. the mapped a priori SNR). The instantaneous a priori SNR is mapped to the interval [0,1] to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The map is the cumulative distribution function (CDF) of the instantaneous a priori SNR, as given by Equation (13) in [1]. The statistics for the CDF are computed over a sample of the training set. An example of the mean and standard deviation of the sample for each frequency bin is shown in Figure 3. The training examples in each mini-batch are padded to the longest sequence length in the mini-batch. The sequence mask is used by TensorFlow to ensure that the DNN is not trained on the padding. During inference, the a priori SNR estimate is computed from the mapped a priori SNR using the sample statistics and Equation (12) from [2].

Figure 2: A training example for Deep Xi. Generated using eval_example.m.

Figure 3: The normal distribution for each frequency bin is computed from the mean and standard deviation of the instantaneous a priori SNR (dB) over a sample of the training set. Generated using eval_stats.m.
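For reference, a minimal NumPy sketch of the map and its inverse, assuming the map is the CDF of a normal distribution fitted to the instantaneous a priori SNR (dB) of each frequency bin (mu and sigma are the per-bin sample statistics; the repository's implementation may differ in detail):

import numpy as np
from scipy.special import erf, erfinv

def xi_db_to_bar(xi_db, mu, sigma):
    # Map the instantaneous a priori SNR (dB) to [0, 1] via the normal CDF.
    return 0.5 * (1.0 + erf((xi_db - mu) / (sigma * np.sqrt(2.0))))

def xi_bar_to_db(xi_bar, mu, sigma):
    # Inverse map used at inference to recover the a priori SNR estimate (dB);
    # the linear-scale estimate is then 10**(xi_db/10).
    return mu + sigma * np.sqrt(2.0) * erfinv(2.0 * xi_bar - 1.0)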

Current networks

Configurations for the following networks can be found in run.sh.

  • MHANet: Multi-head attention network [6].
  • RDLNet: Residual-dense lattice network [3].
  • ResNet: Residual network [2].
  • ResLSTM & ResBiLSTM: Residual long short-term memory (LSTM) network and residual bidirectional LSTM (ResBiLSTM) network [1].

Deep Xi utilising the MHANet (Deep Xi-MHANet) was proposed in [6]. It utilises multi-head attention to efficiently model the long-range dependencies of noisy speech. Deep Xi-MHANet is shown in Figure 4. Deep Xi utilising a ResNet TCN (Deep Xi-ResNet) was proposed in [2]. It uses bottleneck residual blocks and a cyclic dilation rate. The network comprises approximately 2 million parameters and has a contextual field of approximately 8 seconds. Deep Xi utilising a ResLSTM network (Deep Xi-ResLSTM) was proposed in [1]. Each of its residual blocks contains a single LSTM cell. The network comprises approximately 10 million parameters.

Figure 4: (left) Deep Xi-MHANet from [6].

Available models

mhanet-1.1c (available in the model directory)

resnet-1.1n (available in the model directory)

resnet-1.1c (available in the model directory)

Each available model is trained using the Deep Xi dataset. Please see run.sh for more details about these networks.

There are multiple Deep Xi versions, comprising different networks and restrictions. An example of the ver naming convention is resnet-1.0c. The network type is given at the start of ver. Versions ending with c are causal, and versions ending with n are non-causal. The version iteration is also given, e.g. 1.0.
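For illustration, a small hypothetical helper that splits a ver string according to this convention:

def parse_ver(ver):
    # e.g. "resnet-1.0c" -> ("resnet", "1.0", True); "mhanet-1.1n" -> ("mhanet", "1.1", False).
    network, iteration = ver.rsplit('-', 1)
    causal = iteration.endswith('c')
    return network, iteration[:-1], causal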

Results

Note: Results for the Deep Xi framework in this repository are reported for TensorFlow 2/Keras. Results in the papers were obtained using TensorFlow 1. All future work will be completed in TensorFlow 2/Keras.

DEMAND Voice Bank test set

Objective scores obtained on the DEMAND Voicebank test set described here. Each Deep Xi model is trained on the DEMAND Voicebank training set. As in previous works, the objective scores are averaged over all tested conditions. CSIG, CBAK, and COVL are mean opinion score (MOS) predictors of the signal distortion, background-noise intrusiveness, and overall signal quality, respectively. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). The highest scores attained for each measure are indicated in boldface.

| Method | Gain | Causal | CSIG | CBAK | COVL | PESQ | STOI | SegSNR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy speech | -- | -- | 3.35 | 2.44 | 2.63 | 1.97 | 92 (91.5) | -- |
| Wiener | -- | Yes | 3.23 | 2.68 | 2.67 | 2.22 | -- | -- |
| SEGAN | -- | No | 3.48 | 2.94 | 2.80 | 2.16 | 93 | -- |
| WaveNet | -- | No | 3.62 | 3.23 | 2.98 | -- | -- | -- |
| MMSE-GAN | -- | No | 3.80 | 3.12 | 3.14 | 2.53 | 93 | -- |
| Deep Feature Loss | -- | Yes | 3.86 | 3.33 | 3.22 | -- | -- | -- |
| Metric-GAN | -- | No | 3.99 | 3.18 | 3.42 | 2.86 | -- | -- |
| Koizumi2020 | -- | No | 4.15 | 3.42 | 3.57 | 2.99 | -- | -- |
| T-GSA | -- | No | 4.18 | **3.59** | 3.62 | **3.06** | -- | -- |
| Deep Xi-ResLSTM (1.0c) | MMSE-LSA | Yes | 4.01 | 3.25 | 3.34 | 2.65 | 91 (91.0) | 8.2 |
| Deep Xi-ResNet (1.0c) | MMSE-LSA | Yes | 4.14 | 3.32 | 3.46 | 2.77 | 93 (93.2) | -- |
| Deep Xi-ResNet (1.0n) | MMSE-LSA | No | 4.28 | 3.46 | 3.64 | 2.95 | 94 (93.6) | -- |
| Deep Xi-ResNet (1.1c) | MMSE-LSA | Yes | 4.24 | 3.40 | 3.59 | 2.91 | 94 (93.5) | 8.4 |
| Deep Xi-ResNet (1.1n) | MMSE-LSA | No | **4.35** | 3.52 | **3.71** | 3.03 | **94 (94.1)** | **9.3** |
| Deep Xi-MHANet (1.0c) | MMSE-LSA | Yes | 4.15 | 3.37 | 3.48 | 2.77 | 93 (93.2) | 8.9 |
| Deep Xi-MHANet (1.1c) | MMSE-LSA | Yes | 4.34 | 3.49 | 3.69 | 2.99 | 94 (94.0) | 9.1 |

Deep Xi Test Set

Average objective scores obtained over the conditions in the test set of the Deep Xi dataset. Each Deep Xi model is trained on the training set of the Deep Xi dataset. Only SNR levels between -10 dB and 20 dB are considered. Results for each condition can be found in log/results.

| Method | Gain | Causal | CSIG | CBAK | COVL | PESQ | STOI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Deep Xi-ResNet (1.1c) | MMSE-STSA | Yes | 3.14 | 2.52 | 2.43 | 1.82 | 84.85 |
| Deep Xi-ResNet (1.1c) | MMSE-LSA | Yes | 3.15 | 2.55 | 2.46 | 1.85 | 84.72 |
| Deep Xi-ResNet (1.1c) | SRWF/IRM | Yes | 3.12 | 2.50 | 2.41 | 1.79 | 84.95 |
| Deep Xi-ResNet (1.1c) | cWF | Yes | 3.15 | 2.51 | 2.44 | 1.83 | 84.94 |
| Deep Xi-ResNet (1.1c) | WF | Yes | 2.66 | 2.46 | 2.12 | 1.69 | 83.02 |
| Deep Xi-ResNet (1.1c) | IBM | Yes | 1.36 | 2.16 | 1.26 | 1.30 | 77.57 |
| Deep Xi-ResNet (1.1n) | MMSE-LSA | No | 3.30 | 2.62 | 2.59 | 1.97 | 86.70 |
| Deep Xi-MHANet (1.1c) | MMSE-LSA | Yes | 3.45 | 2.75 | 2.73 | 2.08 | 87.11 |

DeepMMSE

DeepMMSE: A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation.

To save noise PSD estimate .mat files from DeepMMSE, please use the following:

./run.sh VER="mhanet-1.1c" INFER=1 GAIN="deepmmse"
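For context, DeepMMSE [2] estimates the noise periodogram from the a priori SNR estimate. Below is an illustrative NumPy sketch of an MMSE noise periodogram estimator of the form used in DeepMMSE, where xi_hat is the a priori SNR estimate and gamma_hat the a posteriori SNR estimate; this is a sketch only and may differ in detail from the repository's implementation:

import numpy as np

def mmse_noise_periodogram(noisy_psd, xi_hat, gamma_hat):
    # lambda_d_hat = |X|^2 * [ (1/(1+xi))^2 + xi / (gamma * (1+xi)) ]
    return noisy_psd * ((1.0 / (1.0 + xi_hat))**2
                        + xi_hat / (gamma_hat * (1.0 + xi_hat)))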

Installation

Prerequisites for GPU usage: a CUDA-capable GPU with the CUDA and cuDNN versions required by your TensorFlow 2 installation.

To install:

  1. git clone https://github.com/anicolson/DeepXi.git
  2. python3 -m venv --system-site-packages ~/venv/DeepXi
  3. source ~/venv/DeepXi/bin/activate
  4. cd DeepXi
  5. pip install -r requirements.txt

Alternatively, a Docker image is available on Docker Hub: https://hub.docker.com/r/fhoerst/deepxi

How to use Deep Xi

Use run.sh to configure and run Deep Xi. Look at config.sh to set the paths to the dataset, models, and outputs.

Inference: To perform inference and save the outputs, use the following:

./run.sh VER="mhanet-1.1c" INFER=1 GAIN="mmse-lsa"

Please look in thoth/args.py for available gain functions and run.sh for further options.
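As background, the mmse-lsa option corresponds to the Ephraim-Malah MMSE log-spectral amplitude gain, which is a function of the a priori and a posteriori SNR estimates. A minimal NumPy sketch is shown below (illustrative only; the repository's implementation may include additional numerical safeguards):

import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi_hat, gamma_hat):
    # G = (xi / (1 + xi)) * exp(0.5 * E1(v)), with v = gamma * xi / (1 + xi).
    v = (xi_hat / (1.0 + xi_hat)) * gamma_hat
    return (xi_hat / (1.0 + xi_hat)) * np.exp(0.5 * exp1(v))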

Testing: To perform testing and get objective scores, use the following:

./run.sh VER="mhanet-1.1c" TEST=1 GAIN="mmse-lsa"

Please look in log/results for the results.

Training:

./run.sh VER="mhanet-1.1c" TRAIN=1

Ensure that the data directory is deleted before training. This allows training lists and statistics for your training set to be saved and used. To resume training from a certain epoch, set --resume_epoch in run.sh to the desired epoch.

Current issues and potential areas of improvement

If you would like to contribute to Deep Xi, please investigate the following and compare it to current models:

  • Currently, the ResLSTM network is not performing as well as expected (when compared to TensorFlow 1.x performance).

Where can I get a dataset for Deep Xi?

Open-source training and testing sets are available for Deep Xi on IEEE DataPort:

[4] Deep Xi dataset (training, validation, and test set): http://dx.doi.org/10.21227/3adt-pb04.

[5] Test set from the original Deep Xi paper: http://dx.doi.org/10.21227/0ppr-yy46.

The MATLAB scripts used to generate these sets can be found in the set directory.

Which audio do I use with Deep Xi?

Deep Xi operates on mono/single-channel audio (not stereo/dual-channel audio). Single-channel audio is used because most cell phones use a single microphone. The available trained models operate on a sampling frequency of f_s=16000 Hz, which is currently the standard sampling frequency used in the speech enhancement community. The sampling frequency can be changed in run.sh. Deep Xi can be trained using a higher sampling frequency (e.g. f_s=44100 Hz), but this is unnecessary as human speech rarely exceeds 8 kHz (the Nyquist frequency of f_s=16000 Hz is 8 kHz). The available trained models operate on a window duration and shift of T_d=32 ms and T_s=16 ms, respectively. To train a model with a different window duration and shift, T_d and T_s can be changed in run.sh. Currently, Deep Xi supports the .wav, .mp3, and .flac audio codecs. The audio codec and bit rate do not affect the performance of Deep Xi.
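For convenience, a small helper (not part of the repository; assumes the librosa and soundfile packages are installed) to convert arbitrary audio to the mono, 16 kHz format expected by the available trained models:

import librosa
import soundfile as sf

def to_mono_16k(in_path, out_path, f_s=16000):
    # Downmix to mono and resample to the target sampling frequency.
    wav, _ = librosa.load(in_path, sr=f_s, mono=True)
    sf.write(out_path, wav, f_s)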

Naming convention in the set/ directory

The following is already configured in the Deep Xi dataset.

Training set

The filenames of the waveforms in the train_clean_speech and train_noise directories are not restricted, and the two directories may contain different numbers of waveforms. The Deep Xi framework utilises each of the waveforms in train_clean_speech once during an epoch. For each train_clean_speech waveform in a mini-batch, the Deep Xi framework selects a random section of a randomly selected waveform from train_noise (whose length is greater than or equal to that of the train_clean_speech waveform) and adds it to the train_clean_speech waveform at a randomly selected SNR level (the SNR level range can be set in run.sh).
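An illustrative NumPy sketch of this mixing step (a simplified stand-in, not the repository's implementation; the helper name and details are assumptions):

import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=np.random.default_rng()):
    # Select a random noise section of the same length as the clean waveform
    # (assumes len(noise) >= len(clean)) and add it at the requested SNR (dB).
    start = rng.integers(0, len(noise) - len(clean) + 1)
    d = noise[start:start + len(clean)]
    alpha = np.sqrt(np.mean(clean**2) / (np.mean(d**2) * 10.0**(snr_db / 10.0)))
    return clean + alpha * d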

Validation set

As the validation set must not change from epoch to epoch, a set of restrictions applies to the waveforms in val_clean_speech and val_noise. There must be the same number of waveforms in val_clean_speech and val_noise. One waveform in val_clean_speech corresponds to exactly one waveform in val_noise, i.e. a clean speech and noise validation waveform pair. Each clean speech and noise validation waveform pair must have identical filenames and an identical number of samples. Each clean speech and noise validation waveform pair must have the SNR level (dB) that they are to be mixed at placed at the end of their filenames. The convention used is _XdB, where X is replaced with the desired SNR level, e.g. val_clean_speech/NAME_-5dB.wav and val_noise/NAME_-5dB.wav. An example of the filenames for a clean speech and noise validation waveform pair is as follows: val_clean_speech/198_19-198-0003_Machinery17_15dB.wav and val_noise/198_19-198-0003_Machinery17_15dB.wav.
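A small, hypothetical helper that parses the SNR level (dB) from a filename following this _XdB convention:

import re
from pathlib import Path

def snr_from_filename(path):
    # e.g. "198_19-198-0003_Machinery17_15dB.wav" -> 15, "NAME_-5dB.wav" -> -5.
    m = re.search(r'_(-?\d+)dB$', Path(path).stem)
    if m is None:
        raise ValueError(f"No _XdB suffix in {path}")
    return int(m.group(1))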

Test set

The filenames of the waveforms in the test_noisy_speech directory are not restricted. This is all that is required if you want inference outputs from Deep Xi, i.e. ./run.sh VER="ANY_NAME" INFER=1. If you are obtaining objective scores by using ./run.sh VER="ANY_NAME" TEST=1, then reference waveforms for the objective measures need to be placed in test_clean_speech. The waveforms in test_clean_speech and test_noisy_speech that correspond to each other must have the same number of samples (i.e. the same sequence length). The filename of the waveform in test_clean_speech that corresponds to a waveform in test_noisy_speech must be contained in the corresponding test noisy speech waveform filename. E.g. if the filename of a test noisy speech waveform is test_noisy_speech/61-70968-0000_SIGNAL021_-5dB.wav, then the filename of the corresponding test clean speech waveform must be contained within it: test_clean_speech/61-70968-0000.wav. This is because a test clean speech waveform may be used as a reference for multiple waveforms in test_noisy_speech (e.g. test_noisy_speech/61-70968-0000_SIGNAL021_0dB.wav, test_noisy_speech/61-70968-0000_SIGNAL021_5dB.wav, and test_noisy_speech/61-70968-0000_SIGNAL021_10dB.wav all share the test clean speech waveform from the previous example as their reference).
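An illustrative, hypothetical helper that resolves the clean reference for a noisy test waveform under this convention:

from pathlib import Path

def find_reference(noisy_path, clean_dir):
    # The clean reference is the file whose stem is contained in the noisy filename,
    # e.g. 61-70968-0000.wav for 61-70968-0000_SIGNAL021_-5dB.wav.
    noisy_stem = Path(noisy_path).stem
    matches = [p for p in Path(clean_dir).glob('*.wav') if p.stem in noisy_stem]
    if not matches:
        raise FileNotFoundError(f"No clean reference found for {noisy_path}")
    return max(matches, key=lambda p: len(p.stem))  # longest match wins if ambiguous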

Citation guide

Please cite the following depending on what you are using:

  • The Deep Xi framework is proposed in [1].
  • If using Deep Xi-MHANet, please cite [1] and [6].
  • If using Deep Xi-ResLSTM, please cite [1].
  • If using Deep Xi-ResNet, please cite [1] and [2].
  • If using DeepMMSE, please cite [2].
  • If using Deep Xi-RDLNet, please cite [1] and [3].
  • If using Deep Xi dataset, please cite [4].
  • If using the Test Set From 10.1016/j.specom.2019.06.002, please cite [5].

[1] A. Nicolson, K. K. Paliwal, Deep learning for minimum mean-square error approaches to speech enhancement, Speech Communication 111 (2019) 44 - 55, https://doi.org/10.1016/j.specom.2019.06.002.

[2] Q. Zhang, A. M. Nicolson, M. Wang, K. Paliwal and C. Wang, "DeepMMSE: A Deep Learning Approach to MMSE-based Noise Power Spectral Density Estimation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1404-1415, 2020, doi: 10.1109/TASLP.2020.2987441.

[3] Mohammad Nikzad, Aaron Nicolson, Yongsheng Gao, Jun Zhou, Kuldip K. Paliwal, and Fanhua Shang. "Deep residual-dense lattice network for speech enhancement". In AAAI Conference on Artificial Intelligence, pages 8552–8559, 2020

[4] Aaron Nicolson, "Deep Xi dataset", IEEE Dataport, 2020. [Online]. Available: http://dx.doi.org/10.21227/3adt-pb04.

[5] Aaron Nicolson, "Test Set From 10.1016/j.specom.2019.06.002", IEEE Dataport, 2020. [Online]. Available: http://dx.doi.org/10.21227/0ppr-yy46.

[6] A. Nicolson, K. K. Paliwal, Masked multi-head self-attention for causal speech enhancement, Speech Communication 125 (2020) 80 - 96, https://doi.org/10.1016/j.specom.2019.06.002.

deepxi's People

Contributors

anicolson, yunzqq


deepxi's Issues

‘train_s_list’ not found.

Finding sample statistics...
Traceback (most recent call last):
  File "deepxi.py", line 244, in <module>
    args = get_stats(args, config)
  File "deepxi.py", line 95, in get_stats
    random.shuffle(args.train_s_list) # shuffle list.
AttributeError: 'Namespace' object has no attribute 'train_s_list'

Could you help me with this? Thanks.

Mean Sigmoid Cross Entropy Loss question

Hi Aaron.

Could you please help me understand the mean_sigmoid_cross_entropy loss implementation in DeepXi?

So the inputs are the noisy speech magnitude spectrum (mbatch[0]) and the predicted SNR (net.target_ph: mbatch[1]).

I am having difficulties figuring it out from the code. How are you calculating the loss between the features and the SNR prediction (as they seem to be different types of information, and even different shapes)?

Loss

Hi Aaron.

Could you please help me understand the loss and val error?

val_error_mbatch = sess.run(net.loss, feed_dict={net.input_ph: mbatch[0], net.target_ph: mbatch[1], net.nframes_ph: mbatch[2], net.training_ph: False}) # validation error for each frame in mini-batch.
So the mini-batch contains: x_MAG-fr (noisy), xi_bar (target xi), and L.

I'm unsure at which step the network is making a prediction about the SNR.
Or is it just learning how to map the target SNR to a defined value, with the predicted value applied at inference?

How to get my own mu.mat?

Hi, @anicolson ,
Thanks for your great work!
I want to train a DeepXi model using my own data now, but I don't know how to get mu.mat and sigma.mat based on my data. Could you give me some help?
Any advice will be appreciated!

Training on another noise dataset gives results worse than unprocessed noisy speech

Hi @anicolson!
I have been working with a custom noise dataset compiled from freesound.org, which has 7 types of noise, including babble, traffic, etc. I have 10 noise clips for each noise type for training and 8 different clips for testing. I'm using clean speech from the LibriSpeech dataset and the respective train, val, and test folders. I trained the network for 200 epochs after deleting the data directory and making the corresponding set/ folders as required. Yet I notice that all the metrics are lower for the trained model than for the unprocessed speech. (Initially, I had not deleted the data folder before training, but even after doing that and retraining, the results did not improve at all.) Here are the metrics:
Model,MOS-LQO,PESQ,STOI,eSTOI
Unprocessed,1.29,1.80,80.68,60.66
DeepXi,1.34,1.58,40.76,25.35
The unprocessed has been calculated in the same way as the DeepXi, by bypassing the model. Can you help me out with what has gone wrong?
Thanks in advance!

Is ResNet 3f causal?

Good job. As layer normalization is widely used in the ResNet 3f, I doubt that it's a causal network—‘future’ features are actually included. I've tried to remove all the layer normalizations and the results turn out to be much worse.
Is layer normalization dispensable?

Error loading pretrain model during inference

The running environment is Cuda 10.1, tensorflow 2.2. The error message is as follows:

Total params: 1,949,953
Trainable params: 1,949,953
Non-trainable params: 0


Traceback (most recent call last):
File "/root/sw/env4py36_deepxi/lib/python3.6/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model/resnet-1.1n/epoch-179/variables/variables

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 92, in
saved_data_path=args.saved_data_path,
File "/root/sw/DeepXi-master/deepxi/model.py", line 269, in infer
self.model.load_weights(model_path + '/epoch-' + str(e-1) +'/variables/variables' )
File "/root/sw/env4py36_deepxi/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
File "/root/sw/env4py36_deepxi/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1231, in load_weights
py_checkpoint_reader.NewCheckpointReader(filepath)
File "/root/sw/env4py36_deepxi/lib/python3.6/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
error_translator(e)
File "/root/sw/env4py36_deepxi/lib/python3.6/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model/resnet-1.1n/epoch-179/variables/variables

Can the MHANet run in real time?

Hi,

I am confused about whether the MHANet works in real time. From my understanding, the masked attention only matches the causal scenario and may not be applicable to real time.

Best Regards, looking forward to your reply.

AttributeError: 'Namespace' object has no attribute 'mu'

Hi ,

I am trying to run Inference by - python3 deepxi.py --infer 1 --out_type y --gain mmse-lsa --gpu 0
But I am getting the following:

Traceback (most recent call last):
File "deepxi.py", line 284, in
net = deepxi_net(args)
File "deepxi.py", line 87, in init
args.fs, self.P, args.nconst, args.mu, args.sigma) # feature graph.
AttributeError: 'Namespace' object has no attribute 'mu'

It also happens when I provide stats_path.

Error during Inference

Hi,

Thank you for your effort in DeepXi @anicolson .

I have been able to test it for inference with the files you provide, but when I put my own file to denoise, I got the following error:

The test_x list has a total of 2 entries.
Loading sample statistics from pickle file...
Preparing graph...
Inference...
  0%|                                                           | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[89] = 89 is not in [0, 89)
	 [[{{node boolean_mask/GatherV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deepxi.py", line 44, in <module>
    if args.infer: infer(sess, net, args)
  File "lib/dev/infer.py", line 36, in infer
    net.nframes_ph: input_feat[1], net.training_ph: False}) # output of network.
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[89] = 89 is not in [0, 89)
	 [[node boolean_mask/GatherV2 (defined at lib/dev/ResNet.py:88) ]]

Original stack trace for 'boolean_mask/GatherV2':
  File "deepxi.py", line 40, in <module>
    net = deepxi_net.deepxi_net(args)
  File "lib/dev/deepxi_net.py", line 32, in __init__
    d_model=args.d_model, d_f=args.d_f, k_size=args.k_size, max_d_rate=args.max_d_rate)
  File "lib/dev/ResNet.py", line 88, in ResNet
    if boolean_mask: blocks[-1] = tf.boolean_mask(blocks[-1], tf.sequence_mask(seq_len))
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 1386, in boolean_mask
    return _apply_mask_1d(tensor, mask, axis)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 1355, in _apply_mask_1d
    return gather(reshaped_tensor, indices, axis=axis)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 3475, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4097, in gather_v2
    batch_dims=batch_dims, name=name)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/betegon/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Do you know why is it happening?

This is what I did:

  1. Remove the file set/test_noisy_speech/FB_FB10_07_voice-babble_5dB.wav and add my own .wav file and call it like the old one: FB_FB10_07_voice-babble_5dB.wav.
  2. Do inference as I did before:
python3 deepxi.py --infer 1 --out_type y --gain mmse-lsa --ver '3f' --epoch 175 --gpu 0

Also, why does it look like there are two files to denoise ?

The test_x list has a total of 2 entries.
Loading sample statistics from pickle file...
Preparing graph...
Inference...
  0%|                                                           | 0/2 [00:01<?, ?it/s]

Loading flac files

Hi, thanks for the great work.

The code seems to show that it's possible to load flac files, but the module skips them during inference.

Is there a way to load flac files, or must I convert them all to wav?

What about the speech enhancement performance of DeepXi-ResBLSTM versus DeepXi-TCN?

As you mentioned in the README:
"The ResLSTM and ResBLSTM networks used for Deep Xi in [1] (Deep Xi-ResLSTM and Deep Xi-ResBLSTM) have been replaced with a residual network (ResNet) that employs a TCN."

Because there isn't any description of DeepXi-TCN in the paper "Deep learning for minimum mean-square error approaches to speech enhancement", how does the speech enhancement performance of DeepXi-ResBLSTM compare with DeepXi-TCN?

Thanks

What is the main difference between 3a version and 3d version?

Hi, anicolson:
I have downloaded your 3d version to train an enhancement model. Based on my data, the ASR performance of the 3d version is worse than that of the 3a version for the same epoch's model. I wonder what the main difference is between the 3a version and the 3d version, besides the automatic computation of the statistics.

How to get single value for xi_hat?

Thanks for your work, we would like to test whether your approach works better than what we are currently using to detect "good" audio. We are inferring with
deepxi.py --infer 1 --out_type xi_hat --gain mmse-lsa

and get the mat files containing the output arrays. How do we interpret this data or do you see an easy function to boil it down to a single value?

Additional questions

Hello again,

I am trying to reproduce the Deep Xi framework in PyTorch (TensorFlow is not so familiar to me) and have some questions.

  1. The Demand voicebank (valentini) dataset provides a training set in the form of (noisy, clean) pairs for each utterance.

When we subtract the clean from noisy, we can get the corresponding noise signal.

For the Demand voicebank dataset, did you use only those dataset pairs (as provided), or an additional clean or noise dataset?

In my previous question, you said that the noise recording used to corrupt the clean speech is randomly selected (this implies the noise recording should be longer than the clean speech).

If so, could you tell me what kind of additional noise recordings you used? And have you used additional clean speech other than that provided in the Demand voicebank dataset?

  2. In the training step, Deep Xi uses both the training set and the validation set.

As far as I know, the validation set is often used for early stopping. Is the validation set in the Deep Xi framework also used for this purpose?

Could you explain to me how the validation set was used?

Thank you!

What is the main consideration when choosing a Hamming window with periodic=False?

Hi, Aaron, thanks for your fantastic work.
I tried to change the window you used from
functools.partial(window_ops.hamming_window, periodic=False)
to
functools.partial(window_ops.hann_window, periodic=True)
Then, it seems that the convergence becomes slower than before.

By comparison, the Hamming window does a better job of cancelling the nearest side lobe, but a poorer job of cancelling any others. Is this the reason why you chose Hamming? Thanks.

Adding noise creates NaN values

Hi Aaron.

I'm experimenting with training.
mbatch[0] is getting filled with many NaN and zero values after mixing the clean and noise samples.

Disabling the addition of noise clears mbatch[0] of NaN values.

Is there some specific rule to follow when adding clean and noise data?
I'm experimenting with a small section of LibriSpeech and the Environmental Background Noise dataset, so the file lengths should be OK.

ValueError: Sample larger than population

I entered the command python3 deepxi.py --train 1 --verbose 1 --gpu 0 and got an error that I don't know how to solve:
bus id: 0000:01:00.0, compute capability: 6.1)
Creating epoch parameters, as no pickle file exists...
Traceback (most recent call last):
File "deepxi.py", line 263, in
mbatch_size_iter, train_clean_speech_mbatch_seq_len) # generate mini-batch of noise training waveforms.
File "../../../../lib\batch.py", line 53, in _noise_mbatch
mbatch_list = random.sample(noise_list, mbatch_size) # get mini-batch list from training list.
File "C:\Program Files\Anaconda3\lib\random.py", line 315, in sample
raise ValueError("Sample larger than population")
ValueError: Sample larger than population

Cannot feed value of shape (1, 9582560, 2) for Tensor 's_ph:0', which has shape '(?, ?)'

I'm trying to test denoising with my own file. After putting my .wav file in the test folder and running the inference command, I got these errors. How can I fix them? Thank you.
File "deepxi.py", line 44, in <module>
    if args.infer: infer(sess, net, args)
File "lib/dev/infer.py", line 34, in infer
    input_feat = sess.run(net.infer_feat, feed_dict={net.s_ph: [wav], net.s_len_ph: [j['seq_len']]}) # sample of training set.
File "/Users/minhhuypham/venv/DeepXi/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
File "/Users/minhhuypham/venv/DeepXi/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1149, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 9582560, 2) for Tensor 's_ph:0', which has shape '(?, ?)'

Can't find the pre-trained models 3f

I would just like to experiment with the pre-trained model that you have built; unfortunately, the page redirects to a site saying that it is not the page I am looking for.

Training on my own noise: error

Training...
WARNING:tensorflow:From /home/giuser/.local/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2019-10-19 13:24:47.622125: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Training E101 (ver=3d, gpu=, params=1.97555e+06)...
0%| | 0/8 [00:00<?, ?it/s]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
12%|████████████████ | 1/8 [00:16<01:54, 16.33s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
25%|████████████████████████████████ | 2/8 [00:16<01:09, 11.61s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
38%|████████████████████████████████████████████████ | 3/8 [00:17<00:41, 8.30s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
50%|████████████████████████████████████████████████████████████████ | 4/8 [00:18<00:24, 6.02s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
62%|████████████████████████████████████████████████████████████████████████████████ | 5/8 [00:18<00:13, 4.39s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
75%|████████████████████████████████████████████████████████████████████████████████████████████████ | 6/8 [00:19<00:06, 3.20s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
88%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 7/8 [00:19<00:02, 2.38s/it]Trainer and Loss [<tf.Operation 'adam_opt/Adam' type=NoOp>, <tf.Tensor 'Mean:0' shape=() dtype=float32>]
mbatch_err:::::::: nan
mbatch_count:::::::: 0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:19<00:00, 2.50s/it]
Validation error for E101...
0%| | 0/1 [00:00<?, ?it/s]100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.78s/it]
Traceback (most recent call last):
File "deepxi.py", line 256, in
if args.train: train(sess, net, args)
File "deepxi.py", line 200, in train
(epoch_comp, train_err/mbatch_count, val_error))
ZeroDivisionError: division by zero

Some questions

Hello, Thank you for sharing the good projects.
I have some questions.

  1. For the demand_voice_bank dataset, the described best model (in terms of PESQ) is resnet-1.1c (ResNetV2). Does ResNetV3 (which has a very slight difference from ResNetV2) give better or worse performance compared to ResNetV2? Have you ever tried to compare them?

  2. For the demand_voice_bank dataset, is the validation set (if it is used) fixed?
    As far as I understand, each utterance in the training set is mixed at a random SNR (in the range of [0 5 10 15]) with a fixed clean/noise pair for each epoch. Is that correct?

  3. I can get the mu/sigma for the CDF mapping function in data/stats.mat. Is it OK to use these parameters for the demand_voice_bank dataset, or do I need to calculate a new mu/sigma parameter set?

  4. In deepxi/gain, the function 'deepmmse' returns (1/(1+xi)) + (xi/(gamma(1+xi))).
    But is it right that it should be (1/(1+xi))^2 + (xi/(gamma(1+xi)))?

Thank you.

ValueError: No sample.npz file exists. (Inference)

I'm trying to run inference using the command ./run.sh VER="mhanet-1.1c" INFER=1 GAIN="mmse-lsa". I just want to use the pretrained model to operate on some noisy .wav files.

However, when I run this command, I get this error:

This workstation is not known.
Finding GPU/s...
1 total GPU/s.
Using GPU 0.
2023-06-15 10:36:40.737557: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-15 10:36:41.308076: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Arguments:
gpu 0
ver mhanet-1.1c
test_epoch 200
train False
infer True
test False
spect_dist False
prelim False
verbose False
network_type MHANetV3
inp_tgt_type MagXi
sd_snr_levels [-5, 0, 5, 10, 15]
mbatch_size 8
sample_size 1000
max_epochs 200
resume_epoch 0
save_model True
log_iter False
eval_example True
val_flag True
reset_inp_tgt False
reset_sample False
out_type y
gain mmse-lsa
model_path /home/ericl/deepxi/DeepXi/model
set_path set
log_path log
data_path /home/ericl/deepxi/data/input
test_x_path set/test_noisy_speech
test_s_path set/test_clean_speech
test_d_path set/test_noise
out_path /home/ericl/deepxi/data/output
saved_data_path None
min_snr -10
max_snr 20
snr_inter 1
f_s 16000
T_d 32
T_s 16
n_filters None
d_in None
d_out None
d_model 256
n_blocks 5
n_heads 8
d_b None
d_f None
d_ff None
k None
max_d_rate None
causal True
warmup_steps 40000
length None
m_1 None
centre None
scale None
unit_type None
loss_fnc BinaryCrossentropy
outp_act Sigmoid
max_len 2048
map_type DBNormalCDF
map_params [None, None]
2023-06-15 10:36:42.602692: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-06-15 10:36:42.710381: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Version: mhanet-1.1c.
Traceback (most recent call last):
  File "/home/ericl/deepxi/DeepXi/main.py", line 53, in <module>
    deepxi = DeepXi(
  File "/home/ericl/deepxi/DeepXi/deepxi/model.py", line 87, in __init__
    s_sample, d_sample, x_sample, wav_len = self.sample(sample_size, sample_dir)
  File "/home/ericl/deepxi/DeepXi/deepxi/model.py", line 464, in sample
    raise ValueError('No sample.npz file exists.')
ValueError: No sample.npz file exists.

Am I supposed to train the model from scratch just so I can use it for inference? How do I get the sample.npz file?

This is my config.sh:

#!/bin/bash

PROJ_DIR='deepxi'
NEGATIVE="-"

set -o noglob

## Use hostname or whoami to set the paths on your workstation.

... (other code)
  *) echo "This workstation is not known."
      LOG_PATH='log'
      SET_PATH='set'
      DATA_PATH='/home/ericl/deepxi/data/input'
      TEST_X_PATH='set/test_noisy_speech'
      TEST_S_PATH='set/test_clean_speech'
      TEST_D_PATH='set/test_noise'
      OUT_PATH='/home/ericl/deepxi/data/output'
      MODEL_PATH='/home/ericl/deepxi/DeepXi/model'
    ;;
  esac
  ;;
esac

get_free_gpu () {
  echo "Finding GPU/s..."
  if ! [ -x "$(command -v nvidia-smi)" ];
  then
    echo "nvidia-smi does not exist, using CPU instead."
    GPU=-1
  else
    NUM_GPU=$( nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | wc -l )
    echo "$NUM_GPU total GPU/s."
    while true
    do
      for (( GPU=0; GPU<$NUM_GPU; GPU++ ))
      do
        VAR1=$( nvidia-smi -i $GPU --query-gpu=pci.bus_id --format=csv,noheader )
        VAR2=$( nvidia-smi -i $GPU --query-compute-apps=gpu_bus_id --format=csv,noheader | head -n 1)
        if [ "$VAR1" != "$VAR2" ]
        then
          echo "Using GPU $GPU."
          return
        fi
      done
      echo 'Waiting for free GPU.'
      sleep 1m
    done
  fi
}

VER=0
TRAIN=0
INFER=0
TEST=0
OUT_TYPE='y'
GAIN='mmse-lsa'

for ARGUMENT in "$@"
do
    KEY=$(echo $ARGUMENT | cut -f1 -d=)
    VALUE=$(echo $ARGUMENT | cut -f2 -d=)
    case "$KEY" in
            VER)                 VER=${VALUE} ;;
            GPU)                 GPU=${VALUE} ;;
            TRAIN)               TRAIN=${VALUE} ;;
            INFER)               INFER=${VALUE} ;;
            TEST)                TEST=${VALUE} ;;
            OUT_TYPE)            OUT_TYPE=${VALUE} ;;
            GAIN)                GAIN=${VALUE} ;;
            *)
    esac
done

WAIT=0
if [ -z $GPU ]
then
    get_free_gpu $WAIT
    GPU=$?
fi

Question about data preparation & train

Anicolson, thanks for your great effort.

I used a Tesla P40 for training. However, I found that the training time is too long with mbatch=10.

I have some questions about the data preparation and training steps:

  1. Do you use all of the training set (74,250 utterances) in the training step?
  2. How long does one epoch take to train?
  3. Could you please share the device info used for training?

Thanks a lot!

Data loading suggestion

Hi @anicolson

I have a small suggestion for data loading.
Deep Xi assumes that all audio files are placed in one directory. It would be more practical if it searched subdirectories and loaded all files with allowed extensions.

I'm sure you are aware that this can easily be done with something like the following:

import glob
import os

extensions = ('.wav', '.mp3', '.flac')
paths = []

# With glob (recursive search through subdirectories):
for ext in extensions:
    paths.extend(glob.glob(os.path.join(file_dir, '**', '*' + ext), recursive=True))

# Or with os.walk:
for root, folders, files in os.walk(file_dir):
    for f in files:
        if f.lower().endswith(extensions):
            paths.append(os.path.join(root, f))

I am using both ways for experimentation and it performs as expected.

Best Regards

Get an error when running python3 deepxi.py --infer 1 --out_type y --gain mmse-lsa --gpu 0

Hi anicolson,
It seems the latest version has a problem when only the command python3 deepxi.py --infer 1 --out_type y --gain mmse-lsa --gpu 0 is run.
train_s_list cannot be found. It seems I must run the run.sh script. Is it possible to only perform inference?

File "deepxi.py", line 254, in
args = get_stats(args, config)
File "deepxi.py", line 95, in get_stats
random.shuffle(args.train_s_list) # shuffle list.
AttributeError: 'Namespace' object has no attribute 'train_s_list'

Running Inference/Testing on Multiple GPUs

Hi @anicolson!
I have trained a model on my custom noise dataset and have been trying to run inference and testing. When testing, I observe that after loading the files for inference, I encounter an OOM error. Although I was able to train with 1 GPU with about ~10 GB of memory, testing requires more. (Is this expected behaviour?) I have access to multiple GPUs, so can you tell me if and how I can use multiple GPUs for testing? This is not immediately obvious from run.sh. Also, can you elaborate on how, by just specifying the version, your code selects the best iteration?
Thanks in advance!

Denoise Live Microphone Feed

Hi,

Thank you for open sourcing such awesome work.

I'm trying to deploy your model for live de-noising of microphone input, using the default parameters and the pretrained resnet-1c model. But the output is cancelling out the speech and adding loud artefacts.

Please suggest whether this model can be used at all for live processing. If yes, what do I need to do in terms of pre/post-processing?

Training Error?

While following all the training steps, I encountered the following training error again and again, and I cannot tell where it is coming from.
Could you please explain the error and how to resolve it?

E1: 100.0% (train err 109.10), E0 val err: inf, 3a, GPU:0.
Traceback (most recent call last):
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Need minval < maxval, got 0 >= -39117
[[{{node map/while/random_uniform}}]]
[[map/while/Slice_1/size/_2979]]
(1) Invalid argument: Need minval < maxval, got 0 >= -39117
[[{{node map/while/random_uniform}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "deepxi.py", line 290, in
if args.train: train(sess, net, args)
File "deepxi.py", line 195, in train
net.s_len_ph: args.val_s_len[start_idx:end_idx], net.d_len_ph: args.val_d_len[start_idx:end_idx], net.snr_ph: args.val_snr[start_idx:end_idx]}) # mini-batch.
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Need minval < maxval, got 0 >= -39117
[[node map/while/random_uniform (defined at lib/feat.py:434) ]]
[[map/while/Slice_1/size/_2979]]
(1) Invalid argument: Need minval < maxval, got 0 >= -39117
[[node map/while/random_uniform (defined at lib/feat.py:434) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'map/while/random_uniform':
File "deepxi.py", line 284, in
net = deepxi_net(args)
File "deepxi.py", line 86, in init
self.feature = feat.xi_mapped(self.s_ph, self.d_ph, self.s_len_ph, self.d_len_ph, self.snr_ph, args.Nw, args.Ns, args.NFFT, args.fs, self.P, args.nconst, self.mu, self.sigma) # feature graph.
File "lib/feat.py", line 43, in xi_mapped
P, nconst), (s, d, s_len, d_len, Q), dtype=(tf.float32, tf.float32, tf.float32)) # padded waveforms.
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/map_fn.py", line 268, in map_fn
maximum_iterations=n)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3501, in while_loop
return_same_structure)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3012, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2937, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3456, in
body = lambda i, lv: (i + 1, orig_body(*lv))
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/map_fn.py", line 257, in compute
packed_fn_values = fn(packed_values)
File "lib/feat.py", line 43, in
P, nconst), (s, d, s_len, d_len, Q), dtype=(tf.float32, tf.float32, tf.float32)) # padded waveforms.
File "lib/feat.py", line 410, in addnoisepad
(y, d) = addnoise(x, d, Q) # compute noisy waveform.
File "lib/feat.py", line 434, in addnoise
i = tf.random_uniform([1], 0, tf.add(1, tf.subtract(d_len, x_len)), tf.int32)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/random_ops.py", line 245, in random_uniform
shape, minval, maxval, seed=seed1, seed2=seed2, name=name)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 919, in random_uniform_int
seed=seed, seed2=seed2, name=name)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/home/paperspace/anaconda3/envs/tensorflow_gpuenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in init
self._traceback = tf_stack.extract_stack()


Resolve the distortion caused by integer overflow

When the data in the wav files are very large, close to -32768 or 32767, the data after speech enhancement are likely to overflow, resulting in distortion. I found two places to correct this issue and I hope it works for you.

  1. utils.py: change 32768.0 to 32767.0, since +32768 becomes -32768. (More precisely, use 32767.0 for positive values and 32768.0 for negative values.)
def save_wav(save_path, f_s, wav):
	if isinstance(wav[0], np.float32): wav = np.asarray(np.multiply(wav, 32767.0), dtype=np.int16)
	wav_write(save_path, f_s, wav)
  2. infer.py: add y = np.clip(y, -1, 1) before writing the file, since wav data cannot exceed 1.0.
elif args.out_type == 'y':
	y_MAG = np.multiply(input_feat[0], gain.gfunc(xi_hat, xi_hat+1, gtype=args.gain))
	y = np.squeeze(sess.run(net.y, feed_dict={net.y_MAG_ph: y_MAG, net.x_PHA_ph: input_feat[2], net.nframes_ph: input_feat[1], net.training_ph: False})) # output of network.
	y = np.clip(y,-1,1)
	if np.isnan(y).any(): ValueError('NaN values found in enhanced speech.')
	if np.isinf(y).any(): ValueError('Inf values found in enhanced speech.')
	utils.save_wav(args.out_path + '/' + file_name + '.wav', args.f_s, y)

How to run version resnet-1.0n?

I tried to run the command with version resnet-1.0n but it responded with this:
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on model/resnet-1.0n/epoch-179/variables/variables: Not found: model/resnet-1.0n/epoch-179/variables; No such file or directory

training error information

Like this:
E1: 98% (train err 160.54) E0 val err: inf, 3a, GPU:1.
Why is the E0 validation error inf?

And then some error information is dumped out:
2019-07-15 11:40:25.994360: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at tensor_array_ops.cc:497 : Invalid argument: TensorArray map/TensorArray_4_7605: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: . It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
......

Errors during training

When I put 100 noise recordings in the training noise folder and run python deepxi.py --train 1 --ver 'ANY_NAME' --gpu 0, the code runs normally. But when I put 2000 noise recordings in the training noise folder, the code reports an error when the second epoch ends. The error is as follows:

wrong

Multi-GPU training

Hello. It's great to see Deep Xi in TF v2 :)

Is multi-GPU training supported in utils/gpu_config? Can we pass more than one GPU value as a run.sh argument?
I am trying to implement distributed training with Keras, but it seems to conflict with the GPU-finding process in run.sh.

Inference denoise example

Hi @anicolson ,

I would like to know if you could describe, in a more detailed way, how to do inference with the pre-trained weights provided (3f). I wrote about how to achieve inference in an already closed issue, #19, but there are still gaps to fill.

This will allow people to give it a quick try at the repo and keep digging in.

What about the characteristics of the input audio used for inference?
1. Should it have a specific sample rate, e.g. 16000 Hz?
2. Should it have a specific bit rate, e.g. 256 kbps?

Is mmse-lsa the best gain function to use?
As per the results table in the README it looks like it, but that result also uses ResNet 3e; what about 3f?

What should be the expected result of the inference if performed correctly? Could you provide a test audio file and a result audio file using the weights of 3f at 175 epochs + mmse-lsa?
This way, we could know whether we have things set up correctly. Personally, I have tried to infer some audio files and I am not sure whether the result is as expected, so this would be a good way to check.

Also, I would like to offer my help (if you allow people to contribute) by adding a utils script to let users input any kind of audio and convert it into the optimal format for the network, once these questions are answered.

Thank you so much for your hard work and effort,

Kind regards.

Is ResNet 3f causal? A question about causal convolution.

Hi, thanks for the code sharing.

In CausalDilatedConv1d, zeros are concatenated on the left of x to enable dilated convolution. But in the convolution, data to the right of the current frame, that is, future data, are used. I think only the data to the left of the current frame, i.e. history data, should be convolved. Can you help me understand how causal convolution is achieved here?

Thank you!

def CausalDilatedConv1d(x, d_f, k_size, d_rate=1, use_bias=True):
    if k_size > 1: # padding for causality.
        x_shape = tf.shape(x)
        x = tf.concat([tf.zeros([x_shape[0], (k_size - 1)*d_rate, x_shape[2]]), x], 1)
    return tf.layers.conv1d(x, d_f, k_size, dilation_rate=d_rate, activation=None, padding='valid', use_bias=use_bias)

Can you share the ResLSTM codes?

Thank you for such a fantastic work.

I am also interested in the ResLSTM and want to train it myself. Could you share the code with me? I'd appreciate any help. By the way, what is the RDLNet? Does it give a big improvement in WER? When will it be released?

Thank you!

'Namespace' object has no attribute 'train_s_list'

Sir, when I put data in the set folder and then run the file "run.m", I got these errors. How can I fix them?

(A series of numpy FutureWarning messages from tensorflow/python/framework/dtypes.py and tensorboard/compat/tensorflow_stub/dtypes.py, all of the form "Passing (type, 1) or '1type' as a synonym of type is deprecated", is omitted here; they are deprecation warnings, not the error itself.)
Creating test_x list, as no pickle file exists...
The test_x list has a total of 1 entries.
Finding sample statistics...
Traceback (most recent call last):
File "deepxi.py", line 37, in
args = get_stats(args.data_path, args, config)
File "lib/dev/sample_stats.py", line 28, in get_stats
random.shuffle(args.train_s_list) # shuffle list.
AttributeError: 'Namespace' object has no attribute 'train_s_list'

How to decrease the distortion of enhanced speech?

Hi, anicolson
I obtained my own enhanced speech using your project and can observe the improvement in the enhanced speech. Now I want to use the enhanced speech for ASR, but I find that the ASR performance on the enhanced speech is not better than on the noisy speech. How should I adjust the parameters or change the code to decrease the distortion of the enhanced speech? Could you give some directions about it?
Thank you very much!
Thank you very much!

Some questions about the data and the training process

Hi,
First of all let me express my appreciation for this project. Very interesting.
I have some questions regarding the data and the training process:

  1. I see that most of the clean speech utterances in train-clean-100 and in VCTK are shorter than the receptive field of the network. Does this mean that these specific databases don't exploit the full potential of the network?

  2. Under the assumption that the network inference is causal (up to one buffer), does this mean that the SNR estimation in the first seconds is expected to be less accurate due to the short effective receptive field? I mean that we don't have enough information about the past, similar to a linear filter that has some "delay time" even if it is minimum phase.

  3. I noticed that when I use short clean speech files (up to 8 s) for training, this "delay time" is shorter than when I use longer files (up to ~20 s). Is this expected? The files in both cases are from the same databases, but in the first case I just split them into shorter parts.

  4. Regarding file normalization: I noticed that you normalize the noise files according to |s|/(|d|*SNR). How should sparse noise files be normalized? The problem is that |d| depends strongly on the sparsity of the signal. For example, the RMS of a noise file containing a slamming door every 3 seconds is much lower than the RMS of a constant noise (fan, car, airplane), because in the first case most of the file is silence. I'm asking because all of the sparse noise files end up mixed at an effectively higher SNR due to the very low value of |d| (see the sketch after this list).

  5. I saw that the databases in the link contain only a single speaker in each file. If I'd like to test cases with several speakers (not simultaneously) in the same file, similar to a real-life conversation, does that mean I need to train the network with several speakers in each file as well?
    In other words, does the network learn speaker-specific characteristics, or more general characteristics of the human voice?
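Regarding question 4, a minimal numpy sketch of that kind of RMS-based scaling (a hypothetical helper, scale_noise_to_snr, not the repository's implementation) makes explicit why a mostly-silent noise file receives a much larger scaling factor for the same target SNR:

import numpy as np

# Hypothetical helper (not the repository code): scale the noise d so that
# 10*log10(P_s / P_d) equals snr_db, with the powers computed over the whole
# file. A mostly-silent noise has a low P_d, so it is scaled up more than a
# constant noise for the same target SNR.
def scale_noise_to_snr(s, d, snr_db):
    p_s = np.mean(np.asarray(s, dtype=np.float64) ** 2)
    p_d = np.mean(np.asarray(d, dtype=np.float64) ** 2)
    alpha = np.sqrt(p_s / (p_d * 10.0 ** (snr_db / 10.0)))
    return alpha * d

One possible workaround (just a suggestion) would be to compute p_d only over the active segments of a sparse noise, for example with a simple energy threshold.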

Thank you
Ahikam

Understanding the loss implementation

Hi @anicolson .

Please help me understand how binary crossentropy loss is used to train Deep Xi.
I understand the idea and the implementation, but I cannot fully comprehend the theory behind the loss.

As binary cross-entropy implies a kind of classification task, how can we eventually get the right prediction?
I do not understand how the mapping is done between the input features (magnitude spectra) and the targets (mapped SNRs).

Also, I have read the original paper and couldn't find an answer there.
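For context, binary cross-entropy is also well defined when the target is a real value in [0, 1] rather than a hard 0/1 label, and for a fixed target it is minimised when the prediction equals that target, which is what makes it usable for regressing the mapped a priori SNR. A small numerical check (not from the repository):

import numpy as np

# For a fixed soft target t in [0, 1], the binary cross-entropy
# L(p) = -(t*log(p) + (1 - t)*log(1 - p)) is minimised at p = t.
t = 0.3  # e.g. one mapped a priori SNR value
p = np.linspace(1e-6, 1.0 - 1e-6, 100001)
loss = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
print(p[np.argmin(loss)])  # ~0.3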
