jtkim-kaist / vad
Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.
I cannot download the dataset. The link is unreachable. Can you share the dataset at another link?
I recently searched the term "VAD" on GitHub and found many abandoned projects, some with a decent amount of traction, but mostly lacking pre-trained models.
So I decided to share our new pre-trained VAD:
I am wondering how to create the training dataset. I have understood the format but don't know how to create one. Do we have to manually annotate the training dataset? Manual annotation would be difficult; is there any utility that can create approximate training data which can then be refined manually?
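One way to get approximate labels for later manual refinement is a simple energy threshold. A minimal sketch of my own (not a utility from this repo; the 400/160-sample frame sizes assume 16 kHz audio with 25 ms windows and 10 ms hops):

import numpy as np
import scipy.io.wavfile as wav

def rough_labels(wav_path, frame_len=400, frame_step=160, ratio=0.5):
    """Frame-level 0/1 labels from a crude energy threshold, to be refined by hand."""
    sr, x = wav.read(wav_path)
    x = x.astype(np.float64)
    n_frames = max(0, 1 + (len(x) - frame_len) // frame_step)
    energy = np.array([np.mean(x[i * frame_step:i * frame_step + frame_len] ** 2)
                       for i in range(n_frames)])
    return (energy > ratio * energy.mean()).astype(int)  # 1 = probably speech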
Is there a way in which Octave can be used instead of Matlab for running the main.m file?
When I try to run it using Octave I get:
warning: called from
main at line 5 column 1
warning: function ./lib/matlab/stft/example.m shadows a core library function
warning: called from
main at line 5 column 1
MRCG extraction ...
ans = 1
ans = 1
warning: load: '/home/vlad/Downloads/VAD/lib/matlab/MRCG_features/f_af_bf_cf.mat' found by searching load path
warning: called from
loudness at line 12 column 1
gammatone at line 35 column 10
MRCG_features at line 7 column 3
mrcg_extract at line 26 column 14
vad_func at line 7 column 30
main at line 21 column 12
.........................
error: fftfilt: B must be a vector
error: execution exception in main.m
I have not been able to understand how the training data should be specified.
For example, how should the labels be written? Do we need to specify the times at which
labels occur in the sound file? If yes, how and where?
Hi Kim,
I hope you had a good weekend! When you have a minute, would you please take a look at my questions about your code and let me know what you think? Here they are:
I ran your DNN code with my own TIMIT data and changed the number of iterations to get more data, as shown in the attached VAD_DNN.py. However, the validation accuracy is higher than the training accuracy, which is very strange, and the accuracy and cost on the training data fluctuate dramatically. For example, at the last point the training accuracy is 47% while the validation accuracy is 96%. Could you explain this?
I also trained my own DNN without using the feature extraction and got the opposite result: the accuracy and cost on the validation data fluctuate dramatically. It seems that there is an overfitting problem. The dropout rate is already 0.7. Could you give me some advice?
Hi,
I am a DSP engineer and acoustician. I'm new to the field of machine learning and neural networks, and I am learning a lot working on your code :).
I am trying to understand the feasibility of squeezing your VAD graph architecture to run in real time. Right now I am able to retrain the neural net, modifying a few parameters (mainly the frame window and overlap), using the D2 dataset from your paper. However, it looks like the best ways to improve the graph computation are:
I've tried to perform this training multiple times for different batch sizes (2048, 1024, 512, ...) and have failed so far in this quest. Sometimes the training accuracy reaches a high value, but it never generalizes to the test data.
I believe something is wrong with the training parameters, and that's why I am contacting you for guidelines. Did you ever try to train the neural net with smaller batch sizes? Should something be changed in the net architecture or in the training parameters?
What is the recommended relative size of the audio to be tested versus the batch size? More plainly: may I test a graph trained with a large batch size on a very small audio sample? My intuition is that it is not possible. That's why I would like to retrain the net with a smaller batch size, so that I could perform sequential tests with an audio buffer of reasonable size (say, 500 ms).
Finally, do you have any recommendations for simplifying the network's computational complexity without sacrificing too much performance?
I appreciate your time in sharing your knowledge and experience,
Cheers,
Lucas
Hello Kim.
Using the MATLAB MRCG extraction proved to take a lot of RAM.
I cannot use a file longer than 5 minutes as input on a 16 GB RAM system.
In the paper it was written that a very long sound file was used for training, so how could you compute the MRCG for it?
Thanks, Ravid.
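One common workaround (a sketch of my own, not something from the repo) is to extract features chunk by chunk and concatenate the frames. Frames near chunk borders will differ slightly from a whole-file pass; a small overlap can hide that if it matters:

import numpy as np

def chunked_features(x, sr, extract, chunk_seconds=60.0):
    """Run `extract(chunk, sr)` (e.g. an MRCG extractor) on fixed-size chunks."""
    chunk = int(sr * chunk_seconds)
    feats = [extract(x[i:i + chunk], sr) for i in range(0, len(x), chunk)]
    return np.concatenate(feats, axis=0)  # frame-level features for the whole file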
Hi @jtkim-kaist ,
Thanks for your great project, it helps me much.
I have several questions about this project. Could you help me? Thank you in advance.
1.1 In this image (TIMIT test data, run with your saved model), the output probability is still very high from 1.4 s to 1.5 s. Is that right? I think the probability should be reduced in this period.
1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s the interval is >= 1 frame length, but all of it is labeled as speech. Is that right?
Thanks
Jinhong
Hi everyone,
I added 1 s of silence before and after each utterance from the TIMIT dataset; however, the ACAM and bDNN models couldn't learn from the training data. Instead, they simply predict all samples as 1 (as shown in the following pictures).
The training data is sampled at 16000 Hz and labeled according to the .phn descriptions (see also the following pictures). Does anyone have ideas on how to fix this?
examples of training data: https://mcgill-my.sharepoint.com/:u:/g/personal/yifei_zhao_mail_mcgill_ca/EV0mKeH4U7BFpW_ZmyRBQZQBCDSP0quq4rgVsX0CtNlXfw?e=LfW8oJ
Thx!!!
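For reference, a minimal sketch of deriving sample-level labels from a TIMIT .phn file (assuming the usual "start end phone" lines with sample indices, and treating h#, pau, and epi as non-speech; adjust to your own convention):

import numpy as np

SILENCE = {'h#', 'pau', 'epi'}  # assumed TIMIT silence/pause markers

def phn_to_labels(phn_path, num_samples):
    labels = np.zeros(num_samples, dtype=int)
    with open(phn_path) as f:
        for line in f:
            start, end, phone = line.split()
            if phone not in SILENCE:
                labels[int(start):int(end)] = 1  # mark speech samples
    return labels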
Hello
Is the model compatible with TensorFlow Lite?
I tried converting it but received the following warning:
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If you have a custom implementation for them you can disable this error with --allow_custom_ops. Here is a list of operators for which you will need custom implementations: Enter, Exit, FLOOR, LoopCond, RandomUniform, Range, TensorArrayGatherV3, TensorArrayReadV3, TensorArrayScatterV3, TensorArraySizeV3, TensorArrayV3, TensorArrayWriteV3, TensorFlowLess, TensorFlowMerge, TensorFlowSwitch.
Are you planning to support it in the future?
Thanks.
Can anyone give me the steps to test the pre-trained model?
Hi Kim,
Apologies for disturbing you so many times, but I have trouble understanding your normalization code. I found this code in acoustic_feat_ex.m:
%% Save global normalization factor
global_mean = train_mean / length(audio_list);
global_std = train_std / length(audio_list);
save([save_dir, '/global_normalize_factor'], 'global_mean', 'global_std');
and this in every data_reader_XXX.py:
norm_param = sio.loadmat(self._norm_dir+'/global_normalize_factor.mat')
self.train_mean = norm_param['global_mean']
self.train_std = norm_param['global_std']
My questions are:
1. Why is a single global factor computed in acoustic_feat_ex.m? Why not calculate a factor for every single training file and apply normalization per file?
2. The global factors loaded in data_reader_XXX.py are also used during prediction. Is this a mistake?
Thanks in advance!
Charlie Jiang
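As far as I can tell from the snippets above, the intent is that every utterance is normalized with the same global statistics at both training and prediction time. A minimal sketch of how the saved factors would be applied (the file paths and feature array here are hypothetical):

import numpy as np
import scipy.io as sio

norm = sio.loadmat('global_normalize_factor.mat')   # hypothetical path
mean, std = norm['global_mean'], norm['global_std']
features = np.load('utterance_features.npy')        # hypothetical (frames, dims) array
normalized = (features - mean) / std                # same factors for every utterance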
Hi dear,
I ran your repository, but it showed some errors.
I found there is no data under the data/train/feat/ directory.
Can you help me?
Thanks
weizhe
Hi, Kim. I'm just a beginner, and I met some problems when running the code.
In the file "data_reader_DNN_v2", in next_batch, self._input_spec_list[self._num_file] raises an IndexError: list index out of range when I run it.
I think it is because there is no .txt file in VAD/sample_data.
I don't know if this is true; hoping for your answer.
Hi.
I'm trying to rewrite this project in C++ in search of better interoperability, better user-friendliness, and better performance.
I have now successfully implemented the MRCG extraction and got a huge performance boost as well as much smaller memory usage. However, I have some trouble understanding the script that does the prediction. This script involves a lot of array allocation, and I want to know the purpose of every single line in order to write a better implementation.
So, could you please kindly give an explanation of the bdnn_transform function?
import numpy as np

def bdnn_transform(inputs, w, u):
    """
    :param inputs: shape = (batch_size, feature_size)
    :param w: decides neighbors (half-window size)
    :param u: decides neighbors (subsampling step)
    :return: trans_inputs, shape = (batch_size, feature_size*len(neighbors))
    """
    # Frame offsets: every u-th frame out to +/-w, plus the immediate context {-1, 0, 1}.
    neighbors_1 = np.arange(-w, -u, u)
    neighbors_2 = np.array([-1, 0, 1])
    neighbors_3 = np.arange(1 + u, w + 1, u)
    neighbors = np.concatenate((neighbors_1, neighbors_2, neighbors_3), axis=0)

    # Zero-pad the tail so frames shifted past either end read zeros instead of wrapping.
    pad_size = 2 * w + inputs.shape[0]
    pad_inputs = np.zeros((pad_size, inputs.shape[1]))
    pad_inputs[0:inputs.shape[0], :] = inputs

    # For each offset, shift the padded frames and keep the original length,
    # giving one "neighbor view" of the batch per offset.
    trans_inputs = [
        np.roll(pad_inputs, -1 * neighbors[i], axis=0)[0:inputs.shape[0], :]
        for i in range(neighbors.shape[0])]

    # Stack to (num_neighbors, batch, feat), reorder to (batch, num_neighbors, feat),
    # then flatten each frame's neighbors into one long feature vector.
    trans_inputs = np.asarray(trans_inputs)
    trans_inputs = np.transpose(trans_inputs, [1, 0, 2])
    trans_inputs = np.reshape(trans_inputs, (trans_inputs.shape[0], -1))
    return trans_inputs
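For what it's worth, a quick shape check shows what the function produces (assuming the function above is in scope; w=19 and u=9 are just example values, and the 768-dim feature size is hypothetical):

import numpy as np

x = np.random.randn(100, 768)   # 100 frames of hypothetical 768-dim features
y = bdnn_transform(x, w=19, u=9)
# neighbors = [-19, -10, -1, 0, 1, 10, 19] -> 7 shifted copies of each frame
print(y.shape)                  # (100, 5376) == (100, 768 * 7)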
Thanks in advance.
Hi,
See you again. How is everything going?
Could you please supply the trained model? I just want to run a test.
Also, could the model strip the silence? Or how could I achieve that?
Thx
Can anyone share the noisy data, General Series 6000 Combo and NOISEX-92?
Given that public open source projects get free support from cloud CI services, consider leveraging one of them here. 🎉
https://blogs.mathworks.com/developer/2020/12/15/cloud-ci-services/
Firstly, thanks for your contribution to the VAD field! I have two questions:
Traceback (most recent call last):
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in
globals = debugger.run(setup['file'], None, None, is_module)
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/VAD_test.py", line 62, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/data_reader_bDNN_v2.py", line 94, in next_batch
self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range
---> There is no .txt file in input_dir, so when the 0th index was accessed, an exception occurred.
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))
def next_batch(self, batch_size):
    if self._start_idx == self._w:
        self._inputs = self._padding(
            self._read_input(self._input_file_list[self._num_file],
                             self._input_spec_list[self._num_file]), batch_size, self._w)
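A minimal defensive check (my suggestion, not code from the repo) would fail with a clear message instead of an IndexError when the directory has no .txt files:

import glob

input_dir = './sample_data'  # hypothetical data directory
input_spec_list = sorted(glob.glob(input_dir + '/*.txt'))
if not input_spec_list:
    raise FileNotFoundError(
        'No .txt label files found in %s; prepare them before testing.' % input_dir)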
Hi,
Do you have a readme file for train.py? I want to train the model on my own data and was hoping for some help with its command-line arguments.
Thanks.
Traceback (most recent call last):
File "./lib/python/VAD_test.py", line 60, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/rajeev/pb/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/rajeev/pb/VAD/lib/python/data_reader_bDNN_v2.py", line 144, in next_batch
inputs = utils.bdnn_transform(inputs, self._w, self._u)
AttributeError: module 'lib.matlab_py.utils' has no attribute 'bdnn_transform'
The utils file does not contain any function named bdnn_transform.
Any help?
Hello!
I tried to use the python implementation to detect voice for the first 100s from this video:
https://www.youtube.com/watch?v=gYdHyeo0eec
And these are the results on the spectrogram:
First of all, why are there positive results during the first 47 seconds? Is it just the model not being trained to disregard music?
And secondly, is there a way to merge results together whenever it detects voice, so that there won't be intervals of just fractions of a second one after another?
Thanks very much in advance!
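On the second question, a simple post-processing pass can merge frame-level decisions into segments. A minimal sketch of my own (not part of the toolkit), assuming binary frame decisions at a 10 ms hop:

import numpy as np

def merge_decisions(frames, hop_s=0.010, min_gap_s=0.3, min_seg_s=0.1):
    """Turn 0/1 frame decisions into (start, end) second pairs, closing short gaps."""
    segs, start = [], None
    for i, v in enumerate(np.append(frames, 0)):  # trailing 0 flushes the last segment
        if v and start is None:
            start = i
        elif not v and start is not None:
            segs.append([start * hop_s, i * hop_s])
            start = None
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] < min_gap_s:
            merged[-1][1] = s[1]      # close a gap shorter than min_gap_s
        else:
            merged.append(s)
    return [(a, b) for a, b in merged if b - a >= min_seg_s]  # drop tiny blips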
Could you give the details about training the NN?
As per your comment in one of the closed issues, you mentioned that you concatenate different sound effects to make one long sound wave containing noises, and then pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise wave.
But the datamake script you have uploaded in the speech enhancement toolkit does something different: it picks random intervals from the long concatenated noise wave and mixes them with different sound files.
So in the second case, one speech utterance doesn't get added to the whole of the long concatenated noise; instead, a random interval of the long concatenated noise gets mixed with each sound file.
Can you explain why you took the first approach to create the dataset for training the VAD model? And second, how can I do the same thing you are doing? Should I use FaNT, or does your make_train_noisy.m have options to do so?
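Either way, the core operation is mixing speech into noise at a target SNR. A generic sketch (my own, not the repo's make_train_noisy.m; FaNT does the same job):

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, then add them.

    Assumes float arrays of equal length and sample rate.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Want p_speech / (scale**2 * p_noise) == 10**(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise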
Since there are many issues related to the py branch, I am wondering whether anyone could offer a usage example of VAD_test.py (plus some detailed explanation would be nice)?
Thanks.
When I run 'main.m' in MATLAB, an error occurs:
Traceback (most recent call last):
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
IndexError: list index out of range
Then I try to debug the 'VAD_test.py' with PyCharm:
Traceback (most recent call last):
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/graph_test.py", line 195, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/data_reader_DNN_v2.py", line 81, in next_batch
self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range
I find that 'self._input_spec_list' is empty, and it may be caused by this:
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))
I can't find any '.txt' file in the project; is it missing?
Does anyone have a Mandarin dataset with VAD annotation?
I have prepared my data similarly to the data used with this model, with a 1 s duration for each file and a 16000 Hz sample rate, and put each sound file into its own folder.
Now I am facing this error when building and training the model: (InvalidArgumentError: Dimension -31998 must be >= 0
[[{{node zeros}}]] [Op:IteratorGetNext])
Can you help me?
Thanks for the code repo!
Q1. Could you please explain the meaning of the following functions in VAD/utils.py?
-Truelabel2Trueframe
-frame2rawlabel
-vad_post
What exactly are these functions doing?
I'd really appreciate it if you could reply ASAP. The more detail you can provide, the better. Thanks a lot!
Hi, recently I've been looking for deep-learning based VAD models and some googling brought me here. Thanks for open-sourcing your model! :)
My question is: why was MRCG used as an input feature?
To the best of my knowledge, STFT-based mel-spectrograms (or linear-scale magnitudes, whatever) have been widely used as input features for recent deep-learning-based acoustic models. Are there any strengths that MRCG has for a VAD model, compared to other acoustic features like mel-spectrograms?
According to my understanding, when an audio file is passed to the VAD model for analysis, the DNN (as well as the other models) gets loaded to do the job; but when a second audio file is passed to the VAD, the module gets reloaded, which takes up a lot of time. Is there any way to load the module only once and have it predict on many audio files?
Thanks!
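A minimal sketch of the usual fix, assuming a TF1-style checkpoint (the checkpoint path and tensor names below are assumptions, not the repo's actual names): restore the graph once, then reuse the session for every file.

import tensorflow as tf  # TF 1.x, as this repo uses

saver = tf.train.import_meta_graph('model.ckpt.meta')  # hypothetical checkpoint path
sess = tf.Session()
saver.restore(sess, 'model.ckpt')
inputs = tf.get_default_graph().get_tensor_by_name('inputs:0')  # hypothetical names
logits = tf.get_default_graph().get_tensor_by_name('logits:0')

def predict(features):
    # features: (num_frames, feature_dim) array for one audio file
    return sess.run(logits, feed_dict={inputs: features})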
Hello, I am trying to use bDNN to distinguish human voice from natural sound.
Could you tell me how much data you used for training the NN, and what kind of data should be included in the training set besides the example data you have given, such as noise data (dog barks, knocks, etc.)?
This is on the py branch.
I tried to use the park.wav audio file from the recorded dataset:
audio_dir = './data/recorded_data/park.wav'
but when I run the test:
python3 main.py
during the MRCG extraction step, I quickly run out of all available RAM, even though I have 32GB.
After this the script exits with:
fftfilter
73.884435
Traceback (most recent call last):
File "main.py", line 20, in
result = utils.vad_func(audio_dir, mode, th, output_type, is_default)
File "/home/VAD/lib/matlab_py/utils.py", line 170, in vad_func
data_len, winlen, winstep = mrcg_extract(audio_dir)
File "/home/VAD/lib/matlab_py/utils.py", line 142, in mrcg_extract
mrcg_mat = np.transpose(mrcg.mrcg_features(noisy_speech, audio_sr))
File "/home/VAD/lib/matlab_py/mrcg.py", line 24, in mrcg_features
cochlea1 = np.log10(cochleagram(g, int(sampFreq * 0.025), int(sampFreq * 0.010)))
File "/home/VAD/lib/matlab_py/mrcg.py", line 198, in cochleagram
rs = np.square(r)
MemoryError
def Truelabel2Trueframe(TrueLabel_bin, wsize, wstep):
    iidx = 0
    Frame_iidx = 0
    Frame_len = Frame_Length(TrueLabel_bin, wstep, wsize)
    Detect = np.zeros([Frame_len, 1])
    while 1:
        # Take one analysis window of sample-level labels, scaled by 10.
        if iidx + wsize <= len(TrueLabel_bin):
            TrueLabel_frame = TrueLabel_bin[iidx:iidx + wsize - 1] * 10
        else:
            TrueLabel_frame = TrueLabel_bin[iidx:] * 10
        # Threshold the (scaled) sum to decide the frame label.
        if np.sum(TrueLabel_frame) >= wsize / 2:
            TrueLabel_frame = 1
        else:
            TrueLabel_frame = 0
        if Frame_iidx >= len(Detect):
            break
        Detect[Frame_iidx] = TrueLabel_frame
        iidx = iidx + wstep
        Frame_iidx = Frame_iidx + 1
        if iidx > len(TrueLabel_bin):
            break
    return Detect
I want to know why you multiply "TrueLabel_bin[iidx:iidx + wsize - 1]" by 10.
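For what it's worth, as the code reads, the ×10 turns the check into a much lower threshold than a majority vote: each speech sample contributes 10 to the sum, so np.sum(TrueLabel_frame) >= wsize / 2 fires once roughly 5% of the window's samples are labeled speech. A quick arithmetic check (wsize=400, i.e. a hypothetical 25 ms window at 16 kHz):

import numpy as np

wsize = 400
frame = np.zeros(wsize, dtype=int)
frame[:20] = 1                            # 5% of samples marked speech
print(np.sum(frame * 10) >= wsize / 2)    # True: 200 >= 200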