jtkim-kaist / vad
Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.
I cannot download the dataset. The link is unreachable. Can you share the dataset at another link?
I recently searched the term "VAD" on GitHub and found many abandoned projects, some with a decent amount of traction, but mostly lacking pre-trained models.
So I decided to share our new pre-trained VAD:
I am wondering how to create the training dataset. I have understood the format but don't know how to create one. Do we have to manually annotate the training dataset? Manual annotation would be difficult; is there any utility that can create approximate training data which can then be refined manually?
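One way to get approximate labels for later manual refinement is a simple energy threshold. A minimal sketch of my own (not a utility from this repo; the 400/160-sample frame sizes assume 16 kHz audio with 25 ms windows and 10 ms hops):

import numpy as np
import scipy.io.wavfile as wav

def rough_labels(wav_path, frame_len=400, frame_step=160, ratio=0.5):
    """Frame-level 0/1 labels from a crude energy threshold, to be refined by hand."""
    sr, x = wav.read(wav_path)
    x = x.astype(np.float64)
    n_frames = max(0, 1 + (len(x) - frame_len) // frame_step)
    energy = np.array([np.mean(x[i * frame_step:i * frame_step + frame_len] ** 2)
                       for i in range(n_frames)])
    return (energy > ratio * energy.mean()).astype(int)  # 1 = probably speech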
Is there a way in which Octave can be used instead of Matlab for running the main.m file?
When I try to run it using Octave I get:
warning: called from
main at line 5 column 1
warning: function ./lib/matlab/stft/example.m shadows a core library function
warning: called from
main at line 5 column 1
MRCG extraction ...
ans = 1
ans = 1
warning: load: '/home/vlad/Downloads/VAD/lib/matlab/MRCG_features/f_af_bf_cf.mat' found by searching load path
warning: called from
loudness at line 12 column 1
gammatone at line 35 column 10
MRCG_features at line 7 column 3
mrcg_extract at line 26 column 14
vad_func at line 7 column 30
main at line 21 column 12
.........................
error: fftfilt: B must be a vector
error: execution exception in main.m
I have not been able to understand how the training data should be specified.
For example, how should the labels be written? Do we need to specify the times at which
labels occur in the sound file? If yes, how and where?
Hi Kim,
I hope you had a good weekend! When you have a minute, would you please take a look at my questions about your code and let me know what you think? Here they are:
I ran your DNN code with my own TIMIT data and changed the number of iterations to get more data, as shown in the attached VAD_DNN.py. However, the validation accuracy is higher than the training accuracy, which is very strange, and the accuracy and cost on the training data fluctuate dramatically. For example, at the last point the training accuracy is 47% while the validation accuracy is 96%. Could you explain this?
I also trained my own DNN without using the feature extraction and got the opposite result: the accuracy and cost on the validation data fluctuate dramatically. It seems that there is an overfitting problem. The dropout rate is already 0.7. Could you give me some advice?
Hi,
I am a DSP engineer and acoustician. I'm new to the field of machine learning and neural networks, and I am learning a lot working on your code :).
I am trying to understand the feasibility of squeezing your VAD graph architecture to run in real time. Right now I am able to retrain the neural net, modifying a few parameters (mainly the frame window and overlap), using the D2 dataset from your paper. However, it looks like the best ways to improve the graph computation are:
I've tried to perform this training multiple times for different batch sizes (2048, 1024, 512, ...) and have failed so far in this quest. Sometimes the training accuracy reaches a high value, but it never generalizes to the test data.
I believe something is wrong with the training parameters, and that's why I am contacting you for guidelines. Did you ever try to train the neural net with smaller batch sizes? Should something be changed in the net architecture or in the training parameters?
What is the recommended relative size of the audio to be tested versus the batch size? More plainly: may I test a graph trained with a large batch size on a very small audio sample? My intuition is that it is not possible. That's why I would like to retrain the net with a smaller batch size, so that I could perform sequential tests with an audio buffer of reasonable size (say, 500 ms).
Finally, do you have any recommendations for simplifying the network's computational complexity without sacrificing too much performance?
I appreciate your time in sharing your knowledge and experience,
Cheers,
Lucas
Hello Kim.
Using the MATLAB MRCG extraction proved to take a lot of RAM.
I cannot use a file longer than 5 minutes as input on a 16 GB RAM system.
In the paper it was written that a very long sound file was used for training, so how could you compute the MRCG for it?
Thanks, Ravid.
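One common workaround (a sketch of my own, not something from the repo) is to extract features chunk by chunk and concatenate the frames. Frames near chunk borders will differ slightly from a whole-file pass; a small overlap can hide that if it matters:

import numpy as np

def chunked_features(x, sr, extract, chunk_seconds=60.0):
    """Run `extract(chunk, sr)` (e.g. an MRCG extractor) on fixed-size chunks."""
    chunk = int(sr * chunk_seconds)
    feats = [extract(x[i:i + chunk], sr) for i in range(0, len(x), chunk)]
    return np.concatenate(feats, axis=0)  # frame-level features for the whole file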
Hi @jtkim-kaist ,
Thanks for your great project, it helps me much.
I have several questions about this project. Could you help me? Thank you in advance.
1.1 In this image (TIMIT test data, run with your saved model), the output probability is still very high from 1.4 s to 1.5 s. Is that right? I think the probability should be reduced in this period.
1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s the interval is >= 1 frame length, but all of it is labeled as speech. Is that right?
Thanks
Jinhong
Hi everyone,
I added 1 s of silence before and after each utterance from the TIMIT dataset; however, the ACAM and bDNN models couldn't learn from the training data. Instead, they simply predict all samples as 1 (as shown in the following pictures).
The training data is sampled at 16000 Hz and labeled according to the .phn descriptions (see also the following pictures). Does anyone have ideas on how to fix this?
examples of training data: https://mcgill-my.sharepoint.com/:u:/g/personal/yifei_zhao_mail_mcgill_ca/EV0mKeH4U7BFpW_ZmyRBQZQBCDSP0quq4rgVsX0CtNlXfw?e=LfW8oJ
Thx!!!
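For reference, a minimal sketch of deriving sample-level labels from a TIMIT .phn file (assuming the usual "start end phone" lines with sample indices, and treating h#, pau, and epi as non-speech; adjust to your own convention):

import numpy as np

SILENCE = {'h#', 'pau', 'epi'}  # assumed TIMIT silence/pause markers

def phn_to_labels(phn_path, num_samples):
    labels = np.zeros(num_samples, dtype=int)
    with open(phn_path) as f:
        for line in f:
            start, end, phone = line.split()
            if phone not in SILENCE:
                labels[int(start):int(end)] = 1  # mark speech samples
    return labels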
Hello
Is the model compatible with TensorFlow Lite?
I tried converting it but received the following warning:
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If you have a custom implementation for them you can disable this error with --allow_custom_ops. Here is a list of operators for which you will need custom implementations: Enter, Exit, FLOOR, LoopCond, RandomUniform, Range, TensorArrayGatherV3, TensorArrayReadV3, TensorArrayScatterV3, TensorArraySizeV3, TensorArrayV3, TensorArrayWriteV3, TensorFlowLess, TensorFlowMerge, TensorFlowSwitch.
Are you planning to support it in the future?
Thanks.
Can anyone give me the steps to test the pre-trained model?
Hi Kim,
Apologies for disturbing you so many times, but I have trouble understanding your normalization code. I found this code in acoustic_feat_ex.m:
%% Save global normalization factor
global_mean = train_mean / length(audio_list);
global_std = train_std / length(audio_list);
save([save_dir, '/global_normalize_factor'], 'global_mean', 'global_std');
and this in every data_reader_XXX.py:
norm_param = sio.loadmat(self._norm_dir+'/global_normalize_factor.mat')
self.train_mean = norm_param['global_mean']
self.train_std = norm_param['global_std']
My questions are:
1. Why is a single global factor computed in acoustic_feat_ex.m? Why not calculate a factor for every single training file and apply normalization per file?
2. The global factors loaded in data_reader_XXX.py are also used during prediction. Is this a mistake?
Thanks in advance!
Charlie Jiang
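As far as I can tell from the snippets above, the intent is that every utterance is normalized with the same global statistics at both training and prediction time. A minimal sketch of how the saved factors would be applied (the file paths and feature array here are hypothetical):

import numpy as np
import scipy.io as sio

norm = sio.loadmat('global_normalize_factor.mat')   # hypothetical path
mean, std = norm['global_mean'], norm['global_std']
features = np.load('utterance_features.npy')        # hypothetical (frames, dims) array
normalized = (features - mean) / std                # same factors for every utterance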
Hi dear,
I ran your repository, but it showed some errors.
I found there is no data under the data/train/feat/ directory.
Can you help me?
Thanks
weizhe
Hi, Kim. I'm just a beginner, and I met some problems when running the code.
In the file "data_reader_DNN_v2", in next_batch, self._input_spec_list[self._num_file] raises an IndexError: list index out of range when I run it.
I think it is because there is no .txt file in VAD/sample_data.
I don't know if this is true; hoping for your answer.
Hi.
I'm trying to rewrite this project in C++ in search of better interoperability, better user-friendliness, and better performance.
I have now successfully implemented the MRCG extraction and got a huge performance boost as well as much smaller memory usage. However, I have some trouble understanding the script that does the prediction. This script involves a lot of array allocation, and I want to know the purpose of every single line in order to write a better implementation.
So, could you please kindly give an explanation of the bdnn_transform function?
import numpy as np

def bdnn_transform(inputs, w, u):
    """
    :param inputs: shape = (batch_size, feature_size)
    :param w: decides neighbors (half-window size)
    :param u: decides neighbors (subsampling step)
    :return: trans_inputs, shape = (batch_size, feature_size*len(neighbors))
    """
    # Frame offsets: every u-th frame out to +/-w, plus the immediate context {-1, 0, 1}.
    neighbors_1 = np.arange(-w, -u, u)
    neighbors_2 = np.array([-1, 0, 1])
    neighbors_3 = np.arange(1 + u, w + 1, u)
    neighbors = np.concatenate((neighbors_1, neighbors_2, neighbors_3), axis=0)

    # Zero-pad the tail so frames shifted past either end read zeros instead of wrapping.
    pad_size = 2 * w + inputs.shape[0]
    pad_inputs = np.zeros((pad_size, inputs.shape[1]))
    pad_inputs[0:inputs.shape[0], :] = inputs

    # For each offset, shift the padded frames and keep the original length,
    # giving one "neighbor view" of the batch per offset.
    trans_inputs = [
        np.roll(pad_inputs, -1 * neighbors[i], axis=0)[0:inputs.shape[0], :]
        for i in range(neighbors.shape[0])]

    # Stack to (num_neighbors, batch, feat), reorder to (batch, num_neighbors, feat),
    # then flatten each frame's neighbors into one long feature vector.
    trans_inputs = np.asarray(trans_inputs)
    trans_inputs = np.transpose(trans_inputs, [1, 0, 2])
    trans_inputs = np.reshape(trans_inputs, (trans_inputs.shape[0], -1))
    return trans_inputs
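For what it's worth, a quick shape check shows what the function produces (assuming the function above is in scope; w=19 and u=9 are just example values, and the 768-dim feature size is hypothetical):

import numpy as np

x = np.random.randn(100, 768)   # 100 frames of hypothetical 768-dim features
y = bdnn_transform(x, w=19, u=9)
# neighbors = [-19, -10, -1, 0, 1, 10, 19] -> 7 shifted copies of each frame
print(y.shape)                  # (100, 5376) == (100, 768 * 7)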
Thanks in advance.
Hi,
See you again. How is everything going?
Could you please supply the trained model? I just want to run a test.
Also, could the model strip the silence? Or how could I achieve that?
Thx
Can anyone share the noisy data, General Series 6000 Combo and NOISEX-92?
Given that public open source projects get free support from cloud CI services, consider leveraging one of them here. 🎉
https://blogs.mathworks.com/developer/2020/12/15/cloud-ci-services/
Firstly, thanks for your contribution to the VAD field! I have two questions:
Traceback (most recent call last):
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in
globals = debugger.run(setup['file'], None, None, is_module)
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/VAD_test.py", line 62, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/data_reader_bDNN_v2.py", line 94, in next_batch
self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range
---> There is no .txt file in input_dir, so when the 0th index was accessed, an exception occurred.
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))
def next_batch(self, batch_size):
    if self._start_idx == self._w:
        self._inputs = self._padding(
            self._read_input(self._input_file_list[self._num_file],
                             self._input_spec_list[self._num_file]), batch_size, self._w)
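A minimal defensive check (my suggestion, not code from the repo) would fail with a clear message instead of an IndexError when the directory has no .txt files:

import glob

input_dir = './sample_data'  # hypothetical data directory
input_spec_list = sorted(glob.glob(input_dir + '/*.txt'))
if not input_spec_list:
    raise FileNotFoundError(
        'No .txt label files found in %s; prepare them before testing.' % input_dir)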
Hi,
Do you have a readme file for train.py? I want to train the model on my own data and was hoping for some help with its command-line arguments.
Thanks.
Traceback (most recent call last):
File "./lib/python/VAD_test.py", line 60, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/rajeev/pb/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/rajeev/pb/VAD/lib/python/data_reader_bDNN_v2.py", line 144, in next_batch
inputs = utils.bdnn_transform(inputs, self._w, self._u)
AttributeError: module 'lib.matlab_py.utils' has no attribute 'bdnn_transform'
The utils file does not contain any function named bdnn_transform.
Any help?
Hello!
I tried to use the python implementation to detect voice for the first 100s from this video:
https://www.youtube.com/watch?v=gYdHyeo0eec
And these are the results on the spectrogram:
First of all, why are there positive results during the first 47 seconds? Is it just the model not being trained to disregard music?
And secondly, is there a way to merge results together whenever it detects voice, so that there won't be intervals of just fractions of a second one after another?
Thanks very much in advance!
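On the second question, a simple post-processing pass can merge frame-level decisions into segments. A minimal sketch of my own (not part of the toolkit), assuming binary frame decisions at a 10 ms hop:

import numpy as np

def merge_decisions(frames, hop_s=0.010, min_gap_s=0.3, min_seg_s=0.1):
    """Turn 0/1 frame decisions into (start, end) second pairs, closing short gaps."""
    segs, start = [], None
    for i, v in enumerate(np.append(frames, 0)):  # trailing 0 flushes the last segment
        if v and start is None:
            start = i
        elif not v and start is not None:
            segs.append([start * hop_s, i * hop_s])
            start = None
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] < min_gap_s:
            merged[-1][1] = s[1]      # close a gap shorter than min_gap_s
        else:
            merged.append(s)
    return [(a, b) for a, b in merged if b - a >= min_seg_s]  # drop tiny blips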
Could you give the details about training the NN?
As per your comment in one of the closed issues, you mentioned that you concatenate different sound effects to make one long sound wave containing noises, and then pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise wave.
But the datamake script you have uploaded in the speech enhancement toolkit does something different: it picks random intervals from the long concatenated noise wave and mixes them with different sound files.
So in the second case, one speech utterance doesn't get added to the whole of the long concatenated noise; instead, a random interval of the long concatenated noise gets mixed with each sound file.
Can you explain why you took the first approach to create the dataset for training the VAD model? And second, how can I do the same thing you are doing? Should I use FaNT, or does your make_train_noisy.m have options to do so?
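Either way, the core operation is mixing speech into noise at a target SNR. A generic sketch (my own, not the repo's make_train_noisy.m; FaNT does the same job):

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, then add them.

    Assumes float arrays of equal length and sample rate.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Want p_speech / (scale**2 * p_noise) == 10**(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise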
Since there are many issues related to the py branch, I am wondering whether anyone could offer a usage example of VAD_test.py (plus some detailed explanation would be nice)?
Thanks.
When I run 'main.m' in MATLAB, an error occurs:
Traceback (most recent call last):
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
IndexError: list index out of range
Then I try to debug the 'VAD_test.py' with PyCharm:
Traceback (most recent call last):
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/graph_test.py", line 195, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/data_reader_DNN_v2.py", line 81, in next_batch
self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range
I find that 'self._input_spec_list' is empty, and it may be caused by this:
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))
I can't find any '.txt' file in the project; is it missing?
Does anyone have a Mandarin dataset with VAD annotation?
I have prepared my data similarly to the data used with this model, with a 1 s duration for each file and a 16000 Hz sample rate, and put each sound file into its own folder.
Now I am facing this error when building and training the model: (InvalidArgumentError: Dimension -31998 must be >= 0
[[{{node zeros}}]] [Op:IteratorGetNext])
Can you help me?
Thanks for the code repo!
Q1. Could you please explain the meaning of the following functions in VAD/utils.py?
-Truelabel2Trueframe
-frame2rawlabel
-vad_post
What exactly are these functions doing?
I'd really appreciate it if you could reply ASAP. The more detail you can provide, the better. Thanks a lot!
Hi, recently I've been looking for deep-learning based VAD models and some googling brought me here. Thanks for open-sourcing your model! :)
My question is: why was MRCG used as an input feature?
To the best of my knowledge, STFT-based mel-spectrograms (or linear-scale magnitudes, whatever) have been widely used as input features for recent deep-learning-based acoustic models. Are there any strengths that MRCG has for a VAD model, compared to other acoustic features like mel-spectrograms?
According to my understanding, when an audio file is passed to the VAD model for analysis, the DNN (as well as the other models) gets loaded to do the job; but when a second audio file is passed to the VAD, the module gets reloaded, which takes up a lot of time. Is there any way to load the module only once and have it predict on many audio files?
Thanks!
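A minimal sketch of the usual fix, assuming a TF1-style checkpoint (the checkpoint path and tensor names below are assumptions, not the repo's actual names): restore the graph once, then reuse the session for every file.

import tensorflow as tf  # TF 1.x, as this repo uses

saver = tf.train.import_meta_graph('model.ckpt.meta')  # hypothetical checkpoint path
sess = tf.Session()
saver.restore(sess, 'model.ckpt')
inputs = tf.get_default_graph().get_tensor_by_name('inputs:0')  # hypothetical names
logits = tf.get_default_graph().get_tensor_by_name('logits:0')

def predict(features):
    # features: (num_frames, feature_dim) array for one audio file
    return sess.run(logits, feed_dict={inputs: features})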
Hello, I am trying to use bDNN to distinguish human voice from natural sound.
Could you tell me how much data you used for training the NN, and what kind of data should be included in the training set besides the example data you have given, such as noise data (dog barks, knocks, etc.)?
This is on the py branch.
I tried to use the park.wav audio file from the recorded dataset:
audio_dir = './data/recorded_data/park.wav'
but when I run the test:
python3 main.py
during the MRCG extraction step, I quickly run out of all available RAM, even though I have 32GB.
After this the script exits with:
fftfilter
73.884435
Traceback (most recent call last):
File "main.py", line 20, in
result = utils.vad_func(audio_dir, mode, th, output_type, is_default)
File "/home/VAD/lib/matlab_py/utils.py", line 170, in vad_func
data_len, winlen, winstep = mrcg_extract(audio_dir)
File "/home/VAD/lib/matlab_py/utils.py", line 142, in mrcg_extract
mrcg_mat = np.transpose(mrcg.mrcg_features(noisy_speech, audio_sr))
File "/home/VAD/lib/matlab_py/mrcg.py", line 24, in mrcg_features
cochlea1 = np.log10(cochleagram(g, int(sampFreq * 0.025), int(sampFreq * 0.010)))
File "/home/VAD/lib/matlab_py/mrcg.py", line 198, in cochleagram
rs = np.square(r)
MemoryError
def Truelabel2Trueframe(TrueLabel_bin, wsize, wstep):
    iidx = 0
    Frame_iidx = 0
    Frame_len = Frame_Length(TrueLabel_bin, wstep, wsize)
    Detect = np.zeros([Frame_len, 1])
    while 1:
        # Take one analysis window of sample-level labels, scaled by 10.
        if iidx + wsize <= len(TrueLabel_bin):
            TrueLabel_frame = TrueLabel_bin[iidx:iidx + wsize - 1] * 10
        else:
            TrueLabel_frame = TrueLabel_bin[iidx:] * 10
        # Threshold the (scaled) sum to decide the frame label.
        if np.sum(TrueLabel_frame) >= wsize / 2:
            TrueLabel_frame = 1
        else:
            TrueLabel_frame = 0
        if Frame_iidx >= len(Detect):
            break
        Detect[Frame_iidx] = TrueLabel_frame
        iidx = iidx + wstep
        Frame_iidx = Frame_iidx + 1
        if iidx > len(TrueLabel_bin):
            break
    return Detect
I want to know why you multiply "TrueLabel_bin[iidx:iidx + wsize - 1]" by 10.
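For what it's worth, as the code reads, the ×10 turns the check into a much lower threshold than a majority vote: each speech sample contributes 10 to the sum, so np.sum(TrueLabel_frame) >= wsize / 2 fires once roughly 5% of the window's samples are labeled speech. A quick arithmetic check (wsize=400, i.e. a hypothetical 25 ms window at 16 kHz):

import numpy as np

wsize = 400
frame = np.zeros(wsize, dtype=int)
frame[:20] = 1                            # 5% of samples marked speech
print(np.sum(frame * 10) >= wsize / 2)    # True: 200 >= 200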