jtkim-kaist / vad

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Python 14.77% MATLAB 84.41% HTML 0.78% M 0.01% Shell 0.04%
vad dnn lstm bdnn acam attention speech data voice-detection speech-recognition

vad's Introduction

Voice Activity Detection Toolkit

This toolkit provides the voice activity detection (VAD) code and our recorded dataset.

Update

2019-02-11

  • This toolkit has been accepted for presentation at ICASSP 2019!

2018-12-11

  • Post-processing has been updated.

2018-06-04

  • Good news! We have uploaded a speech enhancement toolkit based on deep neural networks. It provides several useful components, such as a data generation script. You can find the toolkit here

2018-04-09

  • A test script written entirely in Python has been uploaded to the 'py' branch.

Introduction


The VAD toolkit in this project was used in the following paper:

J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1.

URL: https://ieeexplore.ieee.org/document/8309294/

If our VAD toolkit supports your research, we would appreciate it if you cite this paper.

ACAM is based on the recurrent attention model (RAM) [1]; implementations of RAM can be found in the jlindsey15 and jtkim-kaist repositories.

VAD in this toolkit follows the procedure below:

Acoustic feature extraction

In this toolkit, we use the multi-resolution cochleagram (MRCG) [2] as the acoustic feature; its extraction is implemented in MATLAB. Note that MRCG extraction takes relatively long compared to running the classifier.

Classifier

This toolkit supports four types of MRCG-based classifiers, implemented in Python with TensorFlow:

  1. Adaptive context attention model (ACAM)
  2. Boosted deep neural network (bDNN) [2]
  3. Deep neural network (DNN) [2]
  4. Long short term memory recurrent neural network (LSTM-RNN) [3]

Prerequisites

  • Python 3

  • TensorFlow 1.1-1.3

  • MATLAB 2017b (will be deprecated)

Example

The default model provided in this toolkit was trained on our own dataset, which is described in our paper. The example MATLAB script is main.m; just run it in MATLAB. The result will look like the following figure.

Note: To apply this toolkit to other speech data, the data should be sampled at 16 kHz.

[Figure: example VAD result]
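
As a side note, speech recorded at other sampling rates has to be resampled first. A minimal sketch (not part of the toolkit) of converting a file to 16 kHz before running main.m, assuming the librosa and soundfile Python packages and placeholder file names:

# Illustrative helper, not part of this repository: convert any wav to 16 kHz mono.
import librosa
import soundfile as sf

def to_16k(in_path, out_path):
    # librosa.load resamples to the requested rate and downmixes to mono
    audio, _ = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, audio, 16000)

to_16k('my_speech.wav', 'my_speech_16k.wav')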

Post processing

Many people have asked about post-processing, so I have updated the toolkit with it.

In the py branch, you can see the post-processing parameters of utils.vad_func in main.py.

Each parameter handles one of the following error types:

[Figure: illustration of the four VAD error types]

FEC (front-end clipping): hang_before

MSC (mid-speech clipping): off_on_length

OVER (carry-over after speech ends): hang_over

NDS (noise detected as speech): on_off_length

Note that there is NO single optimal setting; the best parameter set depends on the application.
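
To make these knobs concrete, below is a generic sketch of this kind of frame-level smoothing. It is my own illustration with assumed semantics and simplified parameter names, not the implementation behind utils.vad_func:

import numpy as np

def runs_of(frames):
    # Return (value, start, end) for each maximal run of equal values.
    edges = np.flatnonzero(np.diff(frames)) + 1
    bounds = np.concatenate(([0], edges, [len(frames)]))
    return [(int(frames[s]), int(s), int(e)) for s, e in zip(bounds[:-1], bounds[1:])]

def smooth_vad(frames, hang_before=5, hang_over=5, min_gap=10, min_speech=5):
    # frames: 1-D array of 0/1 frame decisions; all lengths are in frames.
    frames = np.asarray(frames, dtype=int).copy()

    # 1) Fill short non-speech gaps inside speech (targets mid-speech clipping, MSC).
    for val, s, e in runs_of(frames):
        if val == 0 and 0 < s and e < len(frames) and (e - s) < min_gap:
            frames[s:e] = 1

    # 2) Remove isolated short speech bursts (targets noise detected as speech, NDS).
    for val, s, e in runs_of(frames):
        if val == 1 and (e - s) < min_speech:
            frames[s:e] = 0

    # 3) Hangover: grow each surviving speech run at both ends
    #    (targets front-end clipping, FEC, and clipped word endings, OVER).
    out = frames.copy()
    for val, s, e in runs_of(frames):
        if val == 1:
            out[max(0, s - hang_before):s] = 1
            out[e:e + hang_over] = 1
    return out

Loosening these thresholds reduces clipped speech at the cost of more false alarms, which is why no single setting is optimal.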

Enjoy.

Training

  1. We attached a sample database at 'path/to/project/data/raw'. Please refer to it to understand the data format.
  2. The model specifications are described in 'path/to/project/configure'.
  3. The training procedure has 2 steps: (i) MRCG extraction; (ii) Model training.

Note: Do not forget to add this project's path in MATLAB.

# train.sh
# train script options
# m 0 : ACAM
# m 1 : bDNN
# m 2 : DNN
# m 3 : LSTM
# e : extract MRCG feature (1) or not (0)

python3 $train -m 0 -e 1 --prj_dir=$curdir
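
For example, following the option comments above, retraining the LSTM model on MRCG features that were already extracted would be invoked as:

python3 $train -m 3 -e 0 --prj_dir=$curdir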

Recorded Dataset

Our recorded dataset is freely available: Download

Specification

  • Environments

Bus stop, construction site, park, and room.

  • Recording device

A smartphone (Samsung Galaxy S8)

In each environment, conversational speech by two Korean male speakers was recorded. The ground-truth labels were manually annotated. Because the recording was carried out in the real world, the dataset includes unexpected noises such as a baby crying, insects chirping, and mouse clicks. The details of the dataset are given in the following table:

                 Bus stop   Cons. site   Park    Room    Overall
Dur. (min)       30.02      30.03        30.07   30.05   120.17
Avg. SNR (dB)    5.61       2.05         5.71    18.26   7.91
% of speech      40.12      26.71        26.85   30.44   31.03

TODO List

  1. Although MRCG shows good performance, its extraction time is somewhat long, so we plan to substitute another feature such as the spectrogram.

Trouble Shooting

If you find any errors in the code, please contact us.

E-mail: [email protected]

Copyright

Copyright (c) 2017 Speech and Audio Information Laboratory, KAIST, South Korea

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.

[2] X.-L. Zhang and D. Wang, “Boosting contextual information for deep neural network based voice activity detection,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 2, pp. 252-264, 2016.

[3] R. Zazo Candil et al., “Feature learning with raw-waveform CLDNNs for voice activity detection,” in Proc. Interspeech, 2016.

Acknowledgement

Jaeseok Kim (KAIST) contributed to this project by converting the MATLAB scripts to Python.


vad's Issues

Process many audio files with the VAD model

According to my understanding, when an audio file is passed to the VAD model to be analyzed, the DNN (as well as other models) gets loaded to do the job, but when a second audio file is passed to the VAD, the module gets reloaded, which takes up a lot of time. Is there any way to load the module only once and have the module predict on many audio files?

Thanks!
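
For reference, the usual TensorFlow 1.x pattern is to restore the graph and create the session once, then reuse both for every file. A rough sketch only; the checkpoint path and tensor names below are hypothetical and are not the ones this repository uses:

import tensorflow as tf

# Restore the graph and weights once (paths and tensor names are made up for illustration).
sess = tf.Session()
saver = tf.train.import_meta_graph('model/checkpoint.ckpt.meta')
saver.restore(sess, 'model/checkpoint.ckpt')
graph = tf.get_default_graph()
inputs = graph.get_tensor_by_name('inputs:0')
prob = graph.get_tensor_by_name('prob:0')

# Then loop over files, feeding each feature matrix through the same session.
feature_matrices = []  # hypothetical: one MRCG feature matrix per audio file
for feat in feature_matrices:
    prediction = sess.run(prob, feed_dict={inputs: feat})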

Tensorflow Lite Compatibility

Hello
Is the model compatible with tensorflow lite?

I tried converting it but received the following warning:
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If you have a custom implementation for them you can disable this error with --allow_custom_ops. Here is a list of operators for which you will need custom implementations: Enter, Exit, FLOOR, LoopCond, RandomUniform, Range, TensorArrayGatherV3, TensorArrayReadV3, TensorArrayScatterV3, TensorArraySizeV3, TensorArrayV3, TensorArrayWriteV3, TensorFlowLess, TensorFlowMerge, TensorFlowSwitch.

Are you planning to do it in the future?

Thanks.

MRCG RAM consumption

Hello Kim.

Using the MATLAB MRCG implementation takes a lot of RAM.
I cannot use a file longer than about 5 minutes as input on a 16 GB RAM system.
In the paper it is written that a very long sound file was used for training, so how could you compute the MRCG for it?

Thanks, Ravid.

How should the results be interpreted?

Hello!

I tried to use the python implementation to detect voice for the first 100s from this video:
https://www.youtube.com/watch?v=gYdHyeo0eec

And these are the results on the spectrogram:
screenshot

First of all, why are there positive results during the first 47 seconds? Is it just the model not being trained to disregard music?
And secondly, is there a way to merge the results whenever it detects voice, so that there won't be intervals of just fractions of a second one after another?

Thanks very much in advance!
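
Regarding the second question, one generic way to obtain merged intervals (my own sketch, independent of this toolkit) is to convert the 0/1 frame decisions into (start, end) segments and join segments separated by a short gap:

import numpy as np

def frames_to_segments(frames, frame_step_s=0.01, max_gap_s=0.3):
    # Turn 0/1 frame decisions into merged (start_s, end_s) speech segments.
    frames = np.asarray(frames, dtype=int)
    changes = np.flatnonzero(np.diff(np.concatenate(([0], frames, [0]))))
    starts, ends = changes[0::2], changes[1::2]        # run boundaries, in frames
    segments = []
    for s, e in zip(starts * frame_step_s, ends * frame_step_s):
        if segments and s - segments[-1][1] <= max_gap_s:
            segments[-1][1] = e                        # merge across a short gap
        else:
            segments.append([s, e])
    return [(float(s), float(e)) for s, e in segments]

Here frame_step_s and max_gap_s are assumptions: the frame step of the features and the largest silence you are willing to bridge.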

Questions about the data normalization

Hi Kim,

Apologies for disturbing you so many times, but I have a problem understanding your normalization code. I found this code in acoustic_feat_ex.m:

%% Save global normalization factor

global_mean = train_mean / length(audio_list);
global_std = train_std / length(audio_list);
save([save_dir, '/global_normalize_factor'], 'global_mean', 'global_std');

and in every data_reader_XXX.py:

norm_param = sio.loadmat(self._norm_dir+'/global_normalize_factor.mat')
self.train_mean = norm_param['global_mean']
self.train_std = norm_param['global_std']

My questions are:

  1. Is a single global normalization factor for the whole dataset saved in acoustic_feat_ex.m? Why not calculate a factor for every single training file and apply normalization per file?
  2. If so, why is this factor also used during the prediction phase (the data_reader_XXX.py files are also used during prediction)? Is this a mistake?

Thanks in advance!
Charlie Jiang
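
For context, a generic sketch of global feature normalization (my own illustration with hypothetical file names): the statistics are computed once over the training set and the same factors are reused on validation and test features, which is common practice in itself.

import numpy as np

train_files = ['feat_0.npy', 'feat_1.npy']        # hypothetical feature files
stacked = np.concatenate([np.load(f) for f in train_files], axis=0)
global_mean = stacked.mean(axis=0)
global_std = stacked.std(axis=0)

def normalize(feat):
    # The same training-set statistics are applied at training and prediction time.
    return (feat - global_mean) / (global_std + 1e-8)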

Exception while executing vad_test.py.

Traceback (most recent call last):
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in
globals = debugger.run(setup['file'], None, None, is_module)
File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/VAD_test.py", line 62, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/kbarakam/Desktop/VAD/VAD/lib/python/data_reader_bDNN_v2.py", line 94, in next_batch
self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range

---> There is no .txt file in input_dir, so when the 0th index was accessed, an exception occurred.
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))

def next_batch(self, batch_size):
    if self._start_idx == self._w:
        self._inputs = self._padding(
            self._read_input(self._input_file_list[self._num_file],
                             self._input_spec_list[self._num_file]), batch_size, self._w)
  1. Can you upload the text file in input_dir?
  2. Also, can you give more info regarding data_len = int(arg) which is passed as a parameter to graph_test.do_test()?
  3. Do you have comparison results of this VAD algorithm against WebRTC VAD? If so, can you post them here?

How to create the dataset

I am wondering how to create the training dataset. I have understood the format but don't know how to create one. Do we have to manually annotate the training dataset? Manual annotation would be difficult; is there any utility which can create approximate training data that can then be refined manually?
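
As a rough illustration of one way to produce labels (generic, not necessarily this repository's exact format), manually marked speech intervals can be turned into a per-sample 0/1 label vector for a 16 kHz recording:

import numpy as np

def intervals_to_labels(speech_intervals_s, duration_s, sr=16000):
    # speech_intervals_s: list of (start_s, end_s) speech regions, e.g. from an annotation tool.
    labels = np.zeros(int(duration_s * sr), dtype=np.int8)
    for start, end in speech_intervals_s:
        labels[int(start * sr):int(end * sr)] = 1
    return labels

# Example: two speech regions in a 10 s file.
labels = intervals_to_labels([(1.2, 3.4), (5.0, 8.7)], duration_s=10.0)

In practice, approximate labels are often bootstrapped with an existing VAD or with energy thresholding on clean speech before noise is added, and then refined by hand.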

Memory leak for larger audio files

This is on the py branch.

I tried to use the park.wav audio file from the recorded dataset:

audio_dir = './data/recorded_data/park.wav'

but when I run the test:

python3 main.py

during the MRCG extraction step, I quickly run out of all available RAM, even though I have 32GB.

After this the script exits with:

fftfilter
73.884435
Traceback (most recent call last):
File "main.py", line 20, in
result = utils.vad_func(audio_dir, mode, th, output_type, is_default)
File "/home/VAD/lib/matlab_py/utils.py", line 170, in vad_func
data_len, winlen, winstep = mrcg_extract(audio_dir)
File "/home/VAD/lib/matlab_py/utils.py", line 142, in mrcg_extract
mrcg_mat = np.transpose(mrcg.mrcg_features(noisy_speech, audio_sr))
File "/home/VAD/lib/matlab_py/mrcg.py", line 24, in mrcg_features
cochlea1 = np.log10(cochleagram(g, int(sampFreq * 0.025), int(sampFreq * 0.010)))
File "/home/VAD/lib/matlab_py/mrcg.py", line 198, in cochleagram
rs = np.square(r)
MemoryError
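
A common workaround (not something the repository itself provides, as far as this issue shows) is to split long recordings into shorter chunks, extract features per chunk, and concatenate the results. A rough sketch, with the extractor passed in as a function:

import numpy as np
import soundfile as sf

def extract_in_chunks(wav_path, extract_fn, chunk_s=60, sr=16000):
    # extract_fn: any feature extractor taking (samples, sample_rate) and returning a (frames, feat) array.
    audio, file_sr = sf.read(wav_path)
    assert file_sr == sr, 'expected a 16 kHz file'
    hop = chunk_s * sr
    feats = [extract_fn(audio[i:i + hop], sr) for i in range(0, len(audio), hop)]
    return np.concatenate(feats, axis=0)

Frames near chunk boundaries lose some context, so overlapping the chunks slightly and trimming the overlap is safer.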

question about DNN learning curve

Hi Kim,
I hope you have a good weekend! When you have a minute, would you please take a look at my questions about your code and let me know what you think? Here are the questions:

  1. I have run your DNN code with my own TIMIT data and changed the number of iterations to get more data, as shown in the attached VAD_DNN.py. However, the validation accuracy is higher than the training accuracy, which is very strange, and the accuracy and cost on the training data fluctuate dramatically. For example, at the last point the training accuracy is 47% while the validation accuracy is 96%. Could you explain this?

  2. I also trained my own DNN without using the feature extraction and got the opposite result: the accuracy and cost on the validation data fluctuate dramatically. It seems there is an overfitting problem. The dropout rate is already 0.7. Could you give me some advice?

I really appreciate your feedback
[Attachments: VAD_DNN.py, your DNN model figure, my own learning curve figure]

data and model question

Hi @jtkim-kaist ,

Thanks for your great project, it helps me a lot.

I have several questions about this project. Could you help me? Thank you in advance.

  1. In your prepared data, the labels are not that accurate: short silences in the middle of speech are labeled as speech. I measured their length and it sometimes exceeds 100 ms. Does this degrade the performance? See the two figures below.
  2. For TIMIT, there are *.phn, *.txt and *.wrd files for each audio file. My question is how to label the data: do you label the whole audio file as speech, or use *.wrd to label each word?
  3. The normalization: I see that you compute a mean and variance for each feature over the whole dataset.
  4. In Truelabel2Trueframe.m, line 13, I don't understand why you multiply by 10; the input is 0 or 1, not 0.1.
  5. In the ACAM model, I found that the final fully connected layer output is 7-dimensional, the same as the number of input frames; the activation function is a sigmoid to get the logit, and tf.square(logit - labels) is used as the cost function. My question is: if we set the fully connected layer output to 2 dimensions, apply softmax, and use a cross-entropy cost, with the label set to 1 if sum(labels(n:n+6)) > 3, is that OK? It is very popular for classification tasks.

1.1 This image shows TIMIT test data; I ran your saved model on it. From 1.4 s to 1.5 s, the output probability is still very high. Is that right? I think the probability should be reduced in this period.
image

1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s the silence is >= 1 frame length, but all of it is labeled as speech. Is that right?

image

Thanks
Jinhong

Can Octave be used?

Is there a way to use Octave instead of MATLAB for running the main.m file?
When I try to run it using Octave I get:

warning: called from
    main at line 5 column 1
warning: function ./lib/matlab/stft/example.m shadows a core library function
warning: called from
    main at line 5 column 1
MRCG extraction ...
ans = 1
ans = 1
warning: load: '/home/vlad/Downloads/VAD/lib/matlab/MRCG_features/f_af_bf_cf.mat' found by searching load path
warning: called from
    loudness at line 12 column 1
    gammatone at line 35 column 10
    MRCG_features at line 7 column 3
    mrcg_extract at line 26 column 14
    vad_func at line 7 column 30
    main at line 21 column 12
.........................
error: fftfilt: B must be a vector
error: execution exception in main.m

Train

Could you give some details about training the NN?

Training data

Hello, I am trying to use bDNN to distinguish human voice and natural sound.
Could you tell me how much data you used for training the NN, and what kind of data should be included in the training set besides the example data you have given, such as noise data (dog barks, knocks, etc.)?

Meaning of different functions

Thanks for the code repo!

Q1. Could you please explain the meaning of the following functions in VAD/utils.py?
-Truelabel2Trueframe
-frame2rawlabel
-vad_post

What exactly are these functions doing?

I would really appreciate a prompt reply; the more detail you can provide, the better. Thanks a lot!

IndexError: list index out of range

When I run 'main.m' in MATLAB, an error occurs:

Traceback (most recent call last):
  File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
    pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
IndexError: list index out of range

Then I try to debug the 'VAD_test.py' with PyCharm:

Traceback (most recent call last):
  File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/VAD_test.py", line 102, in <module>
    pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
  File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/graph_test.py", line 195, in do_test
    valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
  File "/Users/tianyu/Documents/MATLAB/VAD-master/lib/python/data_reader_DNN_v2.py", line 81, in next_batch
    self._input_spec_list[self._num_file]), batch_size, self._w)
IndexError: list index out of range

I find that 'self._input_spec_list' is empty, and it may be caused by this:
self._input_spec_list = sorted(glob.glob(input_dir+'/*.txt'))
I can't find any '.txt' file in the project; is it missing?

Question: Why was MRCG selected as input feature?

Hi, recently I've been looking for deep-learning based VAD models and some googling brought me here. Thanks for open-sourcing your model! :)

My question is: why was MRCG used as an input feature?

To the best of my knowledge, STFT based mel-spectrograms (or linear-scale magnitudes, whatever) have been widely used as an input feature of recent deep-learning based acoustic models. Are there any strengths that MRCG have in VAD model, compared to other acoustic features like mel-spectrograms?

Could you please supply the trained model ?

Hi,
See you again.
How is everything going ?

Could you please supply the trained model? I just want to run a test.
Also, could the model strip the silence? Or how could I achieve that?

Thx

Problem with the training data

Hi everyone,

I added 1 s of silence before and after each utterance from the TIMIT dataset; however, the ACAM and bDNN models couldn't learn from the training data. Instead, they simply predict all samples as 1 (as shown in the following pictures).

results
problem

The training data is sampled at 16000 Hz and labeled according to the .phn descriptions (see also the following pictures). Does anyone have an idea how to fix this?
1s_utterence

examples of training data: https://mcgill-my.sharepoint.com/:u:/g/personal/yifei_zhao_mail_mcgill_ca/EV0mKeH4U7BFpW_ZmyRBQZQBCDSP0quq4rgVsX0CtNlXfw?e=LfW8oJ

Thx!!!

Any usage example of VAD_test.py?

Since there are many issues related to the py branch, I am wondering whether anyone could offer a usage example of VAD_test.py (some detailed explanation would be nice, too)?

Thanks.

Readme file for train.py

Hi,

Do you have a readme file for train.py? I want to train the model on my own data and would appreciate some help with the command-line arguments.

Thanks.

Training Datamake

As per your comment in one of the closed issues, you mentioned that you concatenate different sound effects to make one long sound wave containing noises, and then pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise waveform.

But the datamake you uploaded in the speech enhancement toolkit does something different: it picks random intervals from the long concatenated noise waveform and mixes them with different sound files.

So in the second case, one speech utterance doesn't get added to the whole of the long concatenated noise; instead, a random interval of the long concatenated noise gets mixed with each sound file.

Can you explain why you took the first approach to create the dataset for training the VAD model? And second, how can I do the same thing you are doing? Should I use FaNT, or does your make_train_noisy.m have options to do so?

I used my own dataset to train the model but got error in Dimension

I have prepared my data similarly to the data used in this model, with a 1 s duration for each file and a 16000 Hz sample rate, and put each sound file into a folder.

Now I am facing this error when building and training the model:
InvalidArgumentError: Dimension -31998 must be >= 0 [[{{node zeros}}]] [Op:IteratorGetNext]

Can you help me?

data format

I have not been able to understand how the training data should be specified, for example how the labels should be written. Do we need to specify the times at which labels occur in the sound file? If yes, how and where?

Training the net with smaller batch sizes

Hi,

I am a DSP engineer and acoustician. I'm new to the field of machine learning and neural networks, and I am learning a lot working on your code :).
I am trying to understand the feasibility of squeezing your VAD graph architecture so it can be used in real time. Right now I am able to retrain the neural net, modifying a few parameters (mainly the frame window and overlap), using the D2 dataset from your paper. However, it looks like the best ways to improve the graph computation are:

  1. To use a less complex feature extractor than the MRCG (I intend to investigate this hypothesis later).
  2. To retrain the neural net with smaller batch sizes on D2. In your paper you use 4096, which corresponds to several seconds of audio.

I've tried to perform this training multiple times for different batch sizes (2048, 1024, 512, ...) and have failed so far in this quest. Sometimes the training accuracy reaches a high value, but it never generalizes to the test data.

I believe something is wrong in the training parameters, and that is why I am contacting you for guidance. Did you ever try to train the neural net with smaller batch sizes? Should something be changed in the net architecture or in the training parameters?

What is the recommended relative size of the audio to be tested versus the batch size? More clearly: may I test a graph trained with a large batch size on a very small audio sample? My intuition is that it is not possible. That is why I would like to retrain the net with a smaller batch size, so that I could perform sequential tests with an audio buffer of reasonable size (say, 500 ms).

Finally, do you have any recommendations for simplifying the network's computational complexity without sacrificing too much performance?

I appreciate your time in sharing your knowledge and experience,
Cheers,
Lucas

A question from a beginner

Hi Kim, I'm just a beginner, and I met some problems when I ran the code.
In the file "data_reader_DNN_v2", in next_batch, self._input_spec_list[self._num_file] raises an IndexError: list index out of range when I run it.
I think this is because there is no .txt file in VAD/sample_data.
I don't know if this is true; hoping for your answer.

Hi I want to know why you multiply "TrueLabel_bin[iidx:iidx + wsize - 1]" by 10 in func Truelabel2Trueframe

def Truelabel2Trueframe(TrueLabel_bin, wsize, wstep):
    iidx = 0
    Frame_iidx = 0
    Frame_len = Frame_Length(TrueLabel_bin, wstep, wsize)
    Detect = np.zeros([Frame_len, 1])
    while 1:
        if iidx + wsize <= len(TrueLabel_bin):
            TrueLabel_frame = TrueLabel_bin[iidx:iidx + wsize - 1] * 10
        else:
            TrueLabel_frame = TrueLabel_bin[iidx:] * 10

        if (np.sum(TrueLabel_frame) >= wsize / 2):
            TrueLabel_frame = 1
        else:
            TrueLabel_frame = 0

        if (Frame_iidx >= len(Detect)):
            break

        Detect[Frame_iidx] = TrueLabel_frame
        iidx = iidx + wstep
        Frame_iidx = Frame_iidx + 1
        if (iidx > len(TrueLabel_bin)):
            break

    return Detect

I want to know why you multiply "TrueLabel_bin[iidx:iidx + wsize - 1]" by 10

Error while running main.py file

Traceback (most recent call last):
File "./lib/python/VAD_test.py", line 60, in
pred, label = graph_test.do_test(graph_list[-1], data_dir, norm_dir, data_len, is_default, mode)
File "/home/rajeev/pb/VAD/lib/python/graph_test.py", line 89, in do_test
valid_inputs, valid_labels = valid_data_set.next_batch(valid_batch_size)
File "/home/rajeev/pb/VAD/lib/python/data_reader_bDNN_v2.py", line 144, in next_batch
inputs = utils.bdnn_transform(inputs, self._w, self._u)
AttributeError: module 'lib.matlab_py.utils' has no attribute 'bdnn_transform'

The utils file does not contain any function named bdnn_transform.
Any help?

Questions about understanding the bdnn_transform function.

Hi.
I'm trying to rewrite this project in C++ in search of better interoperability, better user friendliness and better performance.

I have successfully implemented MRCG extraction and got a huge quality boost as well as small memory usage. However, I have some problems understanding the script that does the prediction. This script involves a lot of array allocation, and I want to know the purpose of every single line in order to write a better implementation.

So, could you please kindly give an explanation of the bdnn_transform function?

def bdnn_transform(inputs, w, u):

    # """
    # :param inputs. shape = (batch_size, feature_size)
    # :param w : decide neighbors
    # :param u : decide neighbors
    # :return: trans_inputs. shape = (batch_size, feature_size*len(neighbors))
    # """

    neighbors_1 = np.arange(-w, -u, u)
    neighbors_2 = np.array([-1, 0, 1])
    neighbors_3 = np.arange(1+u, w+1, u)

    neighbors = np.concatenate((neighbors_1, neighbors_2, neighbors_3), axis=0)

    pad_size = 2*w + inputs.shape[0]
    pad_inputs = np.zeros((pad_size, inputs.shape[1]))
    pad_inputs[0:inputs.shape[0], :] = inputs

    trans_inputs = [
        np.roll(pad_inputs, -1 * neighbors[i], axis=0)[0:inputs.shape[0], :]
        for i in range(neighbors.shape[0])]

    trans_inputs = np.asarray(trans_inputs)
    trans_inputs = np.transpose(trans_inputs, [1, 0, 2])
    trans_inputs = np.reshape(trans_inputs, (trans_inputs.shape[0], -1))

    return trans_inputs
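
From reading the quoted code, each frame appears to be stacked with neighbouring frames (roughly every u-th frame from -w up to w, plus the immediate neighbours -1, 0, 1, with zero padding past either end), so a (batch, feature) matrix becomes (batch, feature × num_offsets). A quick sanity check of that reading, using the function quoted above:

import numpy as np

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 features each
y = bdnn_transform(x, w=2, u=1)                # offsets become [-2, -1, 0, 1, 2]
print(y.shape)                                 # (6, 10): each frame plus 4 neighbours
print(y[2])                                    # frames 0..4 of x laid side by side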

Thanks in advance.

Missing some data

Hi dear

I ran your repository, but it showed an error.
I find there is no data under the data/train/feat/ directory.
Can you help me?

Thanks
weizhe

The difference between VAD_LSTM and VAD_LSTM2

Firstly, thanks for your contribution to the VAD field! I have two questions:

  1. Is the training data fed to the network the MRCG feature?
  2. What is the difference between VAD_LSTM and VAD_LSTM2?

Thanks!
