jefflai108 / contrastive-predictive-coding-pytorch
Contrastive Predictive Coding for Automatic Speaker Verification
License: MIT License
@jefflai108 Could you provide the spk2idx file used when training spk_class.py?
Thanks for sharing the CPC code.
I read the code and found that the provided Dataset class reads .h5 files. From the open ASR website and the information in the paper, I can only download files with extension .flac or .txt.
Could you explain explicitly how you prepared your dataset?
In the calculation of the NCE loss, the softmax is not given an explicit dimension, and by default PyTorch uses dim=1 for 2D input.
The loss in the paper keeps the context c_t fixed and 'matches' this context against the actual values of z_t. By using dim=1 instead of dim=0, we instead compute the 'match' between a fixed z_t and the c_t generated by each example in the batch.
The softmax should be taken over the columns of the 8x8 matrix to capture the true loss function defined in the CPC paper.
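To make the dim question concrete, here is a small sketch (shapes are illustrative, not the repo's actual tensors) of what each choice normalizes over:

```python
import torch

torch.manual_seed(0)
batch = 8
# scores[i, j]: similarity between the prediction for example i
# and the actual encoding of example j
scores = torch.randn(batch, batch)

# dim=1 (PyTorch's default for 2D input): each ROW sums to 1,
# i.e. one prediction is normalized against all batch encodings
probs_rows = torch.softmax(scores, dim=1)
# dim=0: each COLUMN sums to 1, i.e. one encoding is normalized
# against the predictions from every example in the batch
probs_cols = torch.softmax(scores, dim=0)

assert torch.allclose(probs_rows.sum(dim=1), torch.ones(batch))
assert torch.allclose(probs_cols.sum(dim=0), torch.ones(batch))
```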
In the implementation, a random position in the sequence is used to compute the NCE loss, whereas the paper says the GRU output at every timestep is used to predict 12 timesteps into the future:
"The output of the GRU at every timestep is used as the context c from which we predict 12 timesteps in the future using the contrastive loss"
Was this decision made to reduce training time?
I see in your implementation that you feed the entire signal into the encoder, while the paper notes that each timestep should be inserted separately.
When you feed the entire signal into the encoder, you get overlapping features from the conv kernel (except in the case where the stride equals the kernel size).
Why did you implement it like that? Do you think it does not matter?
Thanks!
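For what it's worth, the overlap described above can be seen with a toy 1-D convolution (a sketch with made-up sizes, not the repo's actual encoder):

```python
import torch

# kernel_size > stride: adjacent output frames share input samples
conv = torch.nn.Conv1d(in_channels=1, out_channels=1,
                       kernel_size=10, stride=5)
x = torch.randn(1, 1, 100)  # (batch, channels, samples)
y = conv(x)

# output length = floor((100 - 10) / 5) + 1 = 19 frames,
# each overlapping its neighbor by 10 - 5 = 5 input samples
assert y.shape == (1, 1, 19)
```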
https://arxiv.org/pdf/1807.03748.pdf
If you look at equation 4 from the paper, the log softmax should be over N-1 negative samples and 1 positive sample. In your implementation, the N-1 negative samples are actually self.time_step-1 samples. Taking log_softmax over the batch seems wrong. We switched it to log_softmax over time, and training is more stable and accuracy has gone up on our toy dataset. However, that is only a partial fix.
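For reference, equation 4 of the paper written out directly for one positive sample and N-1 negatives (a hedged sketch with made-up tensors, not the repo's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_neg = 256, 7
c = torch.randn(dim)             # context-based prediction, W_k c_t
z_pos = torch.randn(dim)         # true future encoding z_{t+k}
z_neg = torch.randn(n_neg, dim)  # N-1 negative samples

# scores for all N candidates: the positive first, then the negatives
scores = torch.cat([(c * z_pos).sum().unsqueeze(0), z_neg @ c])
# InfoNCE: negative log-probability of picking the positive (index 0)
loss = -F.log_softmax(scores, dim=0)[0]
assert loss.item() > 0
```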
Hi, thank you again for sharing this code.
I found something that might be wrong in validation.py.
When doing validation, the GRU hidden state is initialized again; this might cause the validation loss in the log to be higher than it actually is. And since it initializes the GRU hidden state every epoch, I think it might slightly impair performance.
NVM, I mis-read the equation. You are right.
At line 310, you have the following code:
output, hidden = self.gru(forward_seq, hidden) # output size e.g. 8*100*256
c_t = output[:,t_samples,:].view(batch, 256) # c_t e.g. size 8*256
So you are using the second-to-last timestep as c_t? Since the last timestep should be output[:,t_samples+1,:], or simply hidden.
As far as I understand from the original paper, c_t should be the last timestep. Am I missing anything here?
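A small check of the indexing involved (a sketch with illustrative shapes; for a single-layer unidirectional GRU, the final output step equals the hidden state):

```python
import torch

torch.manual_seed(0)
batch, seq_len, hid = 8, 100, 256
gru = torch.nn.GRU(hid, hid, batch_first=True)
forward_seq = torch.randn(batch, seq_len, hid)

output, hidden = gru(forward_seq)  # output: (8, 100, 256), hidden: (1, 8, 256)
t_samples = 42  # illustrative sampled position

c_t = output[:, t_samples, :]  # context at (0-indexed) step t_samples
# the truly last step is output[:, -1, :], which for a single-layer
# unidirectional GRU is exactly the returned hidden state
assert torch.allclose(output[:, -1, :], hidden[0])
```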
I had some trouble understanding the implementation of the InfoNCE loss function. I don't understand how torch.diag() can represent the InfoNCE loss.
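In case it helps others: when the scores between every prediction and every encoding in a batch are arranged in a square matrix, the diagonal entries are exactly the positive pairs (example i scored against its own future), while the off-diagonal entries serve as negatives. A sketch of the idea (not a line-for-line copy of model.py):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch = 8
pred = torch.randn(batch, 256)  # predicted futures, W_k c_t, one per example
z = torch.randn(batch, 256)     # actual future encodings z_{t+k}

total = torch.mm(z, pred.t())   # (8, 8): total[i, j] = <z_i, pred_j>
log_probs = F.log_softmax(total, dim=1)
# diagonal = log-probability assigned to each positive pair
nce = -torch.diag(log_probs).mean()
assert nce.item() > 0
```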
Hello @jefflai108,
I think the accuracy should also have batch*self.timestep as its denominator.
Possibly for other models too, although I did not check them.
Hi, thanks for sharing your implementation of CPC. I've been trying to run it out of the box but am having issues shaping the input data correctly. Is there another script that encodes the wav file directories into .h5?
Dear Jeff,
Thank you so much for providing this great repository! Sincerely appreciate your great implementation!
However, after reading all the closed issues and trying to initialize training, I am still a bit confused about the training and test datasets. I tried to run run.sh and the following error was reported:
May I request what might be the possible solution of this? Thank you so much for your clarification!
Sincerely,
Martin
The paper does not mention the use of Batch Normalization in the case of the audio task.
In the case of the vision task, it mentions that "We did not use Batch-Norm [38]."
Thank you for sharing your code; I have run into a problem.
When we use CPC, the output is [128, 256], but the MFCCs are [frame, 39].
As a result, I wonder how to combine them into [frame, 39 + 256] dimensions.
Thanks again
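One possible way to combine them, assuming the 128 CPC frames and the MFCC frames cover the same stretch of audio (my assumption, not something the repo prescribes), is to upsample the CPC embeddings to the MFCC frame rate and concatenate:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_frames = 2048
mfcc = torch.randn(n_frames, 39)  # frame-level MFCC features
cpc = torch.randn(128, 256)       # downsampled CPC embeddings

# upsample CPC embeddings to the MFCC frame rate by linear interpolation
# (repeat_interleave would also work if the rates divide evenly)
cpc_up = F.interpolate(cpc.t().unsqueeze(0), size=n_frames,
                       mode='linear', align_corners=False)[0].t()
combined = torch.cat([mfcc, cpc_up], dim=1)  # (n_frames, 39 + 256)
assert combined.shape == (n_frames, 295)
```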
I don't understand how self.softmax() is updated during training, because the accuracy is not backpropagated. How does it work?
In model.py line 113 : output2, hidden1 = self.gru2(forward_seq, hidden1)
perhaps it should be : output2, hidden2 = self.gru2(forward_seq, hidden1)
?
Meaning, should that correct variable be +=?
@jefflai108 Are the negative samples drawn from other examples in the batch at the same timestep t?
I saw that list files such as "LibriSpeech/list/train.txt" are required parameters for main.py. It seems such files are not provided by LibriSpeech officially. What is their format? Could you provide them, or the script to generate them?
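In case it is just a list of utterance IDs (an assumption on my part; the format main.py actually expects may differ), something like this could generate it from an extracted LibriSpeech tree:

```python
# Hypothetical sketch: ASSUMES the list files contain one LibriSpeech
# utterance ID per line, derived from the .flac filenames.
from pathlib import Path

def write_utterance_list(corpus_root, list_path):
    """Collect utterance IDs from .flac filenames and write one per line."""
    ids = sorted(p.stem for p in Path(corpus_root).rglob("*.flac"))
    out = Path(list_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(ids) + "\n")
    return ids

# e.g. write_utterance_list("LibriSpeech/train-clean-100",
#                           "LibriSpeech/list/train.txt")
```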