
philipperemy / deep-speaker


Deep Speaker: an End-to-End Neural Speaker Embedding System.

License: MIT License

Shell 3.22% Python 96.78%
deep-learning deep-speaker keras tensorflow

deep-speaker's Introduction

Solving Artificial Intelligence one step at a time 👋

Are you an individual / company willing to invest in open source? Become a sponsor!

deep-speaker's People

Contributors

daniel-sc, linhdvu14, mshenron, paulo-raca, philipperemy


deep-speaker's Issues

AssertionError

Hello, when I run python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR, it gives me the following error:

$ python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-10-24 21:13:18,326 - INFO - audio_dir = /Users/obsidian/deep-speaker-data/VCTK-Corpus/
2019-10-24 21:13:18,326 - INFO - cache_dir = /Users/obsidian/deep-speaker-data/cache/
2019-10-24 21:13:18,326 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
  File "cli.py", line 83, in <module>
    main()
  File "cli.py", line 71, in main
    inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
  File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
    sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
  File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
    assert target_speaker in audio_reader.all_speaker_ids
AssertionError

How can I solve this problem?
Thank you!
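
For anyone hitting the same assertion: a quick way to see which speaker IDs the audio reader can actually find is to list the speaker folders on disk. This is a minimal sketch, assuming the standard VCTK layout <audio_dir>/wav48/<speaker_id>/ (paths are illustrative):

import os

audio_dir = '/Users/obsidian/deep-speaker-data/VCTK-Corpus/'
speaker_ids = sorted(os.listdir(os.path.join(audio_dir, 'wav48')))
print('p363' in speaker_ids, 'p364' in speaker_ids)  # the assertion fails if either is False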

Did you observe that the loss was stuck at the margin?

Hi, as mentioned in the title, I am also implementing Deep Speaker, using the same data you used. I am wondering whether your triplet loss was able to get smaller than the margin value? Mine seems to be stuck at the margin.
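
To illustrate why the loss can plateau at exactly the margin (a sketch, not the repo's exact loss implementation): if the network collapses and outputs the same embedding for every utterance, the anchor-positive and anchor-negative similarities are identical and the hinge loss reduces to the margin.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    # cosine-similarity triplet loss, illustrative only
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)

e = np.ones(512)              # a collapsed embedding, identical for every input
print(triplet_loss(e, e, e))  # 0.1 == margin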

GPU usage rate

When I run the train_triplet phase of this project on Ubuntu 18.04 with a Tesla GPU, nvidia-smi shows that GPU memory usage is only 305 MiB. I wonder how to set the GPU usage options in the code. Thanks.
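
Not a setting taken from this repo, but with the TF 1.x / Keras backend of that era, GPU memory behaviour is usually controlled through the session config. A minimal sketch, assuming TensorFlow 1.x:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # grab memory only as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.9  # or reserve a fixed fraction
K.set_session(tf.Session(config=config))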

How to Test data?

How can I test data using the files on the Improvements branch?

Training loss = NaN in deep speaker!

While training the model, during the 231st batch the training loss was 0.10000002384185791, but from the next batch onwards the training loss was NaN. What could be the possible reason for this, and what is the fix? It doesn't look like simple divergence, since the loss jumped abruptly from 0.1 to NaN. Some clarification, please.
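
One common mitigation for a sudden jump from a finite loss to NaN (an assumption, not a confirmed fix for this repo) is to lower the learning rate and clip gradient norms in the optimizer:

from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-4, clipnorm=1.0)  # smaller LR + gradient-norm clipping
# model.compile(optimizer=optimizer, loss=deep_speaker_loss)  # `model` and the loss are placeholders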

How can I make a prediction for one audio file?

From your source code, it seems the file to predict should also be packed into batches of the same size as during training. What if I have just one file and want to compute the vector for it alone? How should that be done? Thank you.
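
A sketch under assumptions: the network only constrains the per-sample input shape (the constants below are placeholders, not taken from the repo), so a single utterance can be wrapped in a batch of size 1 rather than a full training batch.

import numpy as np

NUM_FRAMES, NUM_FBANKS = 160, 64                      # assumed input-shape constants
features = np.random.rand(NUM_FRAMES, NUM_FBANKS, 1)  # stand-in for one utterance's features
batch = np.expand_dims(features, axis=0)              # shape (1, NUM_FRAMES, NUM_FBANKS, 1)
# embedding = model.predict(batch)[0]                 # `model` is the trained network (placeholder)
print(batch.shape)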

Performance of pretrained model

I'm working on testing the pretrained model on some public datasets, but the code requires putting every piece of audio into a cache, so I have to modify it to read directly from the audio files. Before making that modification, I'd like to know the performance of the pretrained model. If it is poor, I would rather train from scratch before testing, or maybe switch to another approach.

Thanks.

Silence / Background Noise similarity

I've been having fun playing with your pre-trained model and implementation!

I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract the features from it, say silent_features, it has a strong cosine similarity to anything. I was wondering whether, if you trained the model with various background noises / silence in the train set and labelled them all silent_features, it would learn to predict the various silent_features and distinguish them from voices.

GPU utilization low

So I decided to go through your code, which I have been enjoying since yesterday! My intention is to help improve it and align it with the original paper.
I noticed that the utilization of my GPU is quite low, close to 0%, whereas my memory peaks at 8 GB and I get all the relevant GPU messages from the TensorFlow backend.
I also noticed your comments that the code is not GPU efficient.
As you are more comfortable with the code, do you think this is because, when training each "epoch", you don't really use batches of positive and negative speakers?
Any ideas how to fix this?

Thanks!

pb

How can I extract an embedding from a wav?

I want to get the embedding from a raw wav file whose speaker has been trained. If I remove the cache, run a cache update, and then compute the embedding, the result is different from directly running the get_embeddings command. So which one is right? And when getting an embedding from a raw wav that was not trained on, does the cache file need to be deleted every time?
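
For reference, a minimal sketch of going straight from a wav file to normalized filter-bank features, assuming librosa and python_speech_features are installed (the constants and the per-frame normalization mirror the repo's audio code but are not copied from it):

import librosa
import numpy as np
from python_speech_features import fbank

def wav_to_features(path, sample_rate=16000, num_fbanks=64):
    signal, _ = librosa.load(path, sr=sample_rate, mono=True)
    filter_banks, _ = fbank(signal, samplerate=sample_rate, nfilt=num_fbanks)
    # zero mean / unit variance within each frame, as in normalize_frames()
    return np.array([(v - np.mean(v)) / max(np.std(v), 1e-12) for v in filter_banks])

# features = wav_to_features('some_speaker.wav')
# embedding = model.predict(features[None, :160, :, None])  # `model` is a placeholder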

how to use gpu

I have installed tensorflow-gpu, but when I run './deepspeaker train_softmax', I find it only uses the CPU.
I don't know why. Can you give me some pointers?

$ pip list | grep tensorflow
tensorflow-estimator 2.2.0
tensorflow-gpu 2.2.0
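
A quick sanity check with TF 2.x (which the pip list above shows): if the list below is empty, TensorFlow cannot see the GPU at all, which usually points to a CUDA/cuDNN driver mismatch rather than anything in this repo.

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # empty list => TF falls back to the CPU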

The test results

Hello, after training, I used the checkpoint with the best training performance. Following the test example in readme.md, I changed it slightly and tested 100 speakers outside the training set for cross-validation. The resulting EER reached ~23%! Do you have any thoughts on this? Is the model not suitable for testing on data outside the training set?
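
For comparison purposes, this is one standard way to compute EER from cosine scores and same/different-speaker labels (a hedged sketch using scipy and sklearn, not the repo's evaluation code):

import numpy as np
from scipy.interpolate import interp1d
from scipy.optimize import brentq
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 = same speaker, 0 = different; scores: higher = more similar
    fpr, tpr, _ = roc_curve(labels, scores)
    return brentq(lambda x: 1.0 - x - interp1d(fpr, tpr)(x), 0.0, 1.0)

labels = np.array([1, 1, 1, 0, 0, 0])            # toy trial labels
scores = np.array([0.9, 0.6, 0.4, 0.5, 0.3, 0.2])  # toy cosine scores
print(compute_eer(labels, scores))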

counts_per_speaker

Could you tell me about --counts_per_speaker? I want to train on my own dataset, but I don't know how to use it.

Speed up the speaker verification somehow? (Feature extraction mainly)

Is there any way to compute the features faster? It takes around 15 minutes per speaker, or even more. Am I doing anything wrong? I adapted the code so that I first build the cache, then compute the features, save them to a file, and only afterwards calculate the cosine similarity. But saving the features takes nearly 20 minutes per cache on an i7-7700.
It looks like the model learns during the feature extraction process. I think the file should just be fed through the network without learning, with the features extracted near the final layer. Is it different here?
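
Not an answer to why extraction is slow, but a hedged sketch of the kind of one-shot per-file caching described above (the helper name and cache layout are illustrative, not from the repo):

import os
import numpy as np

def cached_features(path, compute_fn, cache_dir='feature_cache'):
    # Compute features once per wav, then reuse the saved .npy on later runs.
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, os.path.basename(path) + '.npy')
    if os.path.exists(cache_file):
        return np.load(cache_file)
    features = compute_fn(path)
    np.save(cache_file, features)
    return features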

Length of audio

Hi,

I'm trying to use your model to create a real-time voice identification system.

Correct me if I'm wrong, but when you convert the audio into MFCCs, you use the whole audio to construct the MFCC and then randomly sample a window of NUM_FRAMES.

I'm now investigating which sample size I should pass into the mfcc fbank conversion. I haven't done extensive testing yet, but on an initial trial, 50,000 frames passed to the fbank() function works well.
That figure was pretty much a shot in the dark.

Would you have any advice as to the minimum required audio length?
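
To make the sampling step described above concrete, here is an illustrative random-crop sketch (NUM_FRAMES is an assumed constant, roughly 1.6 s of audio at a 10 ms frame step):

import numpy as np

NUM_FRAMES = 160  # assumed window length in frames

def random_crop(features, num_frames=NUM_FRAMES):
    # features: (total_frames, num_fbanks); pick a random contiguous window
    start = np.random.randint(0, max(1, features.shape[0] - num_frames + 1))
    return features[start:start + num_frames]

print(random_crop(np.zeros((500, 64))).shape)  # (160, 64)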

Tensorflow and keras version

Which version of tensorflow and keras are you using?
I am getting this error while executing your code:
Traceback (most recent call last):
  File "train_cli.py", line 245, in <module>
    start_training()
  File "train_cli.py", line 207, in start_training
    kx_train, ky_train, kx_test, ky_test, categorical_speakers = data_to_keras(data)
  File "/home/deepankar/deep-speaker/utils.py", line 15, in data_to_keras
    categorical_speakers = SpeakersToCategorical(data)
  File "/home/deepankar/deep-speaker/utils.py", line 169, in __init__
    self.speaker_categories = to_categorical(self.int_speaker_ids, num_classes=len(self.speaker_ids), dtype='float32')
  File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/keras/utils/np_utils.py", line 31, in to_categorical
    num_classes = np.max(y) + 1
  File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
    initial=initial)
  File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

wav from the same speaker

When I test with two different wavs from the same unseen speaker, the result is almost equal to the result produced by two different unseen speakers.
Can you tell me why?

reference

Hello, I ran this project with my own datasets, but did not get good results.
For the same speaker (same wav that was trained on), the cosine = 0.0.
For the same speaker (different wav), the cosine = x*e-06 or x*e-07.
For different speakers, the cosine = x*e-05.
There does not seem to be much difference between them; can you give me some suggestions?

FileNotFoundError

[Errno 2] No such file or directory: '$CACHE_DIR\audio_cache_pkl\VCTK Corpus\wav48\p240\p240_034_cache.pkl'

Trained model produces near-identical outputs for different inputs?

As the title says. I noticed the trained model produces near-identical embeddings for different input audios (even randomly generated values). Is anyone else having this issue?

If it helps, I'm training with

  • LibriSpeech dev-clean dataset: http://www.openslr.org/12/
  • SAMPLE_RATE = 16000
  • BATCH_NUM_TRIPLETS = 21
  • TRUNCATE_SOUND_FIRST_SECONDS = 3
  • Loss = 0.10225 when I stopped training

I've tried a few different datasets/settings and had the same results.

Probably related to #9. Somehow the model decides the best way to go is to make all embeddings the same, hence loss = margin...

cli.py error

python cli.py --unseen_speakers p363,p363 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-03-08 17:32:52,348 - INFO - audio_dir = /home/deep-speaker-data/VCTK-Corpus/
2019-03-08 17:32:52,348 - INFO - cache_dir = /home/deep-speaker-data/cache/
2019-03-08 17:32:52,348 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
  File "cli.py", line 83, in <module>
    main()
  File "cli.py", line 71, in main
    inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
  File "/home/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
    sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
  File "/home/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
    assert target_speaker in audio_reader.all_speaker_ids
AssertionError

missing pkl files for testing

Hi @philipperemy ,

Thank you for the implementation. Can you help us with the files speaker-change-detection-norm.pkl and speaker-change-detection-categorical_speakers.pkl? I couldn't find these resources in the repo. Also, can you explain how the trained model can be used for inference?

Cheers!

Where is the CNN?

I have checked the paper and the code, and I realized that there is no implementation of the convolutional neural network stack, such as ResNet, in the code.

Problem when using the model for prediction

Hello,

I am having problems using a trained model to perform prediction. I was able to train using your code, but I'm having trouble with the shape of the input to feed to model.predict.

Can someone help me?

Question about the normalization step when preparing MFCC features.

Hi Philip,

I have two questions about the way you prepare the MFCC features (audio.py).

  1. In the following code, you normalize the raw Fbank features (num_frames, nfilt).
 filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=NUM_FBANKS)
 frames_features = normalize_frames(filter_banks)
...
def normalize_frames(m, epsilon=1e-12):
    return [(v - np.mean(v)) / max(np.std(v), epsilon) for v in m]

But I think your code normalizes the 26-dimensional Fbank features within just each frame. That is actually instance normalization, not the commonly used batch normalization. If we want to normalize data at the batch (or whole-training-data) level, I believe we should do something like:

return [(v - np.mean(m, axis=0)) / np.std(m, axis=0) for v in m]

or

from sklearn.preprocessing import StandardScaler
s1 = StandardScaler()
return s1.fit_transform(m)

Could you please explain why you normalize data in this way?

  2. Could you explain why the function is named read_mfcc() when you actually use the Fbank features, which lack the DCT step of MFCC features?

Thank you!
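
To make the distinction discussed above concrete, a small sketch contrasting the two normalizations (function names are illustrative; only normalize_per_frame matches what audio.py does):

import numpy as np

def normalize_per_frame(m, epsilon=1e-12):
    # what the repo does: zero mean / unit variance within each frame
    return np.array([(v - np.mean(v)) / max(np.std(v), epsilon) for v in m])

def normalize_per_utterance(m, epsilon=1e-12):
    # mean/variance over the time axis, per feature dimension (CMVN-style)
    return (m - np.mean(m, axis=0)) / np.maximum(np.std(m, axis=0), epsilon)

m = np.random.rand(200, 26)  # (num_frames, nfilt)
print(normalize_per_frame(m).shape, normalize_per_utterance(m).shape)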

devector

If we change the embeddings produced by softmax into d-vector embeddings created by pytorch-speaker-verification, can it work?

Adaptation to new language

Hi,
Thanks for such great work.
I wonder if this model (trained on English) can perform well on a different language?

speaker embedding

@philipperemy I want to know why the cosine similarity is so close when I run python cli.py --get_embeddings xxx --cache_output_dir $CACHE_DIR --audio_dir $AUDIO_DIR to get the embeddings of two different speakers. It is ~0.97, and I don't understand how to solve it.
