speaker-identification / you-only-speak-once

Deep Learning - one shot learning for speaker recognition using Filter Banks

Topics: triplet-loss, siamese-networks, speaker-recognition, voice-authentication, neural-network, one-shot-learning, audio, speech, deep-speaker, speaker-identification, deep-learning

you-only-speak-once's Introduction

You Only Speak Once

One Shot Learning using Siamese Network for Speaker Recognition

Introduction

Biometric authentication systems typically rely on some form of visual or physical input, such as fingerprints or Face ID. Devices like Alexa, however, receive only audio input, which can be leveraged for authentication. Our project aims to develop a one-shot-learning-based voice authentication system using a Siamese network. We intend to build a neural speaker embedding system that maps utterances to a hyperspace in which speaker similarity is measured by cosine similarity.

Related Work

One-shot learning has been widely used for face recognition but has not been as thoroughly explored for voice authentication, which is where our experiments add value. Our work is largely motivated by and aggregated from the following research:

  • DeepFace, which uses Siamese networks to compute embeddings for image classification
  • FaceNet, which introduces the triplet loss, optimizing the distance between similar and dissimilar embeddings
  • Deep Speaker, which combines the aforementioned techniques for speaker identification, using filter banks as inputs to a ResNet

Experimental Setup

We used a subset of the LibriSpeech corpus comprising 100 hours of clean English speech recorded at 16 kHz from 250 unique speakers. We tested two different network designs for a voice authentication system.

SpectrogramNet:

Feature Extraction:

We split the audio samples into frames of 5 seconds with a stride of 4 seconds. From each frame we extracted a spectrogram of dimensions 227 x 227 x 1, which served as the input to our neural network. We then split the dataset into a train set of 200 speakers and a test set of 50 speakers, with each speaker represented by roughly 250 spectrograms.
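As a rough illustration of this framing step, here is a minimal Python sketch assuming librosa for loading and the STFT; the STFT parameters and the resize to 227 x 227 are assumptions for illustration, not values taken from the repository's notebooks.

import librosa
import numpy as np

def extract_spectrograms(path, frame_sec=5, stride_sec=4, sr=16000):
    """Split an utterance into 5 s frames (4 s stride) and return one
    log-magnitude spectrogram per frame.

    NOTE: n_fft/hop_length and the downstream resize to 227 x 227 x 1
    are illustrative assumptions, not the repo's exact settings.
    """
    y, _ = librosa.load(path, sr=sr)
    frame_len, stride = frame_sec * sr, stride_sec * sr
    specs = []
    for start in range(0, len(y) - frame_len + 1, stride):
        chunk = y[start:start + frame_len]
        S = np.abs(librosa.stft(chunk, n_fft=512, hop_length=352))
        S = librosa.amplitude_to_db(S, ref=np.max)
        specs.append(S[..., np.newaxis])  # (freq, time, 1); resize to 227 x 227 x 1 downstream
    return specs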

[Figure: Siamese network architecture used in SpectrogramNet]

Model Training:

We trained a Siamese network (shown in the figure above) consisting of blocks of Convolution2D, ReflectionPad2D, and Batch Normalization, followed by a fully connected (FC) layer. We then took the absolute difference of the outputs of the two identical branches and passed it through another FC layer with a 2-dimensional output, the two labels signifying a match and no match, trained with cross-entropy loss. When we first trained this network with Contrastive Loss, we had to define our own distance threshold to distinguish matching from non-matching audio inputs in order to report accuracy, the standard metric for such classification tasks. Switching to cross-entropy removed that dependency and gave a direct measure of accuracy. Empirically, ReflectionPad2D also outperformed both no padding and simple zero padding.
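A condensed PyTorch sketch of this pairing scheme follows. The number of blocks, channel counts, kernel sizes, and embedding dimension are placeholders (the repo's notebooks hold the actual hyper-parameters); what the sketch takes from the description above is the ReflectionPad2d -> Conv2d -> BatchNorm2d blocks, the absolute difference of the twin outputs, and the 2-way cross-entropy head.

import torch
import torch.nn as nn

class Branch(nn.Module):
    """One twin: ReflectionPad2d -> Conv2d -> BatchNorm2d blocks, then an FC embedding."""
    def __init__(self, emb_dim=128):  # emb_dim is an assumed placeholder
        super().__init__()
        self.features = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(1, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.ReflectionPad2d(1), nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.fc = nn.Linear(32 * 8 * 8, emb_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class SiameseClassifier(nn.Module):
    """Shared branch; |emb1 - emb2| -> FC -> 2 logits (match / no match)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.branch = Branch(emb_dim)
        self.head = nn.Linear(emb_dim, 2)

    def forward(self, x1, x2):
        diff = torch.abs(self.branch(x1) - self.branch(x2))
        return self.head(diff)  # train with nn.CrossEntropyLoss against {0, 1} labels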

Results: The final version of this network used cross-entropy loss; the train and test accuracies over 50 epochs are plotted in the adjacent figure. The first training iteration showed a large gap between train and test accuracy, warranting the introduction of dropout layers. Dropout brought the two curves closer together, although it lowered both accuracies.

[Figure: SpectrogramNet train and test accuracy over 50 epochs]

FBankNet:

Feature Extraction:

We split each audio file into frames of 25 ms with a stride of 10 ms. Given the small frame width, the signal can be assumed to be stationary within a frame and can therefore be accurately transformed to the frequency domain. We computed the first 64 FBank coefficients for each frame and grouped 64 consecutive frames to form training samples of size 64 x 64 x 1 as inputs to Conv2D. This produced more than half a million samples in total, which were split 95% / 5% into train and test sets.
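A rough sketch of this feature pipeline, using librosa's mel filterbank as a stand-in for the repo's FBank computation; the window, hop, and coefficient counts mirror the numbers quoted above, but the exact implementation in the repo may differ.

import librosa
import numpy as np

def extract_fbank_samples(path, sr=16000, n_mels=64, frames_per_sample=64):
    """25 ms windows with a 10 ms hop -> 64 log-mel (FBank) coefficients per frame,
    grouped into non-overlapping 64 x 64 x 1 training samples."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=n_mels)
    fbanks = librosa.power_to_db(mel).T            # (num_frames, 64)
    n = len(fbanks) // frames_per_sample
    samples = fbanks[:n * frames_per_sample].reshape(n, frames_per_sample, n_mels)
    return samples[..., np.newaxis]                # (n, 64, 64, 1)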

[Figure: FBankNet architecture, a CNN with residual blocks]

Model Training:

We used a CNN with residual blocks (shown in the figure above) comprising 20,890 parameters. Training was done in two stages: multi-class classification with cross-entropy loss, followed by fine-tuning with Triplet Loss. The network with a triplet loss layer expects three inputs: a random sample from the dataset called the anchor, a sample from the same class as the anchor called the positive, and a sample from a class other than the anchor's, called the negative. Mathematically, the loss is defined as

max(d(a, p) - d(a, n) + margin, 0)

where d is the cosine distance.
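A small PyTorch sketch of this loss under the cosine-distance convention above; the default margin of 0.2 matches the fine-tuning setting reported below, and the variable names are illustrative.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0), with d = 1 - cosine similarity."""
    d_ap = 1 - F.cosine_similarity(anchor, positive)  # anchor <-> positive distance
    d_an = 1 - F.cosine_similarity(anchor, negative)  # anchor <-> negative distance
    return torch.clamp(d_ap - d_an + margin, min=0).mean()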

[Table: pre-training learning-rate schedule]

Pre-training was performed with a batch size of 32, using the varying learning rates tabulated above.

We used these pre-trained weights to initialize fine-tuning with triplet loss, using a margin of 0.2, and fine-tuned for 20 epochs with a learning rate of 0.0005. To compare any two samples, we needed a distance threshold below which the samples would be considered similar. During training we recorded the anchor-positive (AP) and anchor-negative (AN) distances over all samples and observed that they were roughly Gaussian:

d(AP) ~ N(0.39, 0.02)
d(AN) ~ N(0.93, 0.04)

We chose μAP + 3σAP to be a safe threshold, as it was far enough from μAN.
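In code, the decision rule reduces to one scalar computed from the recorded distances; a minimal sketch, assuming the AP and AN distances have been collected into arrays (names here are illustrative, not from the repo):

import numpy as np

def pick_threshold(ap_distances, an_distances, k=3):
    """Return mu_AP + k * sigma_AP; pairs with distance below this are declared a match."""
    mu_ap, sigma_ap = np.mean(ap_distances), np.std(ap_distances)
    threshold = mu_ap + k * sigma_ap
    assert threshold < np.mean(an_distances), "threshold should stay well below mu_AN"
    return threshold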

Results:

We used two metrics to measure accuracy:

Positive accuracy = TP / (TP + FN)
Negative accuracy = TN / (TN + FP)
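These two quantities are the true-positive rate (recall) and the true-negative rate (specificity). A small sketch computing them from binary same-speaker/different-speaker labels (an illustrative helper, not part of the repo):

import numpy as np

def positive_negative_accuracy(y_true, y_pred):
    """y_true / y_pred: 1 = same speaker (match), 0 = different speaker."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)  # positive accuracy, negative accuracy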

Since we are building an authentication system, we are more concerned about false positives than false negatives, to ensure that impostors are not granted access. Hence our model is optimized to maximize negative accuracy. After fine-tuning for 20 epochs, the model achieved ~97% negative accuracy and ~74% positive accuracy (see adjacent figure).

[Figure: FBankNet positive and negative accuracy during fine-tuning]

Conclusion

We successfully trained two distinct Siamese networks for one-shot learning. These networks were able to identify new users with fairly high accuracy, which we demonstrated in demos with multiple users of varying accents and speaking styles, in a rather noisy environment during the poster session. We also leveraged multiple loss functions: triplet, contrastive, and cross-entropy (softmax). However, our training data could be augmented with background noise to make the model more robust. We also aim to train deeper networks on larger datasets, with various combinations of channels in the convolutional layers, and to run a more thorough hyper-parameter search.

Demo

The demo can be run using Docker:

$ docker build -t yoso .
$ docker run --rm -p 5000:5000 yoso

Navigate to http://localhost:5000 and follow the instructions.

you-only-speak-once's People

Contributors

iamlost127, sinarj, vivekkr12


you-only-speak-once's Issues

Paper?

Is this project based on a paper?

Share Jupyter notebook for fbank_net

The title is self-explanatory! Could you also share the notebook you used to calibrate fbank_net? It seems (as per your notes) that it performs better.

Thanks
pb

How to focus on positive distances?

Thanks for the repo!

I have a question: you said that "Hence our model is optimized to maximize negative accuracy."
Can you explain how to maximize positive accuracy in this code?

Pre-trained model and the data

Hey,
Thanks for the great work and for sharing it.

Could I please have the pre-trained models, the data, and information on how to use them for my purposes?
I do not have enough hardware to train a model like this.

ModuleNotFoundError: No module named 'numba.decorators'

sudo docker run --rm -p 5000:5000 yoso
 * Serving Flask app "demo/app.py"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
Usage: flask run [OPTIONS]
Try 'flask run --help' for help.

Error: While importing "fbank_net.demo.app", an ImportError was raised:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/cli.py", line 240, in locate_app
    __import__(module_name)
  File "/fbank_net/demo/app.py", line 8, in <module>
    from .preprocessing import extract_fbanks
  File "/fbank_net/demo/preprocessing.py", line 1, in <module>
    import librosa
  File "/usr/local/lib/python3.6/site-packages/librosa/__init__.py", line 12, in <module>
    from . import core
  File "/usr/local/lib/python3.6/site-packages/librosa/core/__init__.py", line 125, in <module>
    from .time_frequency import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.6/site-packages/librosa/core/time_frequency.py", line 11, in <module>
    from ..util.exceptions import ParameterError
  File "/usr/local/lib/python3.6/site-packages/librosa/util/__init__.py", line 77, in <module>
    from .utils import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.6/site-packages/librosa/util/utils.py", line 15, in <module>
    from .decorators import deprecated
  File "/usr/local/lib/python3.6/site-packages/librosa/util/decorators.py", line 9, in <module>
    from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'

Demo not accepting audio

Hi, great work, and I'm looking forward to testing this.
I tried running the demo through Docker and also on my machine, but app.js doesn't seem to trigger microphone input in the browser (Safari or Firefox).
Thanks!
