This project was forked from jain-abhinav02/voicefilter.

Voice Filter

This is a Tensorflow/Keras implementation of Google AI VoiceFilter.

Our work is inspired by the academic paper: https://arxiv.org/abs/1810.04826

The implementation is based on: https://github.com/mindslab-ai/voicefilter


Team Members

  1. Angshuman Saikia

  2. Abhinav Jain

  3. Yashwardhan Gautam


Introduction

We intend to improve the accuracy of automatic speech recognition (ASR) by separating the speech of the primary speaker from a mixed signal. This project has many applications in chatbots, voice assistants, and video conferencing.


Who is our primary speaker?

All users of a service will have to record their voice print during enrolment. The voice print associated with the account is used to identify the primary speaker.

How is the voice print recorded?

An audio clip is processed by a separately trained deep neural network to generate a speaker-discriminative embedding. As a result, each speaker is represented by a vector of length 256, known as a d-vector.
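The speaker encoder's architecture is out of scope here, but the enrolment step reduces to something like the following sketch. The `embedder` callable is hypothetical, standing in for the separately trained network; the L2 normalization is the usual convention for d-vectors.

```python
import numpy as np

def d_vector(embedder, reference_mel):
    """Compute a speaker-discriminative d-vector from an enrolment clip.

    embedder      -- hypothetical pre-trained speaker encoder (callable)
    reference_mel -- mel-spectrogram of the reference/enrolment audio
    """
    emb = np.asarray(embedder(reference_mel), dtype=np.float32)  # shape (256,)
    return emb / np.linalg.norm(emb)  # d-vectors are typically L2-normalized
```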


How is the dataset prepared?

We use the publicly available LibriSpeech dataset. To create each sample:

  1. Select a primary and a secondary speaker at random.

  2. For the primary speaker, select one random speech as the reference and another as the input; select one random speech of the secondary speaker.

  3. Mix the input speeches of the primary and secondary speakers; this mixture serves as one of the inputs.

  4. Pass the reference speech through a pre-trained model ( Source: https://github.com/mindslab-ai/voicefilter ) to create an embedding, which serves as the other input.

  5. The target output is the clean input speech of the primary speaker.

The speeches are not used directly. Instead, they are converted into magnitude spectrograms before being fed into the deep neural network. We used Python's librosa library for all audio-related processing.

We created a dataset of 29351 samples, divided into 8 parts for ease of use with limited RAM. Link to the Kaggle dataset: https://www.kaggle.com/abhinavjain02/speech-separation


Stats on Prepared Data

It took around 11 hours to prepare the dataset on Google Colab. The code is present in the dataset folder.

Note: All ordered pairs of primary and secondary speakers are unique.

Stat / Dataset                                            Train          Dev            Test
Total no. of unique speeches available in LibriSpeech     28539          2703           2620
No. of unique speeches used                               26869          1878           1838
Percentage of total speeches used                         94.15 %        69.48 %        70.15 %
Total no. of samples prepared                             29351          934            964
No. of samples with same primary and reference speech     376 (1.28 %)   10 (1.07 %)    11 (1.14 %)

Proposed System Architecture


Requirements

  • This code was tested on Python 3.6.9 with Google Colab.

    Other packages can be installed by:

    pip install -r requirements.txt
    

Model


The model architecture precisely follows the academic paper mentioned above. The model takes an input (mixed) spectrogram and a d-vector (embedding) as inputs and produces a soft mask which, when applied to the input spectrogram, produces the output spectrogram. The output spectrogram is combined with the phase of the mixed input to re-create the primary speaker's audio from the mixed input speech.

Loss Function              Optimizer   Metric
Mean Squared Error (MSE)   Adam        Source-to-Distortion Ratio (SDR)


Training

  • The model was trained on Google Colab for 30 epochs.
  • Training took about 37 hours on NVIDIA Tesla P100 GPU.

Results

  • Loss

  • Validation SDR

  • Test

Note: The following results are based on the model weights after the 29th epoch (peak SDR on validation).

Loss     SDR
0.0104   5.3250

Audio Samples


Key learnings:

  • Processing audio data using librosa
  • Creating flexible architectures using the Keras functional API
  • Using custom generators in Keras
  • Using custom callbacks in Keras
  • Multiprocessing in Python

App Snippet
