
cone-of-silence's Introduction

The Cone of Silence: Speech Separation by Localization


Authors

Teerapat Jenrungrot*, Vivek Jayaram*, Steve Seitz, and Ira Kemelmacher-Shlizerman
*Co-First Authors
University of Washington

Video and audio demos are available at the project page

34th Conference on Neural Information Processing Systems, NeurIPS 2020. (Oral)

Blog Post - Coming Soon

Summary

Our method performs source separation and localization for human speakers. Key features include handling an arbitrary number of speakers and moving speakers with a single network. This code allows you to run and evaluate our method on synthetically rendered data. If you have a multi-microphone array, you can also obtain real results like the ones in our demo video.

Getting Started

Clone the repository:

git clone https://github.com/vivjay30/Cone-of-Silence.git
cd Cone-of-Silence
export PYTHONPATH=$PYTHONPATH:`pwd`

Make sure all the requirements in requirements.txt are installed. We tested the code with torch 1.3.0, librosa 0.7.0, and CUDA 10.0.

Download Pretrained Models: Here. If you're working in a command-line environment, we recommend using gdown to download the checkpoint files.

cd checkpoints 
gdown --id 1OcLxp0s_TN78iKaFrLAqjIoTKeOTUgKw  # Download realdata_4mics_.03231m_44100kHz.pt
gdown --id 18dpUnng_8ZUlDrQsg5VymypFnFlQBPIp  # Download synthetic_6mics_.0725m_44100kHz.pt

Quickstart: Running on Real Data

You can easily produce results like those in our demo videos. Our pre-trained real models work with the 4-mic Seeed ReSpeaker Mic Array v2.0. We even provide a sample 4-channel file for you to run here. When you capture your own data, it must be an M-channel recording, where M is the number of microphones (4 for this array). Run the full command as shown below. For moving sources, reduce the duration flag to 1.5 and add --moving to stop the search at a coarse window.

python cos/inference/separation_by_localization.py \
    /path/to/model.pt \
    /path/to/input_file.wav \
    outputs/some_dirname/ \
    --n_channels 4 \
    --sr 44100 \
    --mic_radius .03231 \
    --use_cuda
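
Before running inference, it can help to confirm that a recording matches the channel count and sample rate passed on the command line. The following is a minimal sketch, not part of the repository: the helper name check_recording is hypothetical, and it relies on Python's standard wave module, which reads uncompressed PCM WAV files only.

```python
import wave


def check_recording(path, expected_channels=4, expected_sr=44100):
    """Hypothetical helper: compare a WAV header against the expected layout.

    Returns (ok, message), where message lists any mismatches or is "ok".
    """
    with wave.open(path, "rb") as wav_file:
        n_channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()

    problems = []
    if n_channels != expected_channels:
        problems.append(
            f"expected {expected_channels} channels, found {n_channels}"
        )
    if sample_rate != expected_sr:
        problems.append(
            f"expected {expected_sr} Hz, found {sample_rate} Hz"
        )
    return (not problems, "; ".join(problems) or "ok")
```

For example, check_recording("/path/to/input_file.wav") should return (True, "ok") for a 4-channel, 44.1 kHz PCM file, matching the --n_channels 4 and --sr 44100 flags above.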

Rendering Synthetic Spatial Data

For training and evaluation, we use synthetically rendered spatial data. We place the voices in a virtual room and render the arrival times, level differences, and reverb using pyroomacoustics. We used the VCTK dataset, but any voice dataset would work. An example command is below:

python cos/generate_dataset.py \
    /path/to/VCTK/data \
    ./outputs/somename \
    --input_background_path any_bg_audio.wav \
    --n_voices 2 \
    --n_outputs 1000 \
    --mic_radius {radius} \
    --n_mics {M}
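
To give a feel for what --mic_radius and --n_mics imply, the sketch below places M microphones evenly on a circle and computes far-field arrival-time offsets for a source direction. It is an illustration under stated assumptions, not the repository's rendering code (which uses pyroomacoustics): the function names, the choice of putting microphone 0 on the x-axis, the plane-wave model, and the 343 m/s speed of sound are all assumptions made here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def circular_mic_positions(n_mics, radius):
    """Evenly spaced (x, y) positions on a circle; mic 0 sits on the x-axis."""
    return [
        (radius * math.cos(2 * math.pi * i / n_mics),
         radius * math.sin(2 * math.pi * i / n_mics))
        for i in range(n_mics)
    ]


def far_field_delays(mic_positions, source_angle, sr=44100):
    """Arrival delay in samples at each mic, relative to the array center.

    Assumes a plane wave arriving from source_angle (radians). A negative
    value means the wavefront reaches that mic before the center.
    """
    ux, uy = math.cos(source_angle), math.sin(source_angle)
    return [
        -(x * ux + y * uy) / SPEED_OF_SOUND * sr
        for x, y in mic_positions
    ]


# Example: 6 mics on a 7.25 cm circle, source straight along the x-axis.
positions = circular_mic_positions(6, 0.0725)
delays = far_field_delays(positions, source_angle=0.0)
```

With this geometry the largest inter-microphone delay is only a few samples at 44.1 kHz, which is why the sample rate and the radius must match the values the model was trained with.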

Training on Synthetic Data

Below is an example command to train on the rendered data. You need to replace the training and testing dirs with the path to the generated datasets from above. We highly recommend initializing with a pre-trained model (even if the number of mics is different) and not training from scratch.

python cos/training/train.py \
   ./generated/train_dir \
   ./generated/test_dir \
   --name experiment_name \
   --checkpoints_dir ./checkpoints \
   --pretrain_path ./path/to/pretrained.pt \
   --batch_size 8 \
   --mic_radius {radius} \
   --n_mics {M} \
   --use_cuda

Note: The training code expects you to have sox installed. The easiest way to install it is with conda: conda install -c conda-forge -y sox.
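
A quick way to check for the sox dependency before starting a long training run is to look for the executable on PATH. This is a small sketch using only the standard library; the function name sox_available is hypothetical and is not part of the training code.

```python
import shutil


def sox_available():
    """Return True if a `sox` executable can be found on PATH."""
    return shutil.which("sox") is not None


if not sox_available():
    print("sox not found on PATH; install it before running train.py")
```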

Training on Real Data

For those looking to improve on the pretrained models, we recommend gathering a lot more real data. We did not have the ability to gather very accurately positioned real data in a proper sound chamber. By training with a lot more real data, the results will almost certainly improve. All you have to do is create synthetic composites of speakers in the same format as the synthetic data, and run the same training script.

Evaluation

For the synthetic data and evaluation, we use a setup of 6 mics in a circle of radius 7.25 cm. The following instructions produce results on mixtures of N voices with no background. First, generate a synthetic dataset from the VCTK test set with the microphone setup specified above and --n_voices 8. Then run the following script:

python cos/inference/evaluate_synthetic.py \
    /path/to/rendered_data/ \
    /path/to/model.pt \
    --n_channels 6 \
    --mic_radius .0725 \
    --sr 44100 \
    --use_cuda \
    --n_workers 1 \
    --n_voices {N}

Add --prec_recall separately to get the precision and recall.

Number of Speakers N          2      3      4      5      6      7      8
Median SI-SDRi (dB)           13.9   13.2   12.2   10.8   9.1    7.2    6.3
Median Angular Error (deg)    2.0    2.3    2.7    3.5    4.4    5.2    6.3
Precision                     0.947  0.936  0.897  0.912  0.932  0.936  0.966
Recall                        0.979  0.972  0.915  0.898  0.859  0.825  0.785
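
For readers unfamiliar with the SI-SDRi metric reported above, here is a minimal sketch of the standard scale-invariant SDR definition and its improvement over the input mixture. It is written in plain Python for clarity and is not the repository's evaluation code, which may differ in details; the function names and the small eps guard are assumptions made here.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    The estimate is split into a scaled copy of the reference (the target)
    and a residual; SI-SDR is the energy ratio between the two.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated output over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the target is rescaled to best match the estimate, multiplying the estimate by any nonzero constant leaves the score unchanged.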

cone-of-silence's People

Contributors

mjenrungrot, vivjay30


cone-of-silence's Issues

The effect of model size on the overall performance

Hi Vivek,

Thanks for open-sourcing this interesting project - nice work! I just have one question about the model size and the performance. I checked your Demucs model implementation and calculated the number of parameters, and with your default hyperparameter setting there are over 260M parameters. I'm not sure if this is the actual setting you used for training, but if so, this is a huge number of parameters, since other separation models are typically smaller than 10M nowadays. I'm wondering whether you have done any experiments on how the performance changes if you shrink the model size, e.g. to the level of the multi-channel Conv-TasNet or TAC reported in your paper. Thanks!

Two-channel audio recordings?

From what I understand, the model will work for 4-channel and 6-channel wav files.

Does this model also work on 2-channel recordings? Is there a pretrained model for that?

Sample 4-channel audio files?

Hi, in your README, you mention that "We even provide a sample 4 channel file for you to run". Where is this file located?

real-time?

Hi,

thanks for this library.
Is it possible to use this in a real-time scenario like conferences?

Best regards,
Dirk

No directory 'mir_eval'

Hi, I'm trying to recreate your results. When I run the code, I get an error message saying there's no such module 'mir_eval', and I wasn't able to find it. Perhaps you moved the files somewhere else or renamed the directory? The reference is from 'cos/helpers/eval_utils.py':

from mir_eval.separation import bss_eval_sources

Thanks

Real-time

Hi. I realize that you have already answered the question of whether the algorithm is real-time in #9. I just want to know whether there is anything we can do to make it run in real time.

[bug] Duplicate un-normalize in train.py

Hi, vivjay

# Un-normalize
output_signal = output_signal * stds.unsqueeze(3) + means.unsqueeze(3)
# Un-normalize
output_signal = unnormalize_input(output_signal, means, stds)
output_voices = output_signal[:, 0]
loss = model.loss(output_voices, label_voice_signals)

Line 117 and line 121 do the same thing.
Un-normalizing twice keeps the test (validation) loss from decreasing.
This is much like the problem @zhangshengoo raised in #12 (comment).
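
The effect described in this report can be reproduced with a toy example. The sketch below uses scalar statistics and hypothetical helpers named normalize and unnormalize (they are not the repository's unnormalize_input); it only illustrates why applying the inverse transform twice moves the output away from the original signal.

```python
def normalize(signal, mean, std):
    """Shift and scale a signal to zero mean and unit standard deviation."""
    return [(x - mean) / std for x in signal]


def unnormalize(signal, mean, std):
    """Inverse of normalize: restore the original scale and offset."""
    return [x * std + mean for x in signal]


signal = [2.0, 4.0, 6.0]
mean, std = 4.0, 2.0

normalized = normalize(signal, mean, std)
once = unnormalize(normalized, mean, std)   # recovers the original signal
twice = unnormalize(once, mean, std)        # scaled and shifted a second time

print(once)   # [2.0, 4.0, 6.0]
print(twice)  # [8.0, 12.0, 16.0]
```

A loss computed against the true labels after the second call compares signals on different scales, which is consistent with a validation loss that does not go down.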

Mean and STD of the signal peak

Hi Vivek,

Thanks for your awesome work.
What do FG_VOL_MIN, FG_VOL_MAX, BG_VOL_MIN, and BG_VOL_MAX in generate_dataset.py mean, and how did you calculate these four values?

Best regards,
KenHuang

dataset used for COS

Hey, I loved your work. I was trying to replicate it by generating a synthetic dataset, but I ran into some errors and have a few questions.
As you mentioned, the dataset used is VCTK. Its audio is in .flac format, which is not recognized by the program, so did you do any preprocessing on the dataset?
Also, there is no data folder in the original VCTK dataset (the one mentioned in the command to generate the synthetic dataset).
Finally, can you share the dataset you used for training?

Real Data

Hi,

Your paper mentions you recorded 3 hours of data into the Seeed 4 mic hat from VCTK corpus.

Would you be willing to make that available?
We'd like to duplicate the results, and we don't get the same level from training on just the synthetic data.

Thanks!!
Richard
