retrocirce / zero_shot_audio_source_separation

The official code repo for "Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data", in AAAI 2022

Home Page: https://arxiv.org/abs/2112.07891

License: MIT License

Python 99.56% Shell 0.44%
audio-source-separation music-information-retrieval zero-shot-learning transformer-models query-based-learning python

zero_shot_audio_source_separation's Introduction

Zero Shot Audio Source Separation

Introduction

The Code Repository for "Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data", in AAAI 2022.

In this paper, we propose a three-component pipeline that allows you to train an audio source separator to separate any source from a track. All you need is the mixture audio you want to separate and a sample of the target source as a query; the model will then separate your specified source from the track. Our model operates in a zero-shot setting because we never use a separation dataset, only the general audio dataset AudioSet. Nevertheless, we achieve very competitive separation performance (SDR) on the MUSDB18 dataset compared with supervised models, and the model generalizes to unseen sources outside the training set. Indeed, we do not require any separation dataset for training, only AudioSet.

The demos and introduction are presented in our short introduction video and full presentation video.

More demos will be presented on my personal website (now under construction).

Check out the interactive demo on Replicate. Thanks @ariel415el for creating this!

Model Arch

Main Separation Performance on MUSDB18 Dataset

We achieve very competitive separation performance (SDR) on the MUSDB18 dataset, without seeing the MUSDB18 training data or specifying source targets, compared with supervised models.

Additionally, our model can easily separate many other sources, such as violin, harmonica, guitar, etc. (demos shown in the above video link)

MUSDB results

Getting Started

Install Requirements

pip install -r requirements.txt

Download and Processing Datasets

  • config.py
change the variable "dataset_path" to your AudioSet path
change classes_num to 527
./create_index.sh
// remember to change the paths in the script
// more information about this script is at https://github.com/qiuqiangkong/audioset_tagging_cnn

python main.py save_idc 
// count the number of samples in each class and save the npy files
python main.py musdb_process
// Notice that the training set is a highlight version, while the testing set is the full version

Set the Configuration File: config.py

The script config.py contains all the configuration options you need to set before running the code.

Please read the introduction comments in the file and change your settings.

For the most important part:

If you want to train/test your model on AudioSet, you need to set:

dataset_path = "your processed audioset folder"
balanced_data = True
sample_rate = 32000
hop_size = 320 
classes_num = 527

Train and Evaluation

Train the sound event detection system ST-SED/HTS-AT

We have further integrated this ST-SED system into an independent repository, evaluated it on more datasets, improved it substantially, and achieved better performance.

You can follow that repo to train and evaluate the sound event detection system ST-SED (or, under its more descriptive name, HTS-AT). The configuration file for training the model for this separation task is htsat_config.py.

For this separation task, if you want to save time, you can also download the checkpoint directly.

Train, Evaluate, and Run Inference with the Separation Model

All scripts are run via main.py:

Train: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train

Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test

We recommend using at least 4 GPUs with more than 20 GB of memory per card. In our training phase, we used 8 NVIDIA V100 (32 GB) GPUs.

We provide a quick inference interface by:

CUDA_VISIBLE_DEVICES=1 python main.py inference

This lets you separate any given source from a track. You need to set the values of "inference_file" and "inference_query" in config.py; just check the comments to get started. For inference, we recommend using only one GPU, which is sufficient.
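As a minimal sketch (not part of the repo's own API; file paths below are placeholders), you can resample your mixture and query clips to the 32 kHz mono format set in config.py before pointing "inference_file" and "inference_query" at them:

    # Minimal preprocessing sketch: resample a mixture and a query clip to the
    # 32 kHz mono format configured above, then set inference_file and
    # inference_query in config.py to the resulting files. Paths are placeholders.
    import librosa
    import soundfile as sf

    def prepare_wav(in_path, out_path, sr=32000):
        audio, _ = librosa.load(in_path, sr=sr, mono=True)  # resample + downmix to mono
        sf.write(out_path, audio, sr)

    prepare_wav("my_mixture.wav", "data/inference/mixture.wav")
    prepare_wav("my_query_clip.wav", "data/query/query_01.wav")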

Model Checkpoints:

We provide the model checkpoints in this link. Feel free to download and test it.

Citing

@inproceedings{zsasp-ke2022,
  author = {Ke Chen* and Xingjian Du* and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data},
  booktitle = {{AAAI} 2022}
}

@inproceedings{htsat-ke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {{ICASSP} 2022}
}

zero_shot_audio_source_separation's People

Contributors

ariel415el, diggerdu, dingjibang, retrocirce, satisfy256, zfturbo


zero_shot_audio_source_separation's Issues

problem with installation

Hi.
I encountered problems when installing; there might be issues with requirements.txt.
I ran pip install -r requirements.txt in a conda environment and got an error about the user requesting numpy 1.22.0.
Numba 0.55.1, which is in the requirements, doesn't support numpy 1.22.0, so I installed numpy 1.21.5. When running main.py I then got an omegaconf error about an unsupported value type.
Then I did pip install omegaconf -U and pip install hydra-core --upgrade, and that fixed the issue.
I've been enjoying the program so far. Inference runs well on my CPU, and I am able to do things that I wasn't able to do with other source separation algorithms.

Poor results (and doubt)

Hello, how are you?

With your explanations I finally managed to use Zero Shot ASS.

However, when examining the results using the test key "vocals", I got poor results, and I don't understand why, since the examples in the videos sound very good.

So my question is: how many examples should I put in the "Data/Query" folder to get decent results?

Here is what I did: "mixture.wav" is the complete instrumental together with the vocals, and in the "Query" folder I put 15 examples of clean vocals (from the singer in question).

Am I doing something wrong?

NOTE: each example is on average 7 to 15 seconds long, in .wav format.

I tested other examples and got the same pattern (poor results). By poor results I mean almost no change at the end: the files in the "wavoutput" folder are practically unmodified.

I only tested the htsat_audioset_2048d.ckpt model, because the other two checkpoints are incompatible with this code; I received checkpoint errors (the script is not set up to accept them).

If you can let me know why, I would be grateful.

Thanks in advance,

Lucas Rodrigues.

Quality not up to par with the demo maybe.

Hi!
First of all, I think this tool is amazing, as it helps improve quality when used together with other tools such as MDX.
But I noticed the exports tend to have a bit of leakage in them, which doesn't seem to be as bad in the demo.
I have a few questions:

There are three checkpoint files, and as far as I know only one of them works with the htsat config: htsat_audioset_2048d.ckpt. For the other two (zeroshot_asp_full.ckpt and zeroshot_asp_held_out.ckpt), I don't know what the difference is between them or where they are supposed to be used in the config file.

All predicted files (other, vocals, bass, drums) have the same outputs. I am not sure if this is because of the type of inference, but changing "dataset_type" in the htsat config does not seem to make any difference.

Are samples supposed to be only 10 seconds? Is it possible to increase that, or is there no point in doing so?

And the last question: what would you recommend setting in the configs for the best possible quality?

Thank you.

Different length of input and output

Hello. After applying the model, the length of the output is slightly different from the input; for example, 9265664 samples became 9216000. This causes problems for validation on the MUSDB dataset, as well as for creating the inverse of the extracted stem. What is the best way to align the output to the input? Right now I do the following:

    if audio.shape[-1] > vocals.shape[-1]:
        audio = audio[..., :vocals.shape[-1]]
    invert = audio - vocals
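One alternative (a sketch, not a utility from this repo; `match_length` is a hypothetical helper name) is to zero-pad the separated stem up to the original input length instead of truncating the input, so that evaluation and stem inversion run on full-length audio:

    import numpy as np

    def match_length(reference, estimate):
        # Zero-pad or trim `estimate` along the last axis so it matches `reference`.
        diff = reference.shape[-1] - estimate.shape[-1]
        if diff > 0:
            pad = [(0, 0)] * (estimate.ndim - 1) + [(0, diff)]
            return np.pad(estimate, pad)
        return estimate[..., :reference.shape[-1]]

    vocals = match_length(audio, vocals)
    invert = audio - vocals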

Reproducing paper results

Unfortunately, I wasn't able to get close to the paper's results. My results on the MUSDB18HQ dataset are as follows:

SDR bass: 1.0778
SDR drums: 2.5914
SDR other: 0.6353
SDR vocals: 2.9829

My method: I used 100 training tracks from the MUSDB18HQ dataset as queries. Then I created an averaged 2048-dimensional vector for each of 'bass', 'drums', 'other', and 'vocals'. Then I separated the test tracks using these vectors and calculated the SDR values. Am I missing something?
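For reference, here is a sketch of the averaging step described above; `embed_query` is a hypothetical stand-in for whatever call in this codebase maps a query clip to its 2048-dimensional latent vector:

    # Average per-clip query embeddings into a single 2048-d query vector.
    # `embed_query` is a placeholder for the model's query-encoding call.
    import numpy as np

    def average_query_vector(query_clips, embed_query):
        vectors = np.stack([embed_query(clip) for clip in query_clips])  # (N, 2048)
        return vectors.mean(axis=0)                                      # (2048,)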

AssertionError: there should be a saved model when inferring

Hi, how are you?

I followed the guidelines for using Zero Shot ASS, but I am getting the following error when trying to run inference (the same error occurs on Google Colab and on my local computer).

I have already downloaded the checkpoints and made the directory changes in the config.py file, but I still get this error. How can I resolve this issue? Thank you very much again!

ERROR OUTPUT: [screenshot of the error]

This is my already configured config.py file: [screenshot of config.py]

ERROR: Unexpected bus error encountered in worker.

I get this error when running the model on Replicate for a 5-minute song (if I truncate the song to 1 minute, it works).

Any ideas?

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Doubt about increasing samples (Query)

Would it be possible to increase the number of examples in the Query folder? I noticed that when I put more than 74 .wav files there, I get this error: "ZeroDivisionError: integer division or modulo by zero".

So I would like to know if it is possible to increase the number of samples. Thanks again.

How to get the remaining signal subtracted

Hi. Once the query sound is extracted, how do I get the remaining (subtracted) signal? Does the model emit that as well, or do I have to build a different model for extracting the residual signal? In typical Demucs datasets there is always an "others" stem that keeps track of this; is there something like that in this project? I could not find anything in the code. Can you help me out?

Can't use overlap + clicks issue

"raise "ffmpeg does not exist. Install ffmpeg or set config.overlap_rate to zero."
TypeError: exceptions must derive from BaseException"

Yes, I have ffmpeg installed

Basically that's it. I tried overlap 1.0, but it doesn't seem to work; 0.0 works though.

I wanted to try it because the outputs always have periodic clicks in them.


How to use

Hello, how are you? Could you tell us how to use this tool? Thank you very much!

Is it possible to keep the result sound for each individual sample?

During my testing, I found that 6 samples in one sample folder versus 6 individual separations give different results, and I assume this is because of some kind of generalization (I don't know how this works :P). I also found that when I put in 500 instrument samples, the output was basically null.

Is there any mode or script in the code that could change this, for example to include the audio from each sample extraction in one mixdown file, or at least give every sample its own output file? It would make things much easier for me instead of doing it manually, because this method works better for the things I'm testing.
(I tried gather mode; I don't know if that's the idea, but the output still doesn't seem to equal 6 individual sample inferences combined.)
Thank you!

Export pred_.wav to stereo

Hello, how are you? I'm really enjoying Zero Shot ASS.

I noticed that the files exported after inference are in mono WAV format.

So my question is: would it be possible to somehow export the final result in stereo (2 channels)? If so, I imagine I should modify some lines in the config.py file, no?

Thanks again.
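One possible workaround, sketched below under the assumption that you run inference once per channel (the file names are hypothetical, not actual repo outputs): separate the left and right channels as two mono inputs, then stack the two mono predictions into a single stereo file.

    # Combine two mono predictions (one per channel) into a stereo WAV.
    # File names are placeholders for per-channel inference outputs.
    import numpy as np
    import soundfile as sf

    left, sr = sf.read("wavoutput/pred_vocals_left.wav")
    right, _ = sf.read("wavoutput/pred_vocals_right.wav")
    n = min(len(left), len(right))                      # guard against small length drift
    stereo = np.stack([left[:n], right[:n]], axis=-1)   # shape: (n, 2)
    sf.write("pred_vocals_stereo.wav", stereo, sr)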

Colab notebook?

Are there any plans to create a Colab notebook for this model?

Code release

Hi, authors. When will the code be released?
Thanks.
