Codebase for ECCV18 "The Sound of Pixels"

Home Page: http://sound-of-pixels.csail.mit.edu

License: MIT License

sound-of-pixels's Introduction

Sound-of-Pixels

Codebase for ECCV18 "The Sound of Pixels".

*This repository is under construction, but the core components are already in place.

Environment

The code is developed under the following configurations.

  • Hardware: 1-4 GPUs (change [--num_gpus NUM_GPUS] accordingly)
  • Software: Ubuntu 16.04.3 LTS, CUDA>=8.0, Python>=3.5, PyTorch>=0.4.0

Training

  1. Prepare video dataset.

    a. Download MUSIC dataset from: https://github.com/roudimit/MUSIC_dataset

    b. Download videos.

  2. Preprocess videos. You can do it in your own way as long as the index files are similar.

    a. Extract frames at 8 fps and waveforms at 11025 Hz from the videos. We use the following directory structure (a sketch of one possible extraction command appears at the end of this step):

    data
    ├── audio
    │   ├── acoustic_guitar
    │   │   ├── M3dekVSwNjY.mp3
    │   │   ├── ...
    │   ├── trumpet
    │   │   ├── STKXyBGSGyE.mp3
    │   │   ├── ...
    │   ├── ...
    │
    └── frames
        ├── acoustic_guitar
        │   ├── M3dekVSwNjY.mp4
        │   │   ├── 000001.jpg
        │   │   ├── ...
        │   ├── ...
        ├── trumpet
        │   ├── STKXyBGSGyE.mp4
        │   │   ├── 000001.jpg
        │   │   ├── ...
        │   ├── ...
        ├── ...

    b. Make training/validation index files by running:

    python scripts/create_index_files.py
    

    It will create index files train.csv/val.csv with the following format:

    ./data/audio/acoustic_guitar/M3dekVSwNjY.mp3,./data/frames/acoustic_guitar/M3dekVSwNjY.mp4,1580
    ./data/audio/trumpet/STKXyBGSGyE.mp3,./data/frames/trumpet/STKXyBGSGyE.mp4,493
    

    For each row, it stores the information: AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES
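    The repository leaves this preprocessing step to the user. As one possible approach (not an official script), the sketch below calls ffmpeg through Python's subprocess module to extract 8 fps frames and an 11025 Hz mono waveform for a single video; the extract_one helper name and the ./raw_videos input path are illustrative only.

    # Hypothetical preprocessing sketch, not part of this repository.
    # Requires ffmpeg on the PATH. Writes into the directory layout
    # expected by scripts/create_index_files.py.
    import os
    import subprocess

    def extract_one(video_path, instrument, out_root="./data"):
        vid = os.path.splitext(os.path.basename(video_path))[0]

        # Frames: data/frames/<instrument>/<video_id>.mp4/000001.jpg, ...
        frame_dir = os.path.join(out_root, "frames", instrument, vid + ".mp4")
        os.makedirs(frame_dir, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-r", "8",
             os.path.join(frame_dir, "%06d.jpg")],
            check=True)

        # Audio: data/audio/<instrument>/<video_id>.mp3 at 11025 Hz, mono
        audio_dir = os.path.join(out_root, "audio", instrument)
        os.makedirs(audio_dir, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn",
             "-ar", "11025", "-ac", "1",
             os.path.join(audio_dir, vid + ".mp3")],
            check=True)

    # Example (paths are placeholders):
    # extract_one("./raw_videos/acoustic_guitar/M3dekVSwNjY.mp4", "acoustic_guitar")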

  3. Train the default model.

     ./scripts/train_MUSIC.sh

  4. During training, visualizations are saved in HTML format under ckpt/MODEL_ID/visualization/.
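
For reference, the index files created in step 2 are what the training data loader consumes. Below is a minimal, hypothetical sketch of reading such an index into a PyTorch-style dataset; the MusicIndexDataset name is illustrative, and the repository's actual loader differs (random clip sampling, STFT computation, frame transforms, and so on).

    # Illustrative only: a minimal dataset over the index files, showing how
    # AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES rows can be consumed.
    import csv
    import os
    import torch.utils.data as data
    from PIL import Image

    class MusicIndexDataset(data.Dataset):
        def __init__(self, csv_path):
            with open(csv_path) as f:
                self.rows = [(a, d, int(n)) for a, d, n in csv.reader(f)]

        def __len__(self):
            return len(self.rows)

        def __getitem__(self, idx):
            audio_path, frame_dir, num_frames = self.rows[idx]
            # Load the center frame as a stand-in for the repository's
            # multi-frame sampling; a real loader would also load and
            # transform the audio.
            center = os.path.join(frame_dir,
                                  "{:06d}.jpg".format(num_frames // 2 + 1))
            frame = Image.open(center).convert("RGB")
            return {"audio_path": audio_path, "frame": frame,
                    "num_frames": num_frames}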

Evaluation

  1. (Optional) Download our trained model weights for evaluation.

     ./scripts/download_trained_model.sh

  2. Evaluate the trained model performance.

     ./scripts/eval_MUSIC.sh

Reference

If you use the code or dataset from the project, please cite:

    @InProceedings{Zhao_2018_ECCV,
        author = {Zhao, Hang and Gan, Chuang and Rouditchenko, Andrew and Vondrick, Carl and McDermott, Josh and Torralba, Antonio},
        title = {The Sound of Pixels},
        booktitle = {The European Conference on Computer Vision (ECCV)},
        month = {September},
        year = {2018}
    }

sound-of-pixels's Issues

Calculate the evaluation metrics as zero

When I first computed the evaluation metrics using an ideal binary mask, all of them were zero. Through debugging, I found that the predicted masks are all below 0.5. I don't know how to solve this problem, or is it simply that this first evaluation was run before any training, so the result is poor?
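
For reference, the SDR/SIR/SAR numbers reported by the evaluation script are the standard BSS-eval metrics, which are commonly computed with the mir_eval package. The sketch below (variable names are illustrative) shows how they are obtained from ground-truth and estimated waveforms; BSS-eval is undefined for silent estimates, so masks binarized to all zeros cannot yield meaningful scores.

    # Illustrative only: BSS-eval metrics for two separated sources.
    import numpy as np
    from mir_eval.separation import bss_eval_sources

    def bss_metrics(gt1, gt2, est1, est2):
        """gt*/est*: 1-D numpy arrays of equal length (time-domain waveforms)."""
        reference = np.stack([gt1, gt2])   # (n_sources, n_samples)
        estimated = np.stack([est1, est2])
        sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
        return sdr, sir, sar, perm

    # Note: if an estimated waveform is (near-)silent, e.g. because every
    # predicted mask value falls below the 0.5 binarization threshold, the
    # metrics are not meaningful, which is consistent with this issue.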

Dataset Structure

While downloading the duet videos, do we need to make a separate combined folder (e.g., "xylophone flute"), or do we need to put the same video into two separate folders, xylophone and flute?

Why does the model not train?

Hello, I am a Chinese student.
I have preprocessed the dataset and used train_MUSIC.sh to train the default model.
But the result is not what I expected: the metrics are all 0.
Even when I directly run eval_MUSIC.sh (with the downloaded trained model), I also get 0 for the metrics (SDR, SIR, etc.).
I have not changed the code published in this GitHub repository.
How can I find out what the problem is?

Poor visualizations, getting zero SDR, SIR, etc. on evaluation

I was trying to evaluate on 16 videos using the downloaded trained model, but I am unable to see the results in the visualization. Video1 and video2 have only 3 frames each with no audio, and the predicted audio is also silent.

I'm getting the following output after evaluation:

Loading weights for net_frame
Loading weights for net_synthesizer
samples: 6300
samples: 16
1 Epoch = 196 iters
Evaluating at 0 epochs...
[Eval] iter 0, loss: 0.0115
[Eval Summary] Epoch: 0, Loss: 0.0115, SDR_mixture: 0.0000, SDR: 0.0000, SIR: 0.0000, SAR: 0.0000
Plotting html for visualization...
Evaluation Done!

Hope I can get some help.
Thanks

A Question on Evaluation

Hello, I am a Chinese student.
I have downloaded two solo videos (2P83WJXifEs and 3d1b4UH43-E) from 'val.csv' to evaluate the performance of the model. The final loss is 0.5479, and the quality of each separated sound is very unsatisfactory. Why is that? I hope to get your reply.

P.S. I downloaded the trained model weights for evaluation with:
> ./scripts/download_trained_model.sh
and I evaluated the trained model performance with:
> ./scripts/eval_MUSIC.sh

Where is the pixel-wise sound?

Hi, I saw the function forward_pixelwise in the synthesizer code; it is a version of the forward function that produces pixel-wise masks. However, throughout the code I found that only forward is invoked, never the pixel-wise one. Is there a demo that can produce pixel-wise sound?
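
For context, the pixel-wise idea is roughly this: every spatial location of the (unpooled) visual feature map supplies a K-dimensional vector that is combined with the K-channel audio feature map, giving one spectrogram mask per pixel. The sketch below only illustrates that mechanism; it is not the repository's forward_pixelwise implementation, and the pixelwise_masks function name is hypothetical.

    # Rough sketch of a pixel-wise synthesizer pass (illustrative only).
    import torch

    def pixelwise_masks(feat_img, feat_sound):
        """
        feat_img:   (B, K, H, W)  visual feature map before spatial pooling
        feat_sound: (B, K, F, T)  audio feature map over the spectrogram
        returns:    (B, H, W, F, T) one separation mask per pixel
        """
        B, K, H, W = feat_img.shape
        _, _, F, T = feat_sound.shape
        img = feat_img.permute(0, 2, 3, 1).reshape(B, H * W, K)   # (B, HW, K)
        snd = feat_sound.reshape(B, K, F * T)                     # (B, K, FT)
        masks = torch.sigmoid(torch.bmm(img, snd))                # (B, HW, FT)
        return masks.view(B, H, W, F, T)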

Download / pre-process data

Judging from the issues, this is a common request... How do we download the video files using the .json file? And how can we preprocess the downloaded videos into the following format:

data
├── audio
│   ├── acoustic_guitar
│   │   ├── M3dekVSwNjY.mp3
│   │   ├── ...
│   ├── trumpet
│   │   ├── STKXyBGSGyE.mp3
│   │   ├── ...
│   ├── ...
│
└── frames
    ├── acoustic_guitar
    │   ├── M3dekVSwNjY.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── trumpet
    │   ├── STKXyBGSGyE.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── ...

Are there any scripts provided for these? Thanks.
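
No download script ships with this repository. As a rough sketch, the snippet below uses the external yt-dlp tool to fetch videos by YouTube ID; it assumes the dataset JSON maps instrument categories to lists of IDs under a "videos" key, which may not match the actual schema and should be adjusted. Frame and audio extraction can then follow the preprocessing sketch in the Training section above.

    # Hypothetical download sketch, not provided by this repository.
    # Requires the yt-dlp command-line tool on the PATH.
    import json
    import os
    import subprocess

    def download_videos(json_path, out_root="./raw_videos"):
        with open(json_path) as f:
            meta = json.load(f)
        # Assumption: {"videos": {"acoustic_guitar": ["M3dekVSwNjY", ...], ...}}
        for category, video_ids in meta.get("videos", {}).items():
            out_dir = os.path.join(out_root, category)
            os.makedirs(out_dir, exist_ok=True)
            for vid in video_ids:
                subprocess.run(
                    ["yt-dlp", "-f", "mp4",
                     "-o", os.path.join(out_dir, "%(id)s.%(ext)s"),
                     "https://www.youtube.com/watch?v=" + vid],
                    check=False)  # some videos may be unavailable

    # download_videos("path/to/the/dataset.json")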

Cannot download the trained model

Hello. I tried to download the trained model, but running 'download_trained_model.sh' failed. I also tried to access the model URL "http://sound-of-pixels.csail.mit.edu/release/", but I got the reply "You don't have permission to access /release/ on this server." So I cannot get the trained model. How can I solve this problem?
Thanks a lot.

Failed to load frames/audio

Sir, first I created the .csv index files; they list the inputs and their paths correctly. But during training it reports "failed to load frames/audio".
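
A common cause of this warning is an index file whose paths or frame counts no longer match what is on disk. The sketch below (the check_index helper is hypothetical, not part of this repository) verifies each row, assuming the AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES format described above.

    # Hypothetical sanity check for the index files.
    import csv
    import os

    def check_index(csv_path):
        with open(csv_path) as f:
            for audio_path, frames_path, n_frames in csv.reader(f):
                if not os.path.isfile(audio_path):
                    print("missing audio:", audio_path)
                if not os.path.isdir(frames_path):
                    print("missing frame folder:", frames_path)
                    continue
                n_disk = len([p for p in os.listdir(frames_path)
                              if p.endswith(".jpg")])
                if n_disk != int(n_frames):
                    print("frame count mismatch:", frames_path, n_disk, n_frames)

    # check_index("path/to/train.csv")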

Downloading the videos from JSON file

In the JSON file mentioned, there are a number of YouTube IDs. I wanted to ask: do we need to download them manually from that file, or is there a better way? Also, how do we extract the frames and the audio signal at the desired rates? I have never used the JSON format before, so please excuse my ignorance. Some guidance would be helpful.

About duet and mixture videos

I evaluated the trained model with the weights you provided.
I see that the trained model uses the Mix-and-Separate procedure and reconstructs the two audio tracks from two solo videos given as input. This is the validation part.
What about the test part on duet videos?
I am interested in research on sound source localization and separation for natural duet videos.
Should I train the model from scratch, or can I still use the trained model you provided?
Could you give me some suggestions, please?
Thank you, I'm looking forward to your reply.
