
ScalerGAN : Speech Time-Scale Modification With GANs

Eyal Cohen ([email protected])
Felix Kreuk ([email protected])
Joseph Keshet ([email protected])

ScalerGAN is a software package for time-scale modification (i.e., speeding up or slowing down) of a given recording using a novel unsupervised learning algorithm.

The model was presented in the paper Speech Time-Scale Modification With GANs, published in IEEE Signal Processing Letters.

Audio examples can be found here.

If you use this code, please cite the following paper:

@article{cohen2022scalergan,
  author={Cohen, Eyal and Kreuk, Felix and Keshet, Joseph},
  journal={IEEE Signal Processing Letters},
  title={Speech Time-Scale Modification With GANs},
  year={2022},
  volume={29},
  pages={1067-1071},
  doi={10.1109/LSP.2022.3164361},
  publisher={IEEE}
}

Pre-requisites:

  1. Python 3.8.5+
  2. Clone this repository.
  3. Install the Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset.
  5. Create an input.txt file with the path to the audio files in the dataset.
    One can use the following command to create the input.txt file from the dataset:
$ realpath <PATH/TO/LJ_dataset/DIR>/*.wav > ./data/input.txt

Or modify the data/input.txt to point to the dataset files.
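The same input.txt can also be generated from Python. A minimal sketch (the helper name, the .wav extension, and the directory layout are assumptions, not part of the repository):

```python
from pathlib import Path


def write_input_file(dataset_dir: str, out_path: str) -> int:
    """Write the absolute paths of all .wav files in dataset_dir to
    out_path, one path per line, and return how many were written."""
    wavs = sorted(Path(dataset_dir).resolve().glob("*.wav"))
    Path(out_path).write_text("\n".join(str(p) for p in wavs) + "\n")
    return len(wavs)
```

For example, `write_input_file("LJSpeech-1.1/wavs", "data/input.txt")` would list every wav file under that directory.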

Training:

Quick Training (using default parameters):

$ python ScalerGAN/train.py

See configs.py for more options.

Single GPU Training:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt> --output_dir <PATH/TO/OUTPUT/DIR> --device cuda

Distributed training (multiple GPUs) using torch.distributed.launch:

$ python -m torch.distributed.launch --nproc_per_node=N ScalerGAN/train.py --device cuda --name multi_gpu --distributed --distributed_backend='nccl'

Set N in the --nproc_per_node argument to the number of available GPUs.

Resume training from checkpoint:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt>  --device cuda --resume <PATH/TO/CHECKPOINT>

Train with manual min and max time-scale factors:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt> --device cuda --min_scale 0.5 --max_scale 2.0
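To illustrate what the two flags control: during training, a random time-scale factor is drawn from the [min_scale, max_scale] range for each example. A minimal sketch (uniform sampling is an assumption here, not necessarily the repository's exact scheme):

```python
import random


def sample_scale(min_scale: float = 0.5, max_scale: float = 2.0) -> float:
    """Draw a random time-scale factor in [min_scale, max_scale]."""
    return random.uniform(min_scale, max_scale)
```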

Inference:

One can modify the inference file data/inference.txt to list the desired audio files, or use '--inference_file' to specify an inference txt file. The artifacts will be saved in the output directory.

Quick Inference with mel-spectrogram output:

$ python ScalerGAN/inference.py --device cuda

Quick Inference with mel-spectrogram and audio output:

$ python ScalerGAN/inference.py --device cuda --infer_hifi

The flag '--infer_scales' can be used to specify the scales for inference. If not specified, the model will use the default scales [0.5, 0.7, 0.9, 1.1, 1.3, 1.5].
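To make the effect of a scale factor concrete, the expected output length can be estimated from the input length. A minimal sketch (the convention assumed here is that the factor multiplies the duration, so a factor below 1.0 shortens the clip and a factor above 1.0 lengthens it; this is an illustration, not the repository's code):

```python
def scaled_num_frames(num_frames: int, scale: float) -> int:
    """Estimate the output length in spectrogram frames for one
    time-scale factor (assumed convention: duration is multiplied
    by the factor)."""
    return max(1, round(num_frames * scale))
```

For instance, with the default scales, a 100-frame input would yield outputs of roughly 50 to 150 frames.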

Inference with specific checkpoint:

$ python ScalerGAN/inference.py --checkpoint_path <PATH/TO/CHECKPOINT> --device cuda

Fine Tuning:

To fine-tune the ScalerGAN model on a different dataset, one can use the following command:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt>  --device cuda --fine_tune --checkpoint_path <PATH/TO/CHECKPOINT>

To fine-tune the HiFi-GAN vocoder, do the following:

  1. Run ScalerGAN inference on the desired dataset:
$ python ScalerGAN/inference.py --input_file <PATH/TO/INPUT.txt> --device cuda
     The cropped audio and the generated mel spectrograms will be saved in the output directory; copy them to the location expected by the HiFi-GAN vocoder.
  2. Follow the instructions in the HiFi-GAN repository for fine-tuning the model.
  3. Use the generated HiFi-GAN checkpoint and config for inference with ScalerGAN, via the flags '--hifi_config' and '--hifi_checkpoint':
$ python ScalerGAN/inference.py --checkpoint_path <PATH/TO/SCALER_GAN/CHECKPOINT> --device cuda --infer_hifi --hifi_config <PATH/TO/HIFI_CONFIG> --hifi_checkpoint <PATH/TO/HIFI_CHECKPOINT>

Please Note

  • Prediction/training is significantly faster on a GPU.
  • The model is trained on 22.05 kHz audio files. Audio at a different sampling rate will not work as expected.
  • The model crops the audio files so their length is divisible by '--must_divide' to fit the model architecture. The default value is 8.
  • The model is trained with 80 mel bins. To change this, you must change it in both the training and inference scripts.
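The divisibility cropping mentioned above can be sketched as follows (a minimal illustration of the idea, not the repository's exact code):

```python
def crop_to_divisible(frames, must_divide: int = 8):
    """Drop trailing frames so the sequence length is divisible by
    must_divide (default 8, matching the '--must_divide' default)."""
    n = len(frames) - (len(frames) % must_divide)
    return frames[:n]
```

For example, a 101-frame sequence would be cropped to 96 frames before being fed to the model.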

Acknowledgements

