
ScalerGAN : Speech Time-Scale Modification With GANs

Eyal Cohen ([email protected])
Felix Kreuk ([email protected])
Joseph Keshet ([email protected])

ScalerGAN is a software package for time-scale modification (i.e., speeding up or slowing down) of a given recording using a novel unsupervised learning algorithm.

The model was presented in the paper Speech Time-Scale Modification With GANs, published in IEEE Signal Processing Letters.

Audio examples can be found here.

If you use this code, please cite the following paper:

@article{cohen2022scalergan,
  author={Cohen, Eyal and Kreuk, Felix and Keshet, Joseph},
  journal={IEEE Signal Processing Letters},
  title={Speech Time-Scale Modification With GANs},
  year={2022},
  volume={29},
  pages={1067-1071},
  doi={10.1109/LSP.2022.3164361},
  publisher={IEEE}
}

Pre-requisites:

  1. Python 3.8.5+
  2. Clone this repository.
  3. Install the Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset.
  5. Create an input.txt file with the path to the audio files in the dataset.
    One can use the following command to create the input.txt file from the dataset:
$ realpath <PATH/TO/LJ_dataset/DIR>/*.wav > ./data/input.txt

Or modify the data/input.txt to point to the dataset files.
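The same input.txt can also be generated from Python. A minimal sketch (the helper name, the .wav extension, and the directory layout are assumptions, not part of the repository):

```python
from pathlib import Path


def write_input_file(dataset_dir: str, out_path: str) -> int:
    """Write the absolute paths of all .wav files in dataset_dir to
    out_path, one path per line, and return how many were written."""
    wavs = sorted(Path(dataset_dir).resolve().glob("*.wav"))
    Path(out_path).write_text("\n".join(str(p) for p in wavs) + "\n")
    return len(wavs)
```

For example, `write_input_file("LJSpeech-1.1/wavs", "data/input.txt")` would list every wav file under that directory.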

Training:

Quick Training (using default parameters):

$ python ScalerGAN/train.py

See configs.py for more options.

Single GPU Training:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt> --output_dir <PATH/TO/OUTPUT/DIR> --device cuda

Distributed training (multiple GPUs) using torch.distributed.launch:

$ python -m torch.distributed.launch --nproc_per_node=N ScalerGAN/train.py --device cuda --name multi_gpu --distributed --distributed_backend='nccl'

Set N in the --nproc_per_node argument to the number of available GPUs.

Resume training from checkpoint:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt>  --device cuda --resume <PATH/TO/CHECKPOINT>

Train with manual min and max time-scale factors:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt> --device cuda --min_scale 0.5 --max_scale 2.0
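To illustrate what the two flags control: during training, a random time-scale factor is drawn from the [min_scale, max_scale] range for each example. A minimal sketch (uniform sampling is an assumption here, not necessarily the repository's exact scheme):

```python
import random


def sample_scale(min_scale: float = 0.5, max_scale: float = 2.0) -> float:
    """Draw a random time-scale factor in [min_scale, max_scale]."""
    return random.uniform(min_scale, max_scale)
```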

Inference:

One can modify the inference file data/inference.txt to list the desired audio files, or use '--inference_file' to specify an inference txt file. The artifacts will be saved in the output directory.

Quick Inference with mel-spectrogram output:

$ python ScalerGAN/inference.py --device cuda

Quick Inference with mel-spectrogram and audio output:

$ python ScalerGAN/inference.py --device cuda --infer_hifi

The flag '--infer_scales' can be used to specify the scales for inference. If not specified, the model will use the default scales [0.5, 0.7, 0.9, 1.1, 1.3, 1.5].
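To make the effect of a scale factor concrete, the expected output length can be estimated from the input length. A minimal sketch (the convention assumed here is that the factor multiplies the duration, so a factor below 1.0 shortens the clip and a factor above 1.0 lengthens it; this is an illustration, not the repository's code):

```python
def scaled_num_frames(num_frames: int, scale: float) -> int:
    """Estimate the output length in spectrogram frames for one
    time-scale factor (assumed convention: duration is multiplied
    by the factor)."""
    return max(1, round(num_frames * scale))
```

For instance, with the default scales, a 100-frame input would yield outputs of roughly 50 to 150 frames.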

Inference with specific checkpoint:

$ python ScalerGAN/inference.py --checkpoint_path <PATH/TO/CHECKPOINT> --device cuda

Fine Tuning:

To fine-tune the ScalerGAN model on a different dataset, one can use the following command:

$ python ScalerGAN/train.py --input_file <PATH/TO/INPUT.txt>  --device cuda --fine_tune --checkpoint_path <PATH/TO/CHECKPOINT>

To fine-tune the HiFi-GAN vocoder, do the following:

  1. Run ScalerGAN inference on the desired dataset:
$ python ScalerGAN/inference.py --input_file <PATH/TO/INPUT.txt> --device cuda
     The cropped audio and the generated mel spectrograms will be saved in the output directory; copy them to the location expected by the HiFi-GAN vocoder.
  2. Follow the instructions in the HiFi-GAN repository for fine-tuning the model.
  3. Use the generated HiFi-GAN checkpoint and config for inference with ScalerGAN, via the flags '--hifi_config' and '--hifi_checkpoint':
$ python ScalerGAN/inference.py --checkpoint_path <PATH/TO/SCALER_GAN/CHECKPOINT> --device cuda --infer_hifi --hifi_config <PATH/TO/HIFI_CONFIG> --hifi_checkpoint <PATH/TO/HIFI_CHECKPOINT>

Please Note

  • Prediction/training is significantly faster on a GPU.
  • The model is trained on 22.05 kHz audio files. Audio at a different sampling rate will not work as expected.
  • The model crops the audio files so their length is divisible by '--must_divide' to fit the model architecture. The default value is 8.
  • The model is trained with 80 mel bins. To change this, you must change it in both the training and inference scripts.
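The divisibility cropping mentioned above can be sketched as follows (a minimal illustration of the idea, not the repository's exact code):

```python
def crop_to_divisible(frames, must_divide: int = 8):
    """Drop trailing frames so the sequence length is divisible by
    must_divide (default 8, matching the '--must_divide' default)."""
    n = len(frames) - (len(frames) % must_divide)
    return frames[:n]
```

For example, a 101-frame sequence would be cropped to 96 frames before being fed to the model.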

Acknowledgements

