bshall / acoustic-model

Acoustic models for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Home Page: https://bshall.github.io/soft-vc/

License: MIT License

Python 100.00%
pytorch representation-learning speech voice-conversion

acoustic-model's Introduction


Acoustic-Model

Training and inference scripts for the acoustic models in A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion. For more details see soft-vc. Audio samples can be found here. Colab demo can be found here.

Soft-VC
Fig 1: Architecture of the voice conversion system. a) The discrete content encoder clusters audio features to produce a sequence of discrete speech units. b) The soft content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.
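
As a rough end-to-end illustration of Fig 1, the whole system can be driven through torch.hub. The sketch below assumes the companion bshall/hubert and bshall/hifigan entry points from the soft-vc project, and the input path is a placeholder:

import torch
import torchaudio

# Load the three components of Fig 1: content encoder, acoustic model, vocoder
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").cuda()
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft").cuda()
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft").cuda()

# Load the source audio and resample to the 16 kHz expected by the encoder
source, sr = torchaudio.load("path/to/source.wav")
source = torchaudio.functional.resample(source, sr, 16000)
source = source.unsqueeze(0).cuda()

with torch.inference_mode():
    units = hubert.units(source)                    # b) soft content encoder -> speech units
    mel = acoustic.generate(units).transpose(1, 2)  # acoustic model -> target spectrogram
    target = hifigan(mel)                           # vocoder -> audio waveform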

Example Usage

Programmatic Usage

import torch
import numpy as np

# Load checkpoint (either hubert_soft or hubert_discrete)
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft").cuda()

# Load speech units
units = torch.from_numpy(np.load("path/to/units"))

# Generate mel-spectrogram
mel = acoustic.generate(units)
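
Note that the unit files saved by encode.py load as unbatched CPU tensors, while the model above was moved to the GPU, so a batch dimension and a device transfer are likely needed before calling generate (an assumption; adjust to your setup):

# Likely required: add a batch dimension and move the units to the model's device
units = units.unsqueeze(0).cuda()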

Script-Based Usage

usage: generate.py [-h] {soft,discrete} in-dir out-dir

Generate spectrograms from input speech units (discrete or soft).

positional arguments:
  {soft,discrete}  available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir           path to the dataset directory.
  out-dir          path to the output directory.

optional arguments:
  -h, --help       show this help message and exit
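
for example (an illustrative invocation following the help text above; the paths are placeholders):

python generate.py soft path/to/soft path/to/generated-mels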

Training

Step 1: Dataset Preparation

Download and extract the LJSpeech dataset. The training script expects the following tree structure for the dataset directory:

└───wavs
    ├───dev
    │   ├───LJ001-0001.wav
    │   ├───...
    │   └───LJ050-0278.wav
    └───train
        ├───LJ002-0332.wav
        ├───...
        └───LJ047-0007.wav

The train and dev directories should contain the training and validation splits respectively. The splits used for the paper can be found here.

Step 2: Extract Spectrograms

Extract mel-spectrograms using the mels.py script:

usage: mels.py [-h] in-dir out-dir

Extract mel-spectrograms for an audio dataset.

positional arguments:
  in-dir      path to the dataset directory.
  out-dir     path to the output directory.

optional arguments:
  -h, --help  show this help message and exit

for example:

python mels.py path/to/LJSpeech-1.1/wavs path/to/LJSpeech-1.1/mels

At this point the directory tree should look like:

├───mels
│   ├───...
└───wavs
    ├───...

Step 3: Extract Discrete or Soft Speech Units

Use the HuBERT-Soft or HuBERT-Discrete content encoders to extract speech units. First clone the content encoder repo and then run encode.py (see the repo for details):

usage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir

Encode an audio dataset.

positional arguments:
  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .flac).

for example:

python encode.py soft path/to/LJSpeech-1.1/wavs path/to/LJSpeech-1.1/soft --extension .wav

At this point the directory tree should look like:

├───mels
│   ├───...
├───soft/discrete
│   ├───...
└───wavs
    ├───...

Step 4: Train the Acoustic-Model

usage: train.py [-h] [--resume RESUME] [--discrete] dataset-dir checkpoint-dir

Train the acoustic model.

positional arguments:
  dataset-dir      path to the data directory.
  checkpoint-dir   path to the checkpoint directory.

optional arguments:
  -h, --help       show this help message and exit
  --resume RESUME  path to the checkpoint to resume from.
  --discrete       Use discrete units.
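
for example (an illustrative invocation; add --discrete to train on discrete units instead):

python train.py path/to/LJSpeech-1.1 checkpoints/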


Citation

If you found this work helpful, please consider citing our paper:

@inproceedings{
    soft-vc-2022,
    author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
    booktitle={ICASSP}, 
    title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion}, 
    year={2022}
}

acoustic-model's People

Contributors

bshall, qgentry, seastar105, tarepan


acoustic-model's Issues

Bug: Training crash with missing argument `discrete`

Summary

Training the acoustic model with train.py crashes with a missing-attribute error.
It is caused by a missing argparse argument, discrete.
It can be fixed by adding the argument, so I made a pull request (#5).

Phenomena

When train.py is run with a proper dataset-dir and checkpoint-dir, it crashes.
The error message says that the attribute discrete is missing.

Error Message

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/softVC_AM/train.py", line 96, in train
    discrete=args.discrete,
AttributeError: 'Namespace' object has no attribute 'discrete'

Cause

In train.py, args.discrete is used, but there is no corresponding parser.add_argument call.

acoustic-model/train.py

Lines 87 to 91 in df6eba9

train_dataset = MelDataset(
    root=args.dataset_dir,
    train=True,
    discrete=args.discrete,
)

Fix idea

As in the paper, softVC-AM seems to support both soft and discrete units,
so we can add a discrete flag (by default, it works in soft mode).
When I add it, the bug disappears (see the sketch below).
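
A minimal sketch of the missing argument, mirroring the --discrete flag shown in the usage text above (the actual fix landed in PR #5 and may differ in wording):

parser.add_argument(
    "--discrete",
    action="store_true",
    help="use discrete units",
)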

Notes

I made a pull request (#5) which will fix this bug.

Thanks for your great OSS! I am happy if this helps you and the community.

map_location argument is not supported

Typically it's possible to load torch models to cpu / gpu by using the map_location argument.

This doesn't work for the acoustic model:

TypeError: hubert_soft() got an unexpected keyword argument 'map_location'

On a CPU-only machine loading this model gives the error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
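
Until map_location is supported, a possible workaround (a sketch, assuming the hub entry point forwards a pretrained flag and that the checkpoint file name below is correct) is to skip the automatic download and map the checkpoint manually:

import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

# Build the architecture without downloading weights, then load them on the CPU
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft", pretrained=False)
checkpoint = torch.load("hubert-soft-0321fd7e.pt", map_location=torch.device("cpu"))
# Strip any "module." prefix left by DistributedDataParallel training
consume_prefix_in_state_dict_if_present(checkpoint["acoustic-model"], "module.")
acoustic.load_state_dict(checkpoint["acoustic-model"])
acoustic.eval()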

Finetuned model while loading RuntimeError: Error(s) in loading state_dict for AcousticModel

@bshall Thank you for this great work.

I fine-tuned the pre-trained LJSpeech acoustic model on my custom dataset (~1 hour):

python train.py --resume checkpoints/hubert-soft-0321fd7e.pt data/ finetuned_checkpoints/

The newly fine-tuned best model (model-best.pt) was trained for 20000 steps. I modified the code (https://github.com/bshall/acoustic-model/blob/main/acoustic/model.py#L119) to load from my checkpoint path instead of torch.hub.load_state_dict_from_url, but I got the error below. I have shared the error log and the modified code for your reference.

Can you please help me resolve this issue?

Thanks

Traceback (most recent call last):
  File "/root/Experiments/soft-vc/inference.py", line 12, in <module>
    acoustic = hubert_soft().cuda()
  File "/root/Experiments/soft-vc/acoustic/acoustic/model.py", line 165, in hubert_soft
    return _acoustic(
  File "/root/Experiments/soft-vc/acoustic/acoustic/model.py", line 133, in _acoustic
    acoustic.load_state_dict(checkpoint["acoustic-model"])
  File "/root/anaconda3/envs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AcousticModel:
        Missing key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight". 
        Unexpected key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight". 
def _acoustic(
    name: str,
    discrete: bool,
    upsample: bool,
    pretrained: bool = True,
    progress: bool = True,
) -> AcousticModel:
    acoustic = AcousticModel(discrete, upsample)
    if pretrained:
        # checkpoint = torch.hub.load_state_dict_from_url(URLS[name], progress=progress)
        # consume_prefix_in_state_dict_if_present(checkpoint["acoustic-model"], "module.")
        
        load_path = "/root/Experiments/soft-vc/acoustic/finetuned_checkpoints/model-best.pt"
        checkpoint = torch.load(load_path)
        acoustic.load_state_dict(checkpoint["acoustic-model"])
        acoustic.eval()
    return acoustic 
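
The mismatch appears to come from the commented-out consume_prefix_in_state_dict_if_present call: checkpoints saved from a DistributedDataParallel-wrapped model carry a module. prefix on every key, which is exactly the unexpected prefix in the error above. A sketch of the fix is to restore that call for the local checkpoint:

from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

checkpoint = torch.load(load_path)
# Strip the "module." prefix left by DistributedDataParallel training
consume_prefix_in_state_dict_if_present(checkpoint["acoustic-model"], "module.")
acoustic.load_state_dict(checkpoint["acoustic-model"])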

Vietnamese language VC

Hi @bshall, can the pre-trained hubert-soft or hubert-discrete model be used to encode Mandarin Chinese data? I want to train a model for Vietnamese voice conversion, but train only the acoustic model and the HiFiGAN vocoder on a Vietnamese dataset.

Bug: `generate.py` failed with No such file error

Summary

Unit-to-mel inference with generate.py crashes with a missing-file error.
It is caused by a variable-name mistake in generate.py.
It can be fixed with a one-line change, so I made a pull request (#2).

Phenomena

When generate.py is run with a proper in-dir and out-dir, it crashes.
The error message says No such file or directory: 'path'.

Error messages

Generating from sample_softVC -> o_test
  0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./generate.py", line 57, in <module>
    generate(args)
  File "./generate.py", line 22, in generate
    units = np.load("path")
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'path'

Cause

In generate.py, the variable path is mistakenly written as the string literal "path":

units = np.load("path")

When I fix it, the bug disappears.
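
The fixed line (presumably what PR #2 changes) uses the variable instead of the literal:

units = np.load(path)  # load from the actual file path, not the string "path"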

Notes

I made a pull request (#2) which will fix this bug.
I am so impressed with the softVC project; if this PR helps this super cool project, I am glad.

Information about a complete training pipeline?

Greetings.

I am aware of the different repositories involved in building a voice conversion model. However, little information about the whole training pipeline is covered in them. Could the README.md file be extended with instructions for training a voice conversion model from scratch, similar to the information provided in your parallel repository hubert? Information such as:

  • Repository requirements in a requirements.txt file
  • Dataset requirements, in terms of audio characteristics, number of speakers (e.g. input and output voices) and directory structure
  • Steps required for training a model from scratch. e.g. execute preprocess.py -i foo -o bar, then train.py -i bar -o model_output...

Thanks in advance for your time.

switch to bigvgan

Hello,
I've been trying to drop in BigVGAN in place of HiFi-GAN, but I keep running into an issue related to the number of mel channels: the acoustic model is trained on 128, vs. the 100 channels BigVGAN uses. Is there a simple way to fix this, or does the acoustic model need to be retrained with 100 mel channels?
