Comments (34)
Thank you for your interest in this work. I’m currently attending a few workshops and I’ll be busy with midterm exams after that, so the model release will be delayed a little. Expect it to arrive sometime in early November.
from styletts2.
Unfortunately, somebody found a mistake in the training code and informed me by email. I checked the model, and it sounds worse than the demo because of the mistake (wrong reference audio). I have fixed the mistake, but I have to retrain the model from scratch, so now expect the model to be released by mid-November. Sorry for the delay. I believe the current code should now produce working models.
from styletts2.
@gigadunk Here's the configuration that I am currently using to train the LibriTTS model. The dataset is very big so the epochs need to be adjusted according to the quality of the model.
log_dir: "Models/LibriTTS"
first_stage_path: "first_stage.pth"
save_freq: 1
log_interval: 10
device: "cuda"
epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)
batch_size: 16
max_len: 300 # maximum number of frames
pretrained_model: "Models/LibriTTS/epoch_2nd_00005.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10,5,3,2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 4 # TMA starting epoch (1st stage)

  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)

  diff_epoch: 10 # style diffusion starting epoch (2nd stage)
  joint_epoch: 15 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 20 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
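A quick arithmetic check on a few of these values may help when adjusting max_len or the decoder settings. The sketch below copies four numbers from the config into a plain dict (the dict is only for illustration and is not how the training scripts read the file) and derives the frame duration, the audio length that max_len corresponds to, and the product of upsample_rates. Two things are assumptions here: that max_len counts spectrogram frames at this hop length, and that the product of upsample_rates is meant to equal hop_length. Treat both as sanity checks rather than documented requirements.

```python
import math

# Values copied from the config above; this dict is only an illustration.
config = {
    "max_len": 300,  # maximum number of frames
    "preprocess_params": {"sr": 24000, "spect_params": {"hop_length": 300}},
    "model_params": {"decoder": {"upsample_rates": [10, 5, 3, 2]}},
}

sr = config["preprocess_params"]["sr"]
hop_length = config["preprocess_params"]["spect_params"]["hop_length"]
upsample_rates = config["model_params"]["decoder"]["upsample_rates"]

# Duration of one frame in milliseconds.
frame_ms = 1000 * hop_length / sr
# Audio covered by max_len frames, in seconds (assumes max_len counts such frames).
max_clip_seconds = config["max_len"] * hop_length / sr
# Overall upsampling factor of the listed decoder stages.
total_upsampling = math.prod(upsample_rates)

print(f"{frame_ms} ms per frame, {max_clip_seconds} s covered by max_len")
print(f"upsampling factor {total_upsampling} vs hop_length {hop_length}")
```

If sr, hop_length or upsample_rates is changed for another dataset, rerunning this check shows at a glance whether the numbers still line up.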
from styletts2.
I think I now have a better model, and I will upload it to the repo. The quality is very close to the demo now. There are still some small, weird issues at the end of the audio for some samples (not sure what causes these). I'm investigating, and maybe I can get a better model without these problems later on.
from styletts2.
The current model quality is not bad though, so if you need the model now, you can download it here: https://drive.google.com/drive/folders/1ApqjyugCzr4EN2NFXa5Opfr3qcoapUPV?usp=sharing, but I can probably get a better model in a couple of weeks.
You only need to change the following code to run the inference:
import librosa
import torch

# Assumes `model`, `sampler`, `preprocess`, `length_to_mask`, `tokens`, `device`
# and `model_params` are already defined earlier in the existing inference code.

def compute_style(path):
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        # recent librosa versions require keyword arguments here
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))

    # acoustic style first, prosodic style second
    return torch.cat([ref_s, ref_p], dim=1)

reference = "Demo/1221-135767-0014.wav"
ref_s = compute_style(reference)

with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    text_mask = length_to_mask(input_lengths).to(device)

    t_en = model.text_encoder(tokens, input_lengths, text_mask)
    bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
    d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

    s_pred = sampler(noise = torch.randn((1, 256)).unsqueeze(1).to(device),
                     embedding=bert_dur,
                     embedding_scale=1,
                     features=ref_s, # reference from the same speaker as the embedding
                     num_steps=10).squeeze(1)

    s = s_pred[:, 128:]
    ref = s_pred[:, :128]

    alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)
    beta = 0.7 # how much you want to mix the sampled style with the original style (prosodic part)
    ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
    s = beta * s + (1 - beta) * ref_s[:, 128:]

    d = model.predictor.text_encoder(d_en,
                                     s, input_lengths, text_mask)

    x, _ = model.predictor.lstm(d)
    duration = model.predictor.duration_proj(x)

    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    # build a (tokens x frames) 0/1 alignment from the predicted durations
    pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
    c_frame = 0
    for i in range(pred_aln_trg.size(0)):
        pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
        c_frame += int(pred_dur[i].data)

    # encode prosody
    en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(en)
        asr_new[:, :, 0] = en[:, :, 0]
        asr_new[:, :, 1:] = en[:, :, 0:-1]
        en = asr_new

    F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

    asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan": # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(asr)
        asr_new[:, :, 0] = asr[:, :, 0]
        asr_new[:, :, 1:] = asr[:, :, 0:-1]
        asr = asr_new

    out = model.decoder(asr,
                        F0_pred, N_pred, ref.squeeze().unsqueeze(0))
A formal inference demo, including reproduction of the audio on the demo page, will come later once the better model is done.
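The duration-to-alignment step in the snippet above can be easier to follow on a toy example. The sketch below redoes it in plain Python lists instead of tensors: it builds the 0/1 alignment from per-token durations, expands token features to frames by a matrix product, and applies the one-frame shift used for the hifigan decoder. The function names build_alignment, expand and shift_right are made up for this illustration and do not exist in the repo.

```python
def build_alignment(durations):
    """Return a tokens x frames 0/1 matrix; row i is 1 on the frames of token i."""
    total_frames = sum(durations)
    alignment = [[0] * total_frames for _ in durations]
    start = 0
    for i, dur in enumerate(durations):
        for t in range(start, start + dur):
            alignment[i][t] = 1
        start += dur
    return alignment


def expand(features, alignment):
    """Compute features @ alignment for a channels x tokens feature matrix."""
    n_tokens = len(alignment)
    n_frames = len(alignment[0])
    return [
        [sum(row[i] * alignment[i][t] for i in range(n_tokens)) for t in range(n_frames)]
        for row in features
    ]


def shift_right(frames):
    """Delay every channel by one frame, keeping the first frame in place."""
    return [row[:1] + row[:-1] for row in frames]


durations = [2, 1, 3]        # predicted frames per token
features = [[10, 20, 30]]    # one channel, three tokens

alignment = build_alignment(durations)
expanded = expand(features, alignment)
shifted = shift_right(expanded)

print(alignment)
print(expanded)
print(shifted)
```

Each token's feature is simply repeated for as many frames as its predicted duration, and the shift delays the whole sequence by one frame while duplicating the first one.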
from styletts2.
I tested it on Colab and it works, so if you want to try it now you can use this link: https://colab.research.google.com/drive/1VENAg_TeKj5a1NYMJTSrbNLDlcIT30Sh
from styletts2.
Thanks for the nice work.
I'm looking for good zero-shot TTS models, and I hope to use StyleTTS 2 (LibriTTS version) as a baseline model.
I think your model is much better than YourTTS or any recent LLM-based model in zero-shot TTS, so I hope to compare against it as the state-of-the-art model.
I would appreciate it if you could share your plan for the LibriTTS model 😃
from styletts2.
@sh-lee-prml Thanks for your appreciation of this work! As for inference with very short clips (less than one second), you probably have to repeat the reference until it reaches the minimum length, which could cause problems because there was no such data during training (clips shorter than 1 second were excluded). If you do need to run inference with very short references, you may have to retrain or fine-tune the model with shorter clips, possibly with repetition to accommodate the receptive field of the style encoder.
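Repeating a short reference until it reaches a minimum length could look roughly like the sketch below. It works on a plain list of samples; repeat_to_min_length is a hypothetical helper, and the one-second minimum at 24 kHz is an assumption taken from the training cutoff mentioned above, not a value read from the code.

```python
def repeat_to_min_length(samples, min_length):
    """Repeat the whole clip until it is at least min_length samples long."""
    if len(samples) == 0:
        raise ValueError("reference clip is empty")
    repeats = -(-min_length // len(samples))  # ceiling division
    return list(samples) * max(repeats, 1)


sr = 24000
short_clip = [0.0] * 9600       # 0.4 s of placeholder audio at 24 kHz
min_length = sr                 # assumed minimum: one second

padded = repeat_to_min_length(short_clip, min_length)
print(len(padded) / sr, "seconds after repetition")
```

The sketch keeps whole repetitions rather than cutting the last one off at exactly min_length; whether that matters for the style encoder is not something this thread establishes.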
The alpha and beta are just factors that control diversity and similarity. The higher the alpha and beta, the closer the output is to the sampled style (and thus the less similar it is to the actual reference style), and vice versa. The right values depend on the use case: do you want more diverse samples for the same text, or samples more similar to the reference? Values ranging from 0.3 to 0.5 balance diversity and similarity.
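The mixing itself is a linear interpolation between the sampled style and the reference style. A toy sketch on two-element lists (mix_style is a name invented for this example; the real code does the same arithmetic on tensors) shows the two extremes and a value in between:

```python
def mix_style(sampled, reference, weight):
    """Interpolate: weight=0 returns the reference, weight=1 returns the sample."""
    return [weight * s + (1 - weight) * r for s, r in zip(sampled, reference)]


sampled = [1.0, 0.0]     # placeholder for a style drawn from the diffusion sampler
reference = [0.0, 1.0]   # placeholder for the style computed from reference audio

only_reference = mix_style(sampled, reference, 0.0)
only_sampled = mix_style(sampled, reference, 1.0)
blended = mix_style(sampled, reference, 0.3)

print(only_reference, only_sampled, blended)
```

With a weight of 0.3 the result stays closer to the reference than to the sample, which matches the description above of lower values favoring similarity over diversity.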
The demo page does show samples from LibriSpeech, because these were reference samples taken from the Vall-E and NaturalSpeech 2 demo pages. LibriTTS here refers to the model (i.e., the model trained on LibriTTS), not the test dataset. I have marked this difference in the paper: Table 1 shows that the test set for the zero-shot experiments was LibriSpeech, not LibriTTS.
from styletts2.
@yl4579 Hi, yl4579. I'm looking forward to the pretrained model on LibriTTS. Many thanks!
from styletts2.
Hey there @yl4579 , I'm hoping to test out the LibriTTS-trained StyleTTS 2 as well. Would it be possible to release the training config for the multi-speaker version so I can try and train it on my own machines before you release the pre-trained models?
P.S. Thanks for the work so far; the LJ version sounds very good.
from styletts2.
@yl4579
Since you're restarting the model from scratch, have you thought of implementing Descript Audio Codec?
With Descript Audio Codec, you can compress 44.1 kHz audio into discrete codes at a low 8 kbps bitrate. This universal model works on all domains (speech, environment, music, etc.), making it widely applicable to generative modeling of all audio. It can be used as a drop-in replacement for EnCodec for all audio language modeling applications (such as AudioLMs, MusicLMs, MusicGen, etc.)
https://github.com/descriptinc/descript-audio-codec
Demo:
https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5
from styletts2.
@GUUser91 Since StyleTTS2 is already an end-to-end model, meaning it generates waveforms directly from the text, I don’t see any use of this codec anywhere unless we don’t do end-to-end training, which may degrade the quality (though it could be faster in training).
from styletts2.
@yl4579 Thanks for sharing the checkpoint. Now, I'm synthesizing the speech with your model! 😀
However, I run into problems when I feed a very short reference audio to the style encoder, because of the fixed filter size of your style encoder. I have a simple trick for inference with short reference audio: replicating the audio before feeding it to the style encoder. This may resolve the issue.
Could you recommend a proper value of alpha and beta for LibriTTS samples?
# the closer the alpha is to 0, the less diversity, but the more similar it is to the reference speaker in timbre
alpha = 0.3 # how much you want to mix the sampled style with the original style (acoustic part)
# the closer the beta is to 0, the less diversity, but the more similar it is to the reference speaker in prosody
beta = 0.5 # how much you want to mix the sampled style with the original style (prosodic part)
ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
s = beta * s + (1 - beta) * ref_s[:, 128:]
In addition, I entirely agree that an audio codec is not required for your model. The audio quality of StyleTTS 2 is already better than recently proposed two-stage models such as Vall-E or NaturalSpeech 2 in terms of naturalness. Using an audio codec would decrease the audio quality.
I also found a typo on your demo page.
https://styletts2.github.io/#libri
I could not find these samples in LibriTTS. It seems that they are from LibriSpeech, not LibriTTS.😉
Thanks again
from styletts2.
Since you are retraining... Would you be open to sharing the model weights in checkpoints instead of waiting for it to be fully trained? @yl4579
from styletts2.
@pawngrubber There are multiple stages, and it is quite inconvenient to upload the checkpoints, as each one of them is around 2 GB.
from styletts2.
Hi, I just tried your new Colab notebook, and it works great.
But I ran into a problem when I tried to change the text to Chinese:
text = "如果这不是您看到的号码,请检查计算机上的代理设置。"
It shows
phonemizer:words count mismatch on 100.0% of the lines (1/1)
I searched online and tried converting the text to pinyin. The model can read it, but not well; I can't understand the audio without looking at the text.
text = "rú guǒ zhè bú shì nín kàn dào de hào mǎ, qǐng jiǎnchá jìsuànjī shàng de dàilǐ shèzhì."
This is the generated audio:
https://soundcloud.com/wooden-tank/chinese
Can you give me some tips for getting a better result? Thank you.
from styletts2.
@yl4579
How come I get this error message if I try to fine-tune the LibriTTS model?
accelerate launch train_first.py --config_path ./Configs/config.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
bert loaded
Traceback (most recent call last):
  File "/home/user/StyleTTS2/train_first.py", line 444, in <module>
    main()
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/StyleTTS2/train_first.py", line 152, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
  File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomAlbert:
Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.embedding_hidden_mapping_in.weight", "encoder.embedding_hidden_mapping_in.bias", "encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight", "encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "pooler.weight", "pooler.bias".
Unexpected key(s) in state_dict: "module.embeddings.word_embeddings.weight", "module.embeddings.position_embeddings.weight", "module.embeddings.token_type_embeddings.weight", "module.embeddings.LayerNorm.weight", "module.embeddings.LayerNorm.bias", "module.encoder.embedding_hidden_mapping_in.weight", "module.encoder.embedding_hidden_mapping_in.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.query.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.key.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.value.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn.bias", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.weight", "module.encoder.albert_layer_groups.0.albert_layers.0.ffn_output.bias", "module.pooler.weight", "module.pooler.bias".
Traceback (most recent call last):
  File "/home/user/StyleTTS2/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/StyleTTS2/venv/bin/python3.10', 'train_first.py', '--config_path', './Configs/config.yml']' returned non-zero exit status 1.
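In the error above, the missing and unexpected keys differ only by a leading "module." prefix, which is the pattern that typically appears when weights were saved from a wrapped (for example data-parallel) module and are loaded into an unwrapped one. A generic way to inspect or normalize such keys is sketched below on a plain dictionary with placeholder values. strip_module_prefix is a hypothetical helper, and this is not a fix confirmed anywhere in this thread; it only illustrates what the key mismatch means.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Placeholder values stand in for the real weight tensors.
saved = OrderedDict([
    ("module.embeddings.word_embeddings.weight", "w0"),
    ("module.pooler.weight", "w1"),
    ("module.pooler.bias", "b1"),
])

cleaned = strip_module_prefix(saved)
print(list(cleaned.keys()))
```

Keys that do not start with the prefix are passed through unchanged, so the helper is safe to apply to a checkpoint that was saved without wrapping.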
from styletts2.
As far as I know, this pretrained model does not support Chinese or pinyin input; it only supports English phonemes. I hope this helps.
from styletts2.
I have started training by loading the pretrained stage-2 model. I found your mistake: this pretrained model is a stage-2 checkpoint, but your command (train_first.py) runs stage 1. I hope this helps.
from styletts2.
@WendongGan
Thank you. Now I get an OOM error message. I set the batch size to 4 and lowered batch_percentage to 0.3. I have a GeForce RTX 4060 Ti 16GB.
Edit: Never mind. I fixed it by lowering max_len to 60.
from styletts2.
I'm using a fine-tuned model with yl4579's StyleTTS2_libritts_debug.ipynb file. I get this error message after setting the notebook to use the fine-tuned model:
ValueError                                Traceback (most recent call last)
Cell In[55], line 40
     36 duration = torch.sigmoid(duration).sum(axis=-1)
     37 pred_dur = torch.round(duration.squeeze()).clamp(min=1)
---> 40 pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
     41 c_frame = 0
     42 for i in range(pred_aln_trg.size(0)):

ValueError: cannot convert float NaN to integer
I set the finetuned model settings to this:
params_whole = torch.load("LibriTTS_debug/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']
Edit: Never mind again. I fixed the problem by reinstalling StyleTTS2 and fine-tuning the model again. The only thing I edited in yl4579's StyleTTS2_libritts_debug.ipynb file was the location of the .pth file. After that I no longer got the error message.
params_whole = torch.load("/home/user/StyleTTS2/Models/LibriTTS/epoch_2nd_00004.pth", map_location='cpu')
params = params_whole['net']
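The ValueError above is what Python raises whenever a NaN float is converted with int(), so it points at NaN values in the predicted durations rather than at the alignment code itself. A small guard like the one sketched below can make that visible earlier. total_frames is a hypothetical helper written on plain floats; the thread does not establish why the durations were NaN in the first place.

```python
import math


def total_frames(durations):
    """Sum per-token durations, failing early with a clear message on NaN."""
    bad = [i for i, d in enumerate(durations) if math.isnan(d)]
    if bad:
        raise ValueError(f"NaN duration predicted at token positions {bad}")
    return int(sum(durations))


print(total_frames([2.0, 1.0, 3.0]))

try:
    total_frames([2.0, float("nan"), 3.0])
except ValueError as err:
    print("caught:", err)
```

Reporting the offending token positions can help tell apart a single bad input from a model whose outputs are NaN everywhere.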
from styletts2.
@GUUser91 I'm having the same issue trying to fine-tune.
Can you share the StyleTTS2_libritts_debug.ipynb notebook? I can't seem to find it.
from styletts2.
I was just looking at that. It appears to demonstrate inference using reference audio, not fine-tuning of the model.
from styletts2.
@eschmidbauer
I edited the config file as in
#20 (comment)
Then I fine-tuned the model:
python train_second.py --config_path ./Configs/config.yml
from styletts2.
thanks @GUUser91 that is what i was looking for!! appreciate the help
from styletts2.
Heya @yl4579. I'm a little confused, I thought you trained the model used for the StyleTTS2 Demo.
Why are you retraining it if you already have the model used for the demo?
I'm probably missing context or something.
Thanks :)
from styletts2.
@gigadunk The reason is that I want to test whether the code in the repo works. I want to reproduce the models I used for the paper with the cleaned code, as it can be a little different from the code I had for the experiments (in Jupyter notebooks). See #1 for more context. The quality is very similar now, except for a weird pulse at the end of the speech (only for some references), which can easily be fixed with [:-50] (removing the last 50 samples). I believe this is a minor issue that may be caused by some preprocessing in meldataset.py being a little different from what I used for the LibriTTS dataset in the paper.
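As a rough illustration of how little audio the [:-50] workaround removes, the sketch below trims a placeholder waveform stored as a plain list. trim_tail is a name made up for this example, and the 24 kHz rate is taken from the config shared earlier in this thread.

```python
def trim_tail(samples, n_tail=50):
    """Drop the last n_tail samples (equivalent to samples[:-n_tail])."""
    return samples[:-n_tail] if n_tail > 0 else samples


sr = 24000
wav = [0.0] * sr                  # one second of placeholder audio
trimmed = trim_tail(wav)

removed_ms = 1000 * (len(wav) - len(trimmed)) / sr
print(len(trimmed), "samples kept,", round(removed_ms, 2), "ms removed")
```

Fifty samples at 24 kHz is about two milliseconds, so the trim should be inaudible apart from removing the pulse.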
from styletts2.
@yl4579 Thanks for the clarification :)
I'm hyped to play around with the new model, when will it be on the repo?
from styletts2.
@gigadunk I'm making the demo now. It should be up today.
from styletts2.
I have pushed the demo notebook and uploaded the model. This issue should now be complete. If you find other problems with the model, please open new issues.
from styletts2.
could you share the checkpoint for fine-tuning?
from styletts2.
@eschmidbauer It’s in the README now.
from styletts2.
OK, I tried fine-tuning with the LibriTTS model and I get a missing-state error. Perhaps the config I'm using no longer works with that pretrained model.
from styletts2.