zhangxinfd / speechtokenizer

This is the code for the SpeechTokenizer model presented in the paper "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on the home page.

Home Page: https://0nutation.github.io/SpeechTokenizer.github.io/

License: Apache License 2.0

Languages: Python 99.29%, Shell 0.71%, Roff 0.01%

speechtokenizer's People

Contributors

0nutation, eltociear, karthik19967829, keikinn, zhangxinfd


speechtokenizer's Issues

About HuBERT unit

Is HuBERT unit a method that performs k-means on the features of HuBERT output, as implemented in speech2unit described at https://github.com/facebookresearch/speech-resynthesis?

Also, when I use that method to get HuBERT units from a speech waveform of shape (1, 32000), the output of the first RVQ layer has shape (1, 100, 128), while the HuBERT features have shape (1, 99, 768) and the HuBERT units have shape (1, 99). Is this the right way to get the HuBERT units?
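For reference, a minimal sketch of the unitization being asked about (an assumption based on the speech-resynthesis speech2unit recipe, not code from this repo): frame-level HuBERT features are assigned to their nearest k-means centroid, and the centroid indices are the discrete units.

```python
import numpy as np

# Sketch only: "HuBERT units" as nearest-centroid ids of frame-level
# features, following the speech2unit recipe (assumption, not repo code).
def assign_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """features: (T, D) frames; centroids: (K, D) k-means centers -> (T,) unit ids."""
    # Squared Euclidean distance from every frame to every centroid: (T, K)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example matching the shapes in the question: 99 frames of
# 768-dim HuBERT features, 100 k-means clusters (sizes illustrative).
rng = np.random.default_rng(0)
feats = rng.normal(size=(99, 768))
cents = rng.normal(size=(100, 768))
units = assign_units(feats, cents)  # (99,) integer unit ids
```

This matches the (1, 99) unit shape mentioned above: one integer id per HuBERT frame.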

ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'

python3.10
huggingface_hub: 0.20.3

cmd:

from speechtokenizer import SpeechTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/__init__.py", line 2, in <module>
    from .trainer import SpeechTokenizerTrainer
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/trainer/__init__.py", line 1, in <module>
    from .trainer import SpeechTokenizerTrainer
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/trainer/trainer.py", line 20, in <module>
    from accelerate import Accelerator, DistributedType, DistributedDataParallelKwargs, DataLoaderConfiguration
  File "/home/ehosseiniasl/.local/lib/python3.10/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/home/ehosseiniasl/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 34, in <module>
    from huggingface_hub import split_torch_state_dict_into_shards
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/home/ehosseiniasl/.local/lib/python3.10/site-packages/huggingface_hub/__init__.py)
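A likely cause (an assumption, since the exact minimum depends on the installed accelerate release): `split_torch_state_dict_into_shards` only exists in newer `huggingface_hub` releases, so the reported 0.20.3 is too old, and `pip install --upgrade huggingface_hub` should resolve the import error. A stdlib-only way to compare the installed version against a minimum:

```python
# Stdlib-only version gate. Using 0.23.0 as the minimum is an assumption --
# check the accelerate release notes for the exact requirement.
def version_tuple(v: str) -> tuple:
    """'0.20.3' -> (0, 20, 3) so versions compare numerically, not as strings."""
    return tuple(int(p) for p in v.split("."))

installed = "0.20.3"  # version from the report above
required = "0.23.0"   # assumed minimum exposing split_torch_state_dict_into_shards
needs_upgrade = version_tuple(installed) < version_tuple(required)
```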

Cross-lingual

Hello, I used the checkpoint you trained on LibriSpeech to run inference on Chinese audio, and it still works well. Is that what you expected? Your dataset doesn't seem to include Chinese, only English data.

Training process?

Hi Authors,

This paper is so unbelievably useful for my research - thank you, what a great idea and paper this is!
I'm looking through the code to learn about the modification to the quantizer that lets it learn semantic distillation from HuBERT after the first RVQ pass, but it seems the uploaded model is just EnCodec? I can't find the relevant snippet, nor can I piece together how to train this.

Thank you

Batch size 10 uses 63.72 GB of GPU memory??? (all files are 3 s wavs)

I would like to ask the following two questions and hope you can help:

  1. I'm fine-tuning SpeechTokenizer on my own dataset. I cut the data into 3 s wav files, but with a batch size of 10 the code uses a whopping 63.72 GB of GPU memory. Is that reasonable, or could something in my other settings be wrong?

  2. Also, roughly what values should the losses (Gen Loss, Mel Error, Q Loss, Distill Loss) converge to after training? I'd like to make a preliminary judgement on how well the fine-tuning went by comparing against yours.

Here's what my fine-tuning currently looks like:

Epoch 0 -- Step 48410: Gen Loss: 192.288; Mel Error:0.343; Q Loss: 5.999; Distill Loss: 0.643; Time cost per step: 3.779s
Epoch 0 -- Step 48420: Gen Loss: 200.187; Mel Error:0.329; Q Loss: 6.648; Distill Loss: 0.627; Time cost per step: 3.761s
Epoch 0 -- Step 48430: Gen Loss: 206.039; Mel Error:0.341; Q Loss: 6.689; Distill Loss: 0.616; Time cost per step: 3.722s
Epoch 0 -- Step 48440: Gen Loss: 178.676; Mel Error:0.358; Q Loss: 5.539; Distill Loss: 0.615; Time cost per step: 3.758s
Epoch 0 -- Step 48450: Gen Loss: 188.434; Mel Error:0.327; Q Loss: 5.698; Distill Loss: 0.625; Time cost per step: 3.734s
Epoch 0 -- Step 48460: Gen Loss: 185.933; Mel Error:0.348; Q Loss: 5.768; Distill Loss: 0.620; Time cost per step: 3.711s
Epoch 0 -- Step 48470: Gen Loss: 196.693; Mel Error:0.344; Q Loss: 6.094; Distill Loss: 0.621; Time cost per step: 3.733s
Epoch 0 -- Step 48480: Gen Loss: 206.974; Mel Error:0.323; Q Loss: 7.111; Distill Loss: 0.607; Time cost per step: 3.739s
Epoch 0 -- Step 48490: Gen Loss: 201.758; Mel Error:0.370; Q Loss: 6.769; Distill Loss: 0.625; Time cost per step: 3.692s

Thanks!

Distill loss weight?

I can't find the distillation loss weight in the paper. I tried to reproduce the distillation experiment on DAC; the reconstruction seems normal, but the first codebook doesn't seem to disentangle the semantic information.

How to deal with the integer values of RVQ

Hi author,
I've been experimenting with encoding audio using your fantastic method, and I noticed that the RVQ (Residual Vector Quantization) values I obtain are integers.

I'm curious if this is expected behavior. Additionally, I'm interested in using these encoded features for downstream tasks, but I'm unsure about how to adjust these integer values for training purposes. Would it be appropriate to apply normalization techniques such as min-max scaling or Z-Score normalization? The distribution of these encoded feature values is unknown to me, so I'm seeking guidance on how to handle them effectively for training.

Any advice or suggestions on how to deal with these encoded feature values would be greatly appreciated.

Thank you!
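For what it's worth, integer outputs are expected here: RVQ codes are codebook indices, not continuous features, so downstream models typically embed them with a lookup table rather than applying min-max or z-score normalization. A toy sketch of that lookup (made-up sizes, not the SpeechTokenizer API):

```python
import numpy as np

# Toy sketch: discrete RVQ codes index into a codebook, so a downstream
# model turns them into continuous vectors via lookup (an embedding),
# not via normalization. Sizes are illustrative, not SpeechTokenizer's.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))          # 1024 entries, embedding dim 8
codes = np.array([[5, 17, 902], [3, 3, 44]])   # (batch, time) integer token ids
embedded = codebook[codes]                     # (2, 3, 8) continuous vectors
```

In a trained model the lookup table is either the quantizer's own codebook or a freshly learned embedding (e.g. `torch.nn.Embedding`); either way the integer ids themselves are never treated as magnitudes.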
