zhangxinfd / speechtokenizer

This is the code for the SpeechTokenizer model presented in the paper "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on the home page.

Home Page: https://0nutation.github.io/SpeechTokenizer.github.io/

License: Apache License 2.0

Languages: Python 99.29%, Shell 0.71%, Roff 0.01%

speechtokenizer's People

Contributors

0nutation, eltociear, karthik19967829, keikinn, zhangxinfd


speechtokenizer's Issues

About HuBERT unit

Is HuBERT unit a method that performs k-means on the features of HuBERT output, as implemented in speech2unit described at https://github.com/facebookresearch/speech-resynthesis?

Also, when I use that method to get HuBERT units from a speech waveform of shape (1, 32000), the output of the first RVQ layer has shape (1, 100, 128), while the HuBERT features have shape (1, 99, 768) and the HuBERT units have shape (1, 99). Is this the right way to get the HuBERT units?
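For reference, a minimal sketch of the unitization being asked about (an assumption based on the speech-resynthesis speech2unit recipe, not code from this repo): frame-level HuBERT features are assigned to their nearest k-means centroid, and the centroid indices are the discrete units.

```python
import numpy as np

# Sketch only: "HuBERT units" as nearest-centroid ids of frame-level
# features, following the speech2unit recipe (assumption, not repo code).
def assign_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """features: (T, D) frames; centroids: (K, D) k-means centers -> (T,) unit ids."""
    # Squared Euclidean distance from every frame to every centroid: (T, K)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example matching the shapes in the question: 99 frames of
# 768-dim HuBERT features, 100 k-means clusters (sizes illustrative).
rng = np.random.default_rng(0)
feats = rng.normal(size=(99, 768))
cents = rng.normal(size=(100, 768))
units = assign_units(feats, cents)  # (99,) integer unit ids
```

This matches the (1, 99) unit shape mentioned above: one integer id per HuBERT frame.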

ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'

python3.10
huggingface_hub: 0.20.3

cmd:

from speechtokenizer import SpeechTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/__init__.py", line 2, in <module>
    from .trainer import SpeechTokenizerTrainer
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/trainer/__init__.py", line 1, in <module>
    from .trainer import SpeechTokenizerTrainer
  File "/home/ehosseiniasl/github_repos/SpeechTokenizer/speechtokenizer/trainer/trainer.py", line 20, in <module>
    from accelerate import Accelerator, DistributedType, DistributedDataParallelKwargs, DataLoaderConfiguration
  File "/home/ehosseiniasl/.local/lib/python3.10/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/home/ehosseiniasl/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 34, in <module>
    from huggingface_hub import split_torch_state_dict_into_shards
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/home/ehosseiniasl/.local/lib/python3.10/site-packages/huggingface_hub/__init__.py)
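A likely cause (an assumption, since the exact minimum depends on the installed accelerate release): `split_torch_state_dict_into_shards` only exists in newer `huggingface_hub` releases, so the reported 0.20.3 is too old, and `pip install --upgrade huggingface_hub` should resolve the import error. A stdlib-only way to compare the installed version against a minimum:

```python
# Stdlib-only version gate. Using 0.23.0 as the minimum is an assumption --
# check the accelerate release notes for the exact requirement.
def version_tuple(v: str) -> tuple:
    """'0.20.3' -> (0, 20, 3) so versions compare numerically, not as strings."""
    return tuple(int(p) for p in v.split("."))

installed = "0.20.3"  # version from the report above
required = "0.23.0"   # assumed minimum exposing split_torch_state_dict_into_shards
needs_upgrade = version_tuple(installed) < version_tuple(required)
```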

Cross-lingual

Hello, I used the checkpoint you trained on LibriSpeech to run inference on Chinese audio, and it still works well. Is that what you expected? Your dataset doesn't seem to include Chinese, only English data.

Training process?

Hi Authors,

This paper is so unbelievably useful for my research - thank you, what a great idea and paper this is!
I'm looking through the code to learn about the modification to the quantizer that lets it learn semantic distillation from HuBERT after the first RVQ pass, but it seems the uploaded model is just EnCodec? I can't find the relevant snippet, nor can I piece together how to train this.

Thank you

Batch size 10 uses 63.72 GB of GPU memory??? (all files are 3 s wavs)

I would like to ask the following two questions and hope you can help:

  1. I'm fine-tuning SpeechTokenizer on my own dataset. I cut the data into 3 s wav files, but with a batch size of 10 the code uses a whopping 63.72 GB of GPU memory. Is that reasonable, or could something in my other settings be wrong?

  2. Also, roughly what values should the losses (Gen Loss, Mel Error, Q Loss, Distill Loss) converge to after training? I'd like to make a preliminary judgement on how well the fine-tuning went by comparing against yours.

Here's what my fine-tuning currently looks like:

Epoch 0 -- Step 48410: Gen Loss: 192.288; Mel Error:0.343; Q Loss: 5.999; Distill Loss: 0.643; Time cost per step: 3.779s
Epoch 0 -- Step 48420: Gen Loss: 200.187; Mel Error:0.329; Q Loss: 6.648; Distill Loss: 0.627; Time cost per step: 3.761s
Epoch 0 -- Step 48430: Gen Loss: 206.039; Mel Error:0.341; Q Loss: 6.689; Distill Loss: 0.616; Time cost per step: 3.722s
Epoch 0 -- Step 48440: Gen Loss: 178.676; Mel Error:0.358; Q Loss: 5.539; Distill Loss: 0.615; Time cost per step: 3.758s
Epoch 0 -- Step 48450: Gen Loss: 188.434; Mel Error:0.327; Q Loss: 5.698; Distill Loss: 0.625; Time cost per step: 3.734s
Epoch 0 -- Step 48460: Gen Loss: 185.933; Mel Error:0.348; Q Loss: 5.768; Distill Loss: 0.620; Time cost per step: 3.711s
Epoch 0 -- Step 48470: Gen Loss: 196.693; Mel Error:0.344; Q Loss: 6.094; Distill Loss: 0.621; Time cost per step: 3.733s
Epoch 0 -- Step 48480: Gen Loss: 206.974; Mel Error:0.323; Q Loss: 7.111; Distill Loss: 0.607; Time cost per step: 3.739s
Epoch 0 -- Step 48490: Gen Loss: 201.758; Mel Error:0.370; Q Loss: 6.769; Distill Loss: 0.625; Time cost per step: 3.692s

Thanks!

Distill loss weight?

I can't find the distillation loss weight in the paper. I tried to reproduce the distillation experiment on DAC; the reconstruction seems normal, but the first codebook doesn't seem to disentangle the semantic information.

How to deal with the integer values of RVQ

Hi author,
I've been experimenting with encoding audio using your fantastic method, and I noticed that the RVQ (Residual Vector Quantization) values I obtain are integers.

I'm curious if this is expected behavior. Additionally, I'm interested in using these encoded features for downstream tasks, but I'm unsure about how to adjust these integer values for training purposes. Would it be appropriate to apply normalization techniques such as min-max scaling or Z-Score normalization? The distribution of these encoded feature values is unknown to me, so I'm seeking guidance on how to handle them effectively for training.

Any advice or suggestions on how to deal with these encoded feature values would be greatly appreciated.

Thank you!
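For what it's worth, integer outputs are expected here: RVQ codes are codebook indices, not continuous features, so downstream models typically embed them with a lookup table rather than applying min-max or z-score normalization. A toy sketch of that lookup (made-up sizes, not the SpeechTokenizer API):

```python
import numpy as np

# Toy sketch: discrete RVQ codes index into a codebook, so a downstream
# model turns them into continuous vectors via lookup (an embedding),
# not via normalization. Sizes are illustrative, not SpeechTokenizer's.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))          # 1024 entries, embedding dim 8
codes = np.array([[5, 17, 902], [3, 3, 44]])   # (batch, time) integer token ids
embedded = codebook[codes]                     # (2, 3, 8) continuous vectors
```

In a trained model the lookup table is either the quantizer's own codebook or a freshly learned embedding (e.g. `torch.nn.Embedding`); either way the integer ids themselves are never treated as magnitudes.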
