v3ucn / chattts

This project forked from 2noise/chattts

A generative speech model for daily dialogue.

Home Page: https://2noise.com/

License: Other

Python 97.65% Go 2.35%

chattts's Introduction

ChatTTS

A generative speech model for daily dialogue.

Introduction

ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.

Supported Languages

  • English
  • Chinese
  • Coming Soon...

Highlights

Refer to this video on Bilibili for a detailed description.

  1. Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
  2. Fine-grained Control: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
  3. Better Prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.

Dataset & Model

  • The main model is trained on 100,000+ hours of Chinese and English audio data.
  • The open-source version on HuggingFace is a 40,000-hour pre-trained model without SFT (supervised fine-tuning).

Roadmap

  • Open-source the 40k hour base model and spk_stats file
  • Open-source VQ encoder and LoRA training code
  • Streaming audio generation without refining the text*
  • Open-source the 40k hour version with multi-emotion control
  • ChatTTS.cpp maybe? (A PR or a new repo is welcome.)

Disclaimer

Important

This repo is for academic purposes only.

It is intended for educational and research use and should not be used for any commercial or legal purposes. The authors do not guarantee the accuracy, completeness, or reliability of the information. The information and data used in this repo are for academic and research purposes only. The data were obtained from publicly available sources, and the authors do not claim any ownership or copyright over the data.

ChatTTS is a powerful text-to-speech system. However, it is very important to utilize this technology responsibly and ethically. To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model, and compressed the audio quality as much as possible using MP3 format, to prevent malicious actors from potentially using it for criminal purposes. At the same time, we have internally trained a detection model and plan to open-source it in the future.

Contact

GitHub issues/PRs are always welcome.

Formal Inquiries

For formal inquiries about the model and roadmap, please contact us at [email protected].

Online Chat

1. QQ Group (Chinese Social APP)
  • Group 1, 808364215 (Full)
  • Group 2, 230696694 (Full)
  • Group 3, 933639842

Installation (WIP)

The package will be uploaded to PyPI soon, according to 2noise#269.

1. Install Directly

pip install git+https://github.com/2noise/ChatTTS

2. Install from conda

git clone https://github.com/2noise/ChatTTS
cd ChatTTS
conda create -n chattts
conda activate chattts
pip install -r requirements.txt

Get Started

Install requirements

pip install --upgrade -r requirements.txt

Quick Start

1. Launch WebUI

python examples/web/webui.py

2. Infer by Command Line

It will save audio to ./output_audio_xxx.wav

python examples/cmd/run.py "Please input your text."

Basic

import ChatTTS
import torch  # needed for torch.from_numpy below
from IPython.display import Audio
import torchaudio

chat = ChatTTS.Chat()
chat.load_models(compile=False) # Set to True for better performance

texts = ["PUT YOUR TEXT HERE",]

wavs = chat.infer(texts, )

torchaudio.save("output1.wav", torch.from_numpy(wavs[0]), 24000)
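
If torchaudio is not available, the waveform can also be written with Python's standard `wave` module. The following is a minimal sketch, assuming the samples are floats in the range [-1, 1] at the 24,000 Hz sample rate used above. `save_wav_pcm16` is a hypothetical helper name and is not part of ChatTTS; the sine tone only stands in for model output.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav_pcm16(path, samples, sample_rate=24000):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the value fits in int16
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for model output: one second of a 440 Hz sine tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
demo_path = os.path.join(tempfile.gettempdir(), "chattts_tone_demo.wav")
save_wav_pcm16(demo_path, tone)
```

With real output, a call along the lines of `save_wav_pcm16("output1.wav", wavs[0].flatten())` should work, assuming `wavs[0]` is a NumPy array as the `torch.from_numpy` call above suggests.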

Advanced

###################################
# Sample a speaker from Gaussian.

rand_spk = chat.sample_random_speaker()

params_infer_code = {
  'spk_emb': rand_spk, # add sampled speaker 
  'temperature': .3, # using custom temperature
  'top_P': 0.7, # top P decode
  'top_K': 20, # top K decode
}

###################################
# For sentence level manual control.

# use oral_(0-9), laugh_(0-2), break_(0-7) 
# to generate special token in text to synthesize.
params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_6]'
} 

wavs = chat.infer(texts, params_refine_text=params_refine_text, params_infer_code=params_infer_code)

###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text,  params_infer_code=params_infer_code)
torchaudio.save("output2.wav", torch.from_numpy(wavs[0]), 24000)
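
The sentence-level prompt string can be assembled programmatically so that out-of-range values are caught before inference. This is a minimal sketch: the ranges come from the comment above (`oral_(0-9)`, `laugh_(0-2)`, `break_(0-7)`), while `build_refine_prompt` and `PROMPT_RANGES` are hypothetical names introduced for illustration, not part of ChatTTS.

```python
# Upper bounds taken from the comment above: oral_(0-9), laugh_(0-2), break_(0-7).
PROMPT_RANGES = {"oral": 9, "laugh": 2, "break": 7}


def build_refine_prompt(oral, laugh, break_):
    """Return a prompt string such as '[oral_2][laugh_0][break_6]'."""
    values = {"oral": oral, "laugh": laugh, "break": break_}
    parts = []
    for name, value in values.items():
        upper = PROMPT_RANGES[name]
        if not isinstance(value, int) or not 0 <= value <= upper:
            raise ValueError(f"{name} must be an integer in 0..{upper}, got {value!r}")
        parts.append(f"[{name}_{value}]")
    return "".join(parts)


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, break_=6)}
print(params_refine_text["prompt"])  # [oral_2][laugh_0][break_6]
```

The resulting dictionary can be passed as `params_refine_text` to `chat.infer` in the same way as the hand-written prompt above.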

Example: self introduction

inputs_en = """
chat T T S is a text to speech model designed for dialogue applications. 
[uv_break]it supports mixed language input [uv_break]and offers multi speaker 
capabilities with precise control over prosodic elements [laugh]like like 
[uv_break]laughter[laugh], [uv_break]pauses, [uv_break]and intonation. 
[uv_break]it delivers natural and expressive speech,[uv_break]so please
[uv_break] use the project responsibly at your own risk.[uv_break]
""".replace('\n', '') # English is still experimental.

params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_4]'
} 
# audio_array_cn = chat.infer(inputs_cn, params_refine_text=params_refine_text)
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)
torchaudio.save("output3.wav", torch.from_numpy(audio_array_en[0]), 24000)

[Audio samples: intro_en_m.webm, intro_en_f.webm]

FAQ

1. How much VRAM do I need? What about inference speed?

For a 30-second audio clip, at least 4GB of GPU memory is required. On a 4090 GPU, the model generates audio corresponding to approximately 7 semantic tokens per second. The Real-Time Factor (RTF) is around 0.3.
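
As a rough worked example, RTF is conventionally defined as synthesis time divided by the duration of the generated audio, so an RTF of 0.3 implies about 9 seconds of compute for a 30-second clip. The sketch below only does this arithmetic; `synthesis_seconds` is a hypothetical helper, and actual timings depend on hardware and settings.

```python
def synthesis_seconds(audio_seconds, rtf=0.3):
    """Estimate compute time from audio duration, using RTF = compute time / audio duration."""
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("audio_seconds and rtf must be non-negative")
    return audio_seconds * rtf


# A 30-second clip at the RTF of roughly 0.3 quoted above.
print(f"{synthesis_seconds(30):.1f} s")  # 9.0 s
```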

2. Model stability is not good enough, with issues such as multiple speakers or poor audio quality.

This is a problem that typically occurs with autoregressive models (e.g., bark and valle). It is generally difficult to avoid. You can generate multiple samples and pick a suitable result.

3. Besides laughter, can we control anything else? Can we control other emotions?

In the current released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.
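
Since only these three word-level tokens are listed, it can help to check input text for other bracketed tokens before synthesis. The following is a minimal sketch; `SUPPORTED_TOKENS` and `find_unsupported_tokens` are hypothetical names for illustration, and the regular expression assumes tokens consist of letters, digits, and underscores inside square brackets.

```python
import re

# Token-level control units named in the FAQ answer above.
SUPPORTED_TOKENS = {"laugh", "uv_break", "lbreak"}


def find_unsupported_tokens(text):
    """Return bracketed tokens in `text` that are not in SUPPORTED_TOKENS."""
    found = re.findall(r"\[([A-Za-z0-9_]+)\]", text)
    return [tok for tok in found if tok not in SUPPORTED_TOKENS]


text = "What is [uv_break]your favorite english food?[laugh][lbreak]"
print(find_unsupported_tokens(text))                 # []
print(find_unsupported_tokens("Hello [cry] there"))  # ['cry']
```

Sentence-level tokens such as `[oral_2]` belong in the `params_refine_text` prompt, not in the text itself, so this check would flag them if they appear inline.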

Acknowledgements

  • bark, XTTSv2 and valle demonstrate remarkable TTS results with autoregressive-style systems.
  • fish-speech reveals the capability of GVQ as an audio tokenizer for LLM modeling.
  • vocos, which is used as a pretrained vocoder.

Thanks to all contributors for their efforts

chattts's People

Contributors

lich99, fumiama, anyvoiceai, yuan-manx, cronrpc, jonahzheng, libukai, neverbiasu, fuyuwei01, honestqiao, eltociear, pjq, leon0425, mike-freeai, ain-soph, utshomax, gary149, cnjack, andylida, asamaayako, 6drf21e, kaixindelele, rasonyang, wuhongsheng, ox0400
