matatonic / openedai-speech

An OpenAI API compatible text to speech server using Coqui AI's xtts_v2 and/or piper tts as the backend.

License: GNU Affero General Public License v3.0

Python 85.81% Dockerfile 3.45% Shell 8.33% Batchfile 2.42%

openedai-speech's Introduction

OpenedAI Speech

An OpenAI API compatible text to speech server.

  • Compatible with the OpenAI audio/speech API
  • Serves the /v1/audio/speech endpoint
  • Not affiliated with OpenAI in any way, does not require an OpenAI API Key
  • A free, private, text-to-speech server with custom voice cloning

Full Compatibility:

  • tts-1: alloy, echo, fable, onyx, nova, and shimmer (configurable)
  • tts-1-hd: alloy, echo, fable, onyx, nova, and shimmer (configurable, uses OpenAI samples by default)
  • response_format: mp3, opus, aac, flac, wav and pcm
  • speed 0.25-4.0 (and more)

Details:

  • Model tts-1 via piper tts (very fast, runs on cpu)
    • You can map your own piper voices via the voice_to_speaker.yaml configuration file
  • Model tts-1-hd via coqui-ai/TTS xtts_v2 voice cloning (fast, but requires around 4GB GPU VRAM)
  • Occasionally, certain words or symbols may sound incorrect; you can fix them with regex via pre_process_map.yaml (see the sketch after this list)
  • Tested with Python 3.9-3.11; piper does not install on Python 3.12 yet
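
As a rough illustration of how those regex rules are applied, here is a minimal sketch; it assumes (as in the shipped default file) that pre_process_map.yaml is a list of [pattern, replacement] pairs:

import re
import yaml

# Load the list of [pattern, replacement] pairs and apply them in order.
with open('config/pre_process_map.yaml', 'r', encoding='utf8') as f:
    pre_process_map = yaml.safe_load(f)

text = "Dr. Smith lives on 5th Ave."
for pattern, replacement in pre_process_map:
    text = re.sub(pattern, replacement, text)
print(text)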

If you find a better voice match for tts-1 or tts-1-hd, please let me know so I can update the defaults.

Recent Changes

Version 0.18.2, 2024-08-16

  • Fix docker building for amd64, refactor github actions again, free up more disk space

Version 0.18.1, 2024-08-15

  • refactor github actions

Version 0.18.0, 2024-08-15

  • Allow folders of wav samples in xtts. Samples will be combined, allowing for mixed voices and collections of small samples. Still limited to 30 seconds total. Thanks @nathanhere.
  • Fix missing yaml requirement in -min image
  • fix fr_FR-tom-medium and other 44khz piper voices (detect non-default sample rates)
  • minor updates

Version 0.17.2, 2024-07-01

  • fix -min image (re: langdetect)

Version 0.17.1, 2024-07-01

  • fix ROCm (add langdetect to requirements-rocm.txt)
  • Fix zh-cn for xtts

Version 0.17.0, 2024-07-01

Version 0.16.0, 2024-06-29

  • Multi-client safe version. Audio generation is synchronized in a single process. The estimated 'realtime' factor of XTTS on a GPU is roughly 1/3; this means that multiple simultaneous streams, or speeds over 2, may experience audio underrun (delays or pauses in playback). Multiple clients are now possible and safe, but in practice 2 or 3 simultaneous streams is the maximum without audio underrun.

Version 0.15.1, 2024-06-27

  • Remove deepspeed from requirements.txt, it's too complex for typical users. A more detailed deepspeed install document will be required.

Version 0.15.0, 2024-06-26

  • Switch to coqui-tts (updated fork), updated simpler dependencies, torch 2.3, etc.
  • Resolve cuda threading issues

Version 0.14.1, 2024-06-26

  • Make deepspeed possible (--use-deepspeed), but not enabled in pre-built docker images (too large). Requires the cuda-toolkit installed, see the Dockerfile comment for details

Version 0.14.0, 2024-06-26

  • Added response_format: wav and pcm support
  • Output streaming (while generating) for tts-1 and tts-1-hd
  • Enhanced generation parameters for xtts models (temperature, top_p, etc.)
  • Idle unload timer (optional) - doesn't work perfectly yet
  • Improved error handling

Version 0.13.0, 2024-06-25

  • Added Custom fine-tuned XTTS model support
  • Initial prebuilt arm64 image support (Apple M-series, Raspberry Pi - MPS is not supported in XTTS/torch), thanks @JakeStevenson, @hchasens
  • Initial attempt at AMD GPU (ROCm 5.7) support
  • Parler-tts support removed
  • Move the *.default.yaml to the root folder
  • Run the docker as a service by default (restart: unless-stopped)
  • Added audio_reader.py for streaming text input and reading long texts

Version 0.12.3, 2024-06-17

  • Additional logging details for BadRequests (400)

Version 0.12.2, 2024-06-16

  • Fix :min image requirements (numpy<2?)

Version 0.12.0, 2024-06-16

  • Improved error handling and logging
  • Restore the original alloy tts-1-hd voice by default, use alloy-alt for the old voice.

Version 0.11.0, 2024-05-29

  • 🌐 Multilingual support (16 languages) with XTTS
  • Remove high Unicode filtering from the default config/pre_process_map.yaml
  • Update Docker build & app startup. thanks @justinh-rahb
  • Fix: "Plan failed with a cudnnException"
  • Remove piper cuda support

Version: 0.10.1, 2024-05-05

  • Remove runtime: nvidia from docker-compose.yml; this assumes an nvidia/cuda compatible runtime is available by default. Thanks @jmtatsch

Version: 0.10.0, 2024-04-27

  • Pre-built & tested docker images, smaller docker images (8GB or 860MB)
  • Better upgrades: reorganize config files under config/, voice models under voices/
  • Compatibility! If you customized your voice_to_speaker.yaml or pre_process_map.yaml you need to move them to the config/ folder.
  • default listen host to 0.0.0.0

Version: 0.9.0, 2024-04-23

  • Fix bug with yaml and loading UTF-8
  • New sample text-to-speech application say.py
  • Smaller docker base image
  • Add beta parler-tts support (you can describe very basic features of the speaker voice). See https://www.text-description-to-speech.com/ for some examples of how to describe voices. Voices can be defined in voice_to_speaker.default.yaml, and two example parler-tts voices are included there. parler-tts is experimental software and is rather slow. The exact voice will be slightly different each generation, but should be similar to the basic description.

...

Version: 0.7.3, 2024-03-20

  • Allow different xtts versions per voice in voice_to_speaker.yaml, ex. xtts_v2.0.2
  • Quality: Fix xtts sample rate (24000 vs. 22050 for piper) and pops

Installation instructions

Create a speech.env environment file

Copy the sample.env to speech.env (customize if needed)

cp sample.env speech.env

Defaults

TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
#USE_ROCM=1

Option A: Manual installation

# install curl and ffmpeg
sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# - use requirements-rocm.txt for AMD GPU (ROCm support)
# - use requirements-min.txt for piper only (CPU only)
pip install -U -r requirements.txt
# run the server
bash startup.sh

On first run, the voice models will be downloaded automatically. This might take a while depending on your network connection.

Option B: Docker Image (recommended)

Nvidia GPU (cuda)

docker compose up

AMD GPU (ROCm support)

docker compose -f docker-compose.rocm.yml up

ARM64 (Apple M-series, Raspberry Pi)

XTTS only has CPU support here and will be very slow; you can use the Nvidia image for XTTS on CPU (slow), or use the piper-only image (recommended).

CPU only, No GPU (piper only)

For a minimal docker image with only piper support (<1GB vs. 8GB).

docker compose -f docker-compose.min.yml up

Server Options

usage: speech.py [-h] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [--unload-timer UNLOAD_TIMER] [--use-deepspeed] [--no-cache-speaker] [-P PORT] [-H HOST]
                 [-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OpenedAI Speech API Server

options:
  -h, --help            show this help message and exit
  --xtts_device XTTS_DEVICE
                        Set the device for the xtts model. The special value of 'none' will use piper for all models. (default: cuda)
  --preload PRELOAD     Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use. (default: None)
  --unload-timer UNLOAD_TIMER
                        Idle unload timer for the XTTS model in seconds, Ex. 900 for 15 minutes (default: None)
  --use-deepspeed       Use deepspeed with xtts (this option is unsupported) (default: False)
  --no-cache-speaker    Don't use the speaker wav embeddings cache (default: False)
  -P PORT, --port PORT  Server tcp port (default: 8000)
  -H HOST, --host HOST  Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
  -L {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the log level (default: INFO)

Sample Usage

You can use it like this:

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3

Or just like this:

curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3

Or like this example from the OpenAI Text to speech guide:

import openai

client = openai.OpenAI(
  # This part is not needed if you set these environment variables before import openai
  # export OPENAI_API_KEY=sk-11111111111
  # export OPENAI_BASE_URL=http://localhost:8000/v1
  api_key = "sk-111111111",
  base_url = "http://localhost:8000/v1",
)

with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Today is a wonderful day to build something people love!"
) as response:
  response.stream_to_file("speech.mp3")

Also see the say.py sample application for an example of how to use the openai-python API.

# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac

You can also try the included audio_reader.py for listening to longer text and streamed input.

Example usage:

python audio_reader.py -s 2 < LICENSE # read the software license - fast

OpenAI API Documentation and Guide

Custom Voices Howto

Piper

  1. Select the piper voice and model from the piper samples
  2. Update the config/voice_to_speaker.yaml with a new section for the voice, for example:
...
tts-1:
  ryan:
    model: voices/en_US-ryan-high.onnx
    speaker: # default speaker
  3. New models will be downloaded as needed, or you can download them in advance with download_voices_tts-1.sh. For example:
bash download_voices_tts-1.sh en_US-ryan-high

Coqui XTTS v2

Coqui XTTS v2 voice cloning can work with as little as 6 seconds of clear audio. To create a custom voice clone, you must prepare a WAV file sample of the voice.

Guidelines for preparing good sample files for Coqui XTTS v2

  • Mono (single channel) 22050 Hz WAV file
  • 6-30 seconds long - longer isn't always better (I've had some good results with as little as 4 seconds)
  • low noise (no hiss or hum)
  • No partial words, breathing, laughing, music or background sounds
  • An even speaking pace with a variety of words is best, like in interviews or audiobooks.
  • Audio longer than 30 seconds will be silently truncated.

You can use FFmpeg to prepare your audio files, here are some examples:

# convert a multi-channel audio file to mono, set the sample rate to 22050 Hz, trim to 6 seconds, and output as a WAV file.
ffmpeg -i input.mp3 -ac 1 -ar 22050 -t 6 -y me.wav
# use a simple noise filter to clean up the audio, and select a start time for sampling.
ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
# A more complex noise reduction setup, including volume adjustment
ffmpeg -i input.mkv -af "highpass=f=200, lowpass=f=3000, volume=5, afftdn=nf=25" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav

Once your WAV file is prepared, save it in the /voices/ directory and update the config/voice_to_speaker.yaml file with the new file name.

For example:

...
tts-1-hd:
  me:
    model: xtts
    speaker: voices/me.wav # this could be you

You can also use a subfolder containing multiple audio samples, to combine small samples or to mix different samples together.

For example:

...
tts-1-hd:
  mixed:
    model: xtts
    speaker: voices/mixed

Where the voices/mixed/ folder contains multiple wav files. The total audio length is still limited to 30 seconds.
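
To sanity-check that a sample folder stays under that limit, here is a minimal sketch using the standard-library wave module (it assumes plain PCM WAV files and the voices/mixed/ folder from the example above):

import glob
import wave

total = 0.0
for path in sorted(glob.glob('voices/mixed/*.wav')):
    with wave.open(path, 'rb') as w:
        seconds = w.getnframes() / w.getframerate()
        total += seconds
        print(f"{path}: {seconds:.1f}s")
print(f"total: {total:.1f}s (anything over 30 seconds will be truncated)")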

Multilingual

Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.

Coqui XTTSv2 has support for multiple languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Hungarian (hu), Korean (ko), Japanese (ja), and Hindi (hi). When not set, an attempt will be made to automatically detect the language, falling back to English (en).

Unfortunately the OpenAI API does not accept a language parameter, but you can create your own custom speaker voice and set the language for that voice.

  1. Create the WAV file for your speaker, as in Custom Voices Howto
  2. Add the voice to config/voice_to_speaker.yaml and include the correct Coqui language code for the speaker. For example:
  xunjiang:
    model: xtts
    speaker: voices/xunjiang.wav
    language: zh-cn
  3. Make sure high Unicode characters are not being removed by your config/pre_process_map.yaml! The following lines were part of the default config before version 0.11.0; if your config still contains them, remove them:

- - '[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+'
  - ''

  4. Your new multilingual speaker voice is ready to use! See the example below.
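
Once configured, the custom voice is requested like any other voice. A minimal sketch with the openai-python client, reusing the xunjiang entry from above (the Chinese input text is just an illustration):

import openai

client = openai.OpenAI(api_key="sk-111111111", base_url="http://localhost:8000/v1")

with client.audio.speech.with_streaming_response.create(
  model="tts-1-hd",
  voice="xunjiang",  # the custom voice defined above; its language comes from the config entry
  input="你好，今天过得怎么样？",
) as response:
  response.stream_to_file("xunjiang.mp3")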

Custom Fine-Tuned Model Support

Adding a custom xtts model is simple. Here is an example of how to add a custom fine-tuned 'halo' XTTS model.

  1. Save the model folder under voices/ (all 4 files are required, including the vocab.json from the model)
openedai-speech$ ls voices/halo/
config.json  vocab.json  model.pth  sample.wav
  2. Add the custom voice entry under the tts-1-hd section of config/voice_to_speaker.yaml:
tts-1-hd:
...
  halo:
    model: halo # This name is required to be unique
    speaker: voices/halo/sample.wav # voice sample is required
    model_path: voices/halo
  3. The model will be loaded when you access the voice for the first time (--preload doesn't work with custom models yet). A quick completeness check is sketched below.
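
A quick completeness check before starting the server might look like this (a sketch; voices/halo is the hypothetical folder from the example above):

import os

model_dir = "voices/halo"
required = ["config.json", "vocab.json", "model.pth", "sample.wav"]
# Report any of the four required files that are missing from the model folder.
missing = [name for name in required if not os.path.exists(os.path.join(model_dir, name))]
if missing:
    print(f"{model_dir} is missing: {', '.join(missing)}")
else:
    print(f"{model_dir} looks complete")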

Generation Parameters

XTTSv2 voice generation can be fine-tuned with the following options (defaults included below); a small script for checking the effective values follows the example:

tts-1-hd:
  alloy:
    model: xtts
    speaker: voices/alloy.wav
    enable_text_splitting: True
    length_penalty: 1.0
    repetition_penalty: 10
    speed: 1.0
    temperature: 0.75
    top_k: 50
    top_p: 0.85
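
To double-check which values a voice will actually use, you can read them back from the config. A minimal sketch, assuming the config/voice_to_speaker.yaml layout shown above:

import yaml

with open('config/voice_to_speaker.yaml', 'r', encoding='utf8') as f:
    voice_map = yaml.safe_load(f)

alloy = voice_map['tts-1-hd']['alloy']
for key in ('enable_text_splitting', 'length_penalty', 'repetition_penalty',
            'speed', 'temperature', 'top_k', 'top_p'):
    # Keys that are not set fall back to the server's built-in defaults.
    print(key, '=', alloy.get(key, '(server default)'))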

openedai-speech's People

Contributors

matatonic, rodolfocastanheira


openedai-speech's Issues

Fine Tuned xttsv2

I see that this supports voice cloning but I haven't seen anywhere in the docs that this can be used with a fine tuned xtts v2 model.

Is that currently possible?

Support Python 3.12, Torch 2.4, CUDA 12.4

Torch 2.3.x only supports CUDA 12.1
Torch 2.4, which came out 3 weeks ago (current Docker builds are older), supports CUDA 12.4 and also Python 3.12
Upcoming Torch 2.4.1 seems to support CUDA 12.5.

I see the dependencies simply specify latest torch, so maybe just rebuilding the current image and pushing an update would be enough?

EDIT: Apparently I was wrong to think the host CUDA version has to match the torch CUDA version; a newer CUDA on the host seems to work fine with an older CUDA in apps. My problem turned out to be a Docker documentation issue that kept it from running, but updating the image would still be nice.

Feature Request: Model unload

Greetings,
just like matatonic/openedai-whisper#1.

I've got a quick and dirty implementation running for openedai-speech already. Code is attached; changes are marked.

speech.py:

#!/usr/bin/env python3
import argparse
import os
import sys
import re
import subprocess
import tempfile
import yaml
#neubsi
import threading
import time
#~neubsi
from fastapi.responses import StreamingResponse
import uvicorn
from pydantic import BaseModel
from loguru import logger

# for parler
try:
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer, logging
    import torch
    import soundfile as sf
    logging.set_verbosity_error()
    has_parler_tts = True
except ImportError:
    logger.info("No parler support found")
    has_parler_tts = False

from openedai import OpenAIStub, BadRequestError

xtts = None
args = None
app = OpenAIStub()

class xtts_wrapper():
#neubsi
    def unload(self):
        logger.info("Unloading model")
        if self.xtts is not None:
            import torch, gc
            del self.xtts
            self.xtts = None
            gc.collect()
            torch.cuda.empty_cache()
#~neubsi
    def __init__(self, model_name, device):
        self.model_name = model_name
#neubsi
        self.device = device
#~neubsi
        self.xtts = TTS(model_name=model_name, progress_bar=False).to(device)

    def tts(self, text, speaker_wav, speed, language):
        tf, file_path = tempfile.mkstemp(suffix='.wav')
#neubsi
        global timer
        global time_s
        if timer is not None:
            timer.cancel() # Cancel any existing timer
        if self.xtts is None:
            logger.info("Loading model")
            self.xtts = TTS(model_name=self.model_name, progress_bar=False).to(self.device)
#~neubsi
        file_path = self.xtts.tts_to_file(
            text=text,
            language=language,
            speaker_wav=speaker_wav,
            speed=speed,
            file_path=file_path,
        )
#neubsi#
        logger.info("Setting unload timer")
        timer = threading.Timer(time_s, self.unload)
        timer.start()
#~neubsi
        os.unlink(file_path)
        return tf

#neubsi
timer = None
time_s = 300
#~neubsi

class parler_tts():
    def __init__(self, model_name, device):
        self.model_name = model_name
        self.model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tts(self, text, description):
        input_ids = self.tokenizer(description, return_tensors="pt").input_ids.to(self.model.device)
        prompt_input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(self.model.device)

        generation = self.model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
        audio_arr = generation.cpu().numpy().squeeze()
        
        tf, file_path = tempfile.mkstemp(suffix='.wav')
        sf.write(file_path, audio_arr, self.model.config.sampling_rate)
        os.unlink(file_path)
        return tf


def default_exists(filename: str):
    if not os.path.exists(filename):
        basename, ext = os.path.splitext(filename)
        default = f"{basename}.default{ext}"
        
        logger.info(f"{filename} does not exist, setting defaults from {default}")

        with open(default, 'r') as from_file:
            with open(filename, 'w') as to_file:
                to_file.write(from_file.read())

# Read pre process map on demand so it can be changed without restarting the server
def preprocess(raw_input):
    logger.debug(f"preprocess: before: {[raw_input]}")
    default_exists('config/pre_process_map.yaml')
    with open('config/pre_process_map.yaml', 'r', encoding='utf8') as file:
        pre_process_map = yaml.safe_load(file)
        for a, b in pre_process_map:
            raw_input = re.sub(a, b, raw_input)
    
    raw_input = raw_input.strip()
    logger.debug(f"preprocess: after: {[raw_input]}")
    return raw_input

# Read voice map on demand so it can be changed without restarting the server
def map_voice_to_speaker(voice: str, model: str):
    default_exists('config/voice_to_speaker.yaml')
    with open('config/voice_to_speaker.yaml', 'r', encoding='utf8') as file:
        voice_map = yaml.safe_load(file)
        try:
            m = voice_map[model][voice]['model']
            s = voice_map[model][voice]['speaker']
            l = voice_map[model][voice].get('language', 'en')

        except KeyError as e:
            raise BadRequestError(f"Error loading voice: {voice}, KeyError: {e}", param='voice')
        
        return (m, s, l)

class GenerateSpeechRequest(BaseModel):
    model: str = "tts-1" # or "tts-1-hd"
    input: str
    voice: str = "alloy"  # alloy, echo, fable, onyx, nova, and shimmer
    response_format: str = "mp3" # mp3, opus, aac, flac
    speed: float = 1.0 # 0.25 - 4.0

def build_ffmpeg_args(response_format, input_format, sample_rate):
    # Convert the output to the desired format using ffmpeg
    if input_format == 'raw':
        ffmpeg_args = ["ffmpeg", "-loglevel", "error", "-f", "s16le", "-ar", sample_rate, "-ac", "1", "-i", "-"]
    else:
        ffmpeg_args = ["ffmpeg", "-loglevel", "error", "-f", "WAV", "-i", "-"]
    
    if response_format == "mp3":
        ffmpeg_args.extend(["-f", "mp3", "-c:a", "libmp3lame", "-ab", "64k"])
    elif response_format == "opus":
        ffmpeg_args.extend(["-f", "ogg", "-c:a", "libopus"])
    elif response_format == "aac":
        ffmpeg_args.extend(["-f", "adts", "-c:a", "aac", "-ab", "64k"])
    elif response_format == "flac":
        ffmpeg_args.extend(["-f", "flac", "-c:a", "flac"])

    return ffmpeg_args

@app.post("/v1/audio/speech", response_class=StreamingResponse)
async def generate_speech(request: GenerateSpeechRequest):
    global xtts, args
    if len(request.input) < 1:
        raise BadRequestError("Empty Input", param='input')

    input_text = preprocess(request.input)

    if len(input_text) < 1:
        raise BadRequestError("Input text empty after preprocess.", param='input')

    model = request.model
    voice = request.voice
    response_format = request.response_format
    speed = request.speed

    # Set the Content-Type header based on the requested format
    if response_format == "mp3":
        media_type = "audio/mpeg"
    elif response_format == "opus":
        media_type = "audio/ogg;codecs=opus"
    elif response_format == "aac":
        media_type = "audio/aac"
    elif response_format == "flac":
        media_type = "audio/x-flac"

    ffmpeg_args = None
    tts_io_out = None

    # Use piper for tts-1, and if xtts_device == none use for all models.
    if model == 'tts-1' or args.xtts_device == 'none':
        piper_model, speaker, not_used_language = map_voice_to_speaker(voice, 'tts-1')
        tts_args = ["piper", "--model", str(piper_model), "--data-dir", "voices", "--download-dir", "voices", "--output-raw"]
        if speaker:
            tts_args.extend(["--speaker", str(speaker)])
        if speed != 1.0:
            tts_args.extend(["--length-scale", f"{1.0/speed}"])

        tts_proc = subprocess.Popen(tts_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        tts_proc.stdin.write(bytearray(input_text.encode('utf-8')))
        tts_proc.stdin.close()
        tts_io_out = tts_proc.stdout
        ffmpeg_args = build_ffmpeg_args(response_format, input_format="raw", sample_rate="22050")

    # Use xtts for tts-1-hd
    elif model == 'tts-1-hd':
        tts_model, speaker, language = map_voice_to_speaker(voice, 'tts-1-hd')

        if xtts is not None and xtts.model_name != tts_model:
            import torch, gc
            del xtts
            xtts = None
            gc.collect()
            torch.cuda.empty_cache()

        if 'parler-tts' in tts_model and has_parler_tts:
            if xtts is None:
                xtts = parler_tts(tts_model, device=args.xtts_device)

            ffmpeg_args = build_ffmpeg_args(response_format, input_format="WAV", sample_rate=str(xtts.model.config.sampling_rate))

            if speed != 1:
                ffmpeg_args.extend(["-af", f"atempo={speed}"]) 

            tts_io_out = xtts.tts(text=input_text, description=speaker)

        else:
            if xtts is None:
                xtts = xtts_wrapper(tts_model, device=args.xtts_device)

            ffmpeg_args = build_ffmpeg_args(response_format, input_format="WAV", sample_rate="24000")

            # tts speed doesn't seem to work well
            if speed < 0.5:
                speed = speed / 0.5
                ffmpeg_args.extend(["-af", "atempo=0.5"]) 
            if speed > 1.0:
                ffmpeg_args.extend(["-af", f"atempo={speed}"]) 
                speed = 1.0

            tts_io_out = xtts.tts(text=input_text, speaker_wav=speaker, speed=speed, language=language)
    else:
        raise BadRequestError("No such model, must be tts-1 or tts-1-hd.", param='model')

    # Pipe the output from piper/xtts to the input of ffmpeg
    ffmpeg_args.extend(["-"])
    ffmpeg_proc = subprocess.Popen(ffmpeg_args, stdin=tts_io_out, stdout=subprocess.PIPE)

    return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='OpenedAI Speech API Server',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('--xtts_device', action='store', default="cuda", help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
    parser.add_argument('--preload', action='store', default=None, help="Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use.")
    parser.add_argument('-P', '--port', action='store', default=8000, type=int, help="Server tcp port")
    parser.add_argument('-H', '--host', action='store', default='0.0.0.0', help="Host to listen on, Ex. 0.0.0.0")
    parser.add_argument('-L', '--log-level', default="INFO", choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], help="Set the log level")

    args = parser.parse_args()

    logger.remove()
    logger.add(sink=sys.stderr, level=args.log_level)

    if args.xtts_device != "none":
        from TTS.api import TTS

    if args.preload:
        if 'parler-tts' in args.preload:
            xtts = parler_tts(args.preload, device=args.xtts_device)
        else:
            xtts = xtts_wrapper(args.preload, device=args.xtts_device)

    app.register_model('tts-1')
    app.register_model('tts-1-hd')

    uvicorn.run(app, host=args.host, port=args.port)

thank you

AMD GPU

As the title suggests, is there any chance that this project could work with PyTorch for AMD GPUs?

no amd64 docker image for 0.18.1 and latest tags

There's no AMD64 docker image available for 0.18.1 and latest. Not showing in the releases either. Last AMD64 docker image is still 0.17.2

See docker pull output:

docker pull ghcr.io/matatonic/openedai-speech:0.18.1
0.18.1: Pulling from matatonic/openedai-speech
no matching manifest for linux/amd64 in the manifest list entries

docker pull ghcr.io/matatonic/openedai-speech:latest
latest: Pulling from matatonic/openedai-speech
no matching manifest for linux/amd64 in the manifest list entries

Keep Piper alive

After some time of inactivity it takes very long to generate audio again. Am I doing something wrong?

Stack: Windows 11, WSL 2, Docker (gpu)

Error with custom voice

Hi, I have added voices/glados.onnx and voices/glados.onnx.json from: https://github.com/dnhkng/GlaDOS
I have also added the following to config/voice_to_speaker.yaml and config/voice_to_speaker.default.yaml:

  glados:
    model: voices/glados.onnx
    speaker: 163

The voice is not generated and the server prints:

INFO: 192.168.64.113:37818 - "POST /v1/audio/speech HTTP/1.1" 200 OK
Traceback (most recent call last):
  File "/root/audio/openedai-speech/.venv/bin/piper", line 8, in <module>
    sys.exit(main())
  File "/root/audio/openedai-speech/.venv/lib/python3.11/site-packages/piper/__main__.py", line 126, in main
    for audio_bytes in audio_stream:
  File "/root/audio/openedai-speech/.venv/lib/python3.11/site-packages/piper/voice.py", line 123, in synthesize_stream_raw
    yield self.synthesize_ids_to_raw(
  File "/root/audio/openedai-speech/.venv/lib/python3.11/site-packages/piper/voice.py", line 166, in synthesize_ids_to_raw
    audio = self.session.run(
  File "/root/audio/openedai-speech/.venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid input name: sid

OS: debian 12
python: 3.11
curl: curl http://192.168.73.78:8000/v1/audio/speech -H "Content-Type: application/json" -d '{ "model": "tts-1", "input": "The quick brown fox jumped over the lazy dog.", "voice": "glados", "response_format": "mp3", "speed": 1.0 }' > speech.mp3

Thank you.

[Feature Request] Add new voice by API

Hello, and thank you for your support of this project.
First, I want to say this is a good project.
Can you help us add an API to add new voices?
Maybe I can help if you point me to what I would need to change to support it.

v0.17.3 - No module named 'yaml'

b5d0daf

I now get this error that causes openedai-speech to restart over and over within Docker desktop:

Traceback (most recent call last):
File "/app/speech.py", line 12, in <module>
import yaml
ModuleNotFoundError: No module named 'yaml'

Here's how I have incorporated openedai-speech into a docker-compose.yml file (with other services):

services:
  openedai-speech-min:
    image: ghcr.io/matatonic/openedai-speech-min
    container_name: openedai-speech-min
    ports:
      - "8000:8000"
    env_file:
      - openedai-speech-min.env
    volumes:
      - .tts-voices:/app/voices # bind mount
      - .tts-config:/app/config # bind mount
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
    restart: unless-stopped

openedai-speech-min.env:

# Openedai-speech
TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#PRELOAD_MODEL=parler-tts/parler_tts_mini_v0.1

Edit: This was solved very quickly. I just forgot to come back and update my comment. Thank you! :)

ARM (Apple Silicon) Image?

On my macOS device:

docker pull ghcr.io/matatonic/openedai-speech:latest
latest: Pulling from matatonic/openedai-speech
no matching manifest for linux/arm64/v8 in the manifest list entries

Looks like the current docker containers are only built for amd64. Is it possible to build for ARM?

Silent failure to generate speech

I am experiencing a silent failure to generate speech. The resulting .mp3 file is always empty, containing 1 second of silence. I have tried both the Docker installation and the non-Docker installation on my baremetal Ubuntu server and on Ubuntu running in WSL2 on my Windows 10 machine.

Steps to Reproduce:

  1. Clone the repository.
  2. Follow the Docker installation instructions.
  3. docker compose -f docker-compose.min.yml up
  4. Run the following command:

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
  "model": "tts-1",
  "input": "The quick brown fox jumped over the lazy dog.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}' > speech.mp3

Expected Result:
An .mp3 file with the generated speech.

Actual Result:
An .mp3 file containing 1 second of silence.

Additional Information:
When changing the TTS model to tts-1-hd, the resulting file has 2 seconds of silence instead of 1.

Logs:
The debug logs appear normal and show a successful response:

 ✔ Container openedai-speech-server-1  Created                                                                     0.0s
Attaching to server-1
server-1  | INFO:     Started server process [13]
server-1  | INFO:     Waiting for application startup.
server-1  | INFO:     Application startup complete.
server-1  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
server-1  | 2024-07-12 02:33:50.090 | DEBUG    | openedai:log_requests:120 - Request path: /v1/audio/speech
server-1  | 2024-07-12 02:33:50.091 | DEBUG    | openedai:log_requests:121 - Request method: POST
server-1  | 2024-07-12 02:33:50.091 | DEBUG    | openedai:log_requests:122 - Request headers: Headers({'host': 'localhost:8000', 'user-agent': 'curl/7.81.0', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '158'})
server-1  | 2024-07-12 02:33:50.091 | DEBUG    | openedai:log_requests:123 - Request query params:
server-1  | 2024-07-12 02:33:50.091 | DEBUG    | openedai:log_requests:124 - Request body: b'{\n    "model": "tts-1",\n    "input": "The quick brown fox jumped over the lazy dog.",\n    "voice": "alloy",\n    "response_format": "mp3",\n    "speed": 1.0\n  }'
server-1  | 2024-07-12 02:33:50.099 | DEBUG    | openedai:log_requests:128 - Response status code: 200
server-1  | 2024-07-12 02:33:50.099 | DEBUG    | openedai:log_requests:129 - Response headers: MutableHeaders({'content-type': 'audio/mpeg'})
server-1  | INFO:     172.19.0.1:49232 - "POST /v1/audio/speech HTTP/1.1" 200 OK

Configuration Files:
docker-compose.yml:

  server:
    build:
      dockerfile: Dockerfile
    image: ghcr.io/matatonic/openedai-speech
    env_file: speech.env
    ports:
      - "8000:8000"
    volumes:
      - ./voices:/app/voices
      - ./config:/app/config
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

speech.env:

HF_HOME=voices
PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
EXTRA_ARGS=--log-level DEBUG
#USE_ROCM=1

Please let me know if there's any additional information I should include. Thank you for looking into this issue!

Better error handling of invalid parameters.

Here, I happened to request the non-existent voice 'alls':

server-1  |   File "/app/speech.py", line 141, in generate_speech
server-1  |     piper_model, speaker = map_voice_to_speaker(voice, 'tts-1')
server-1  |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
server-1  |   File "/app/speech.py", line 90, in map_voice_to_speaker
server-1  |     return (voice_map[model][voice]['model'], voice_map[model][voice]['speaker'])
server-1  |             ~~~~~~~~~~~~~~~~^^^^^^^
server-1  | KeyError: 'alls'
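
For reference, a minimal sketch of the kind of guard that turns this into a proper 400 response (the speech.py listing earlier on this page already includes a similar try/except around the voice lookup):

from openedai import BadRequestError

def map_voice_to_speaker(voice: str, model: str, voice_map: dict):
    # Look up the voice entry, raising a clear 400 error instead of leaking a raw KeyError.
    try:
        entry = voice_map[model][voice]
        return entry['model'], entry['speaker']
    except KeyError as e:
        raise BadRequestError(f"Error loading voice: {voice}, KeyError: {e}", param='voice')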

Custom piper voice generates empty mp3

I trained an onnx model with piper. Tested and it works well.
When adding it to openedai-speech as a speaker, I just get an empty mp3.

voice_to_speaker.yml

nocare:
    model: voices/en_US-nocare-medium.onnx
    speaker: # default

[screenshot]

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "nocare",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3

I'm using the non-gpu docker image and I'm not really sure where I can find any kind of log output. Any help is appreciated

Install fails on Windows // Deepspeed fails to install, No module named 'TTS', piper-phonemize~=1.0.0 not available

When trying to run speech.py I am getting this error:

speech.py", line 333, in <module>
    from TTS.tts.configs.xtts_config import XttsConfig
ModuleNotFoundError: No module named 'TTS'

I am on Windows 11 with python 3.11.9

I really don't have a lot of experience with python and running programs, but what I did was:

git clone repo
create a virtual environment .venv
activate said virtual environment .venv\Scripts\Activate
pip install -r requirements.txt
run speech.py - getting TTS module error
go back to virtual environment and install TTS
get above error

Also an error when installing deepspeed with pip install -r requirements.txt

Collecting deepspeed (from -r requirements.txt (line 6))
  Using cached deepspeed-0.14.4.tar.gz (1.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [23 lines of output]
      [WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
       [WARNING]  unable to import torch, please install it if you want to pre-compile any deepspeed ops.
      DS_BUILD_OPS=1
      Traceback (most recent call last):
        File "F:\Project\LLM\openedai-speech\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "F:\Project\LLM\openedai-speech\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "F:\Project\LLM\openedai-speech\.venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\USER\AppData\Local\Temp\pip-build-env-3vt8r0ws\overlay\Lib\site-packages\setuptools\build_meta.py", line 327, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\USER\AppData\Local\Temp\pip-build-env-3vt8r0ws\overlay\Lib\site-packages\setuptools\build_meta.py", line 297, in _get_build_requires
          self.run_setup()
        File "C:\Users\USER\AppData\Local\Temp\pip-build-env-3vt8r0ws\overlay\Lib\site-packages\setuptools\build_meta.py", line 497, in run_setup
          super().run_setup(setup_script=setup_script)
        File "C:\Users\USER\AppData\Local\Temp\pip-build-env-3vt8r0ws\overlay\Lib\site-packages\setuptools\build_meta.py", line 313, in run_setup
          exec(code, locals())
        File "<string>", line 149, in <module>
      AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

I made sure torch was installed under the same virtual environment.

Coqui Xttsv2 doesn't work on Mac GPU

Hey,
I love this project, but when adding a custom voice and calling it with the API I get this error:

{"message":"Error loading voice: josie, KeyError: 'josie'","code":400,"type":"BadRequestError","param":"voice"}%

Here is the voice_to_speaker.yaml config file; the wav file exists and is in the specified path:

tts-1-hd:
  josie:
    model: xtts_v2.0.2
    speaker: voices/josie.wav
    language: auto
    enable_text_splitting: True
    length_penalty: 1.0
    repetition_penalty: 10
    speed: 1.0
    temperature: 0.75
    top_k: 50
    top_p: 0.85
    comment: J.O.S.I.E.'s voice is a calm yet profetional and smooth style with a litle flirty tone.

Here are the logs:

2024-08-15 16:49:39 server-1  | INFO:     Started server process [17]
2024-08-15 16:49:39 server-1  | INFO:     Waiting for application startup.
2024-08-15 16:49:39 server-1  | INFO:     Application startup complete.
2024-08-15 16:49:39 server-1  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2024-08-15 16:49:49 server-1  | INFO:     192.168.65.1:47932 - "POST /v1/audio/speech HTTP/1.1" 422 Unprocessable Entity
2024-08-15 16:49:55 server-1  | 2024-08-15 14:49:55.018 | INFO     | openedai:openai_statuserror_handler:106 - BadRequestError(message="Error loading voice: josie, KeyError: 'josie'", code=400, param=voice)
2024-08-15 16:49:55 server-1  | INFO:     192.168.65.1:33620 - "POST /v1/audio/speech HTTP/1.1" 400 Bad Request

Docker compose up failing (Mac M2)

Running on a Mac M2..

docker compose up
[+] Running 0/1
⠏ server Pulling 1.0s
[+] Building 3.6s (10/12) docker:desktop-linux
=> [server internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 496B 0.0s
=> [server internal] load metadata for docker.io/library/python:3.11-slim 0.4s
=> [server internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [server stage-0 1/8] FROM docker.io/library/python:3.11-slim@sha256:6d2502238109c929569ae99355e28890c438cb11bc88ef02cd189c173b3db07c 0.0s
=> [server internal] load build context 0.0s
=> => transferring context: 963B 0.0s
=> CACHED [server stage-0 2/8] RUN apt-get update && apt-get install --no-install-recommends -y curl git ffmpeg 0.0s
=> CACHED [server stage-0 3/8] RUN mkdir -p /app/voices 0.0s
=> CACHED [server stage-0 4/8] WORKDIR /app 0.0s
=> CACHED [server stage-0 5/8] COPY *.txt /app/ 0.0s
=> ERROR [server stage-0 6/8] RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt 3.2s

[server stage-0 6/8] RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt:
0.768 Collecting git+https://github.com/huggingface/parler-tts.git (from -r requirements.txt (line 11))
0.768 Cloning https://github.com/huggingface/parler-tts.git to /tmp/pip-req-build-dhep97ob
0.769 Running command git clone --filter=blob:none --quiet https://github.com/huggingface/parler-tts.git /tmp/pip-req-build-dhep97ob
1.933 Resolved https://github.com/huggingface/parler-tts.git to commit be2acc26bce06ae868c7d956ee1708e33e189dd4
1.938 Installing build dependencies: started
2.496 Installing build dependencies: finished with status 'done'
2.496 Getting requirements to build wheel: started
2.576 Getting requirements to build wheel: finished with status 'done'
2.576 Installing backend dependencies: started
2.852 Installing backend dependencies: finished with status 'done'
2.852 Preparing metadata (pyproject.toml): started
2.927 Preparing metadata (pyproject.toml): finished with status 'done'
3.016 Collecting fastapi (from -r requirements.txt (line 1))
3.017 Using cached fastapi-0.111.0-py3-none-any.whl.metadata (25 kB)
3.056 Collecting uvicorn (from -r requirements.txt (line 2))
3.056 Using cached uvicorn-0.29.0-py3-none-any.whl.metadata (6.3 kB)
3.074 Collecting piper-tts==1.2.0 (from -r requirements.txt (line 4))
3.075 Using cached piper_tts-1.2.0-py3-none-any.whl.metadata (776 bytes)
3.099 ERROR: Could not find a version that satisfies the requirement onnxruntime-gpu (from versions: none)
3.099 ERROR: No matching distribution found for onnxruntime-gpu


failed to solve: process "/bin/sh -c pip install -r requirements.txt" did not complete successfully: exit code: 1

Dockerfile does not follow standard practice for RUN commands, leading to large image size and amount of layers

RUN apt-get update && apt-get install --no-install-recommends -y curl ffmpeg
RUN if [ "$TARGETPLATFORM" != "linux/amd64" ]; then apt-get install --no-install-recommends -y build-essential ; fi
RUN if [ "$TARGETPLATFORM" != "linux/amd64" ]; then curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y ; fi
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

This creates 4 layers, and the apt caches and package lists are still kept in the image because they were installed and removed in separate layers.

Just convert it to a single RUN and use a heredoc for nice syntax on top:

RUN <<EOR
apt-get update
apt-get install --no-install-recommends -y curl ffmpeg
if [ "$TARGETPLATFORM" != "linux/amd64" ]; then 
	apt-get install --no-install-recommends -y build-essential
fi
if [ "$TARGETPLATFORM" != "linux/amd64" ]; then 
	curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
fi
apt-get clean && rm -rf /var/lib/apt/lists/*
EOR

Can't curl a fine-tuned xtts-v2 model

Hi there!
Been struggling for a couple of days with this.

I set up the model in voice_to_speaker.yml as it's specified in the readme:

tts-1-hd:
  argentinian:
    model: argentinian
    model_path: voices/argentinian
    speaker: voices/argentinian/sample.wav

and copied the files to the specified folder.

I then deploy the API on Ubuntu with docker compose.

When I try it out with curl:

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "argentinian",
    "input": "The quick brown fox jumped over the lazy dog.",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3

Which yields the error BadRequestError(message='No such model, must be tts-1 or tts-1-hd.', code=400, param=model)

If I try with that model name, like so:

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "tts-1-hd",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "argentinian",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3

I get BadRequestError(message="Error loading voice: argentinian, KeyError: 'argentinian'", code=400, param=voice)

I keep banging my head with this. What am I missing?

New voice TTS-1 FR

Hello

I'm struggling to find out how to add french voices.

What I have done:

#!/bin/sh
models=${*:-"fr_FR-tom-medium"} # en_US-ryan-high
piper --update-voices --data-dir voices --download-dir voices --model x 2> /dev/null
for i in $models ; do
    [ ! -e "./$i.onnx" ] && piper --data-dir ./ --download-dir ./ --model $i < /dev/null > /dev/null
done

2 files are downloaded (same place as the others)
fr_FR-tom-medium.onnx
fr_FR-tom-medium.onnx.json
(same rights as the other files 0644)

In voice_to_speaker.yaml, I have added

tts-1:
  tom:
    model: voices/fr_FR-tom-medium.onnx
    speaker: # default speaker

When I'm trying (with openwebui) to use a model:
alloy
The name alloy is auto-completed.
model is running.
Voice is ok

tom
The name tom is NOT auto-completed.
model is NOT running.
No voice
and nothing in log :0

Thanks for help

Langdetect is 100 times slower than fasttext-langdetect

Tested using ipython on python 3.11

In [7]: %timeit ftlangdetect.detect("bonjour, je viens de me reveiller")
18.4 μs ± 4.8 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit langdetect.detect("bonjour, je viens de me reveiller")
21.3 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

One gotcha is that ftlangdetect crashes if there is a newline, so what I did on another project was to simply replace newlines with a space, for example.

https://pypi.org/project/fasttext-langdetect/
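
For reference, the workaround might look roughly like this (a sketch, assuming the fasttext-langdetect package's detect() function; the newline replacement is the gotcha mentioned above):

from ftlangdetect import detect

def detect_language(text: str) -> str:
    # fasttext chokes on newlines, so flatten them to spaces first
    result = detect(text.replace("\n", " "))
    return result["lang"]  # e.g. 'fr'

print(detect_language("bonjour, je viens de me reveiller"))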

Piper voices not working.

I'm getting an error when trying to use a custom voice with Piper. I'm using docker and custom/default Coqui voices work and default Piper voices work. The files are downloaded as well. Tried a couple different voices like en_GB alba & southern_english_female. I tried using my legacy nvidia driver (cuda warning raised) and without GPU. Let me know if you need more info.

Here's an excerpt from voice_to_speaker.yaml:

  shimmer:
    model: voices/en_US-libritts_r-medium.onnx
    speaker: 163
  alba:
    model: voices/en_GB-alba-medium.onnx
    speaker: 1 # default speaker
  southern:
    model: voices/en_GB-southern_english_female-low.onnx
    speaker: 1 # default speaker
  kristin:
    model: voices/en_US-kristin-medium.onnx
    speaker: # default speaker

Here's the log error from docker:

INFO: 172.18.3.1:46844 - "POST /v1/audio/speech HTTP/1.1" 200 OK

Traceback (most recent call last):
  File "/usr/local/bin/piper", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.11/site-packages/piper/__main__.py", line 126, in main
    for audio_bytes in audio_stream:
  File "/usr/local/lib/python3.11/site-packages/piper/voice.py", line 123, in synthesize_stream_raw
    yield self.synthesize_ids_to_raw(
  File "/usr/local/lib/python3.11/site-packages/piper/voice.py", line 166, in synthesize_ids_to_raw
    audio = self.session.run(
  File "/usr/local/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid input name: sid

Question: why was CUDA support removed for piper?

Hi, I saw that in this commit you removed CUDA support for piper on purpose, apparently because it didn't work for your setup?

I did a few tests and it seemed to work fine for me, and I was wondering if you could consider re-adding support for it.

For me it was as simple as pip install onnxruntime-gpu and then adding --piper to the subprocess call. It could stay opt-in and require an argument in the voice yaml, for example.

What do you think? Am I missing something as to why it should be removed? Is it the size of the image?

Thanks!

Streaming is unavailable when invoking the API

Thank you for this great project. When I use xtts as the speech engine and try to stream audio playback, I observe that the code outputs the audio to the player only after the entire audio synthesis is completed. However, when I directly call OpenAI's TTS API, I can play the audio during the synthesis process.
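
For what it's worth, consuming the response incrementally on the client side might look like the sketch below (using the openai-python client; whether audio actually arrives before synthesis finishes still depends on the server-side engine):

import openai

client = openai.OpenAI(api_key="sk-111111111", base_url="http://localhost:8000/v1")

with client.audio.speech.with_streaming_response.create(
  model="tts-1-hd",
  voice="alloy",
  input="Streaming lets playback start before the whole file is rendered.",
) as response:
  with open("stream.mp3", "wb") as f:
    for chunk in response.iter_bytes(chunk_size=4096):
      f.write(chunk)  # hand chunks to a player here instead of a file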

Docker Build Error

ERROR: Could not build wheels for TTS, python-crfsuite, sudachipy, which is required to install pyproject.toml-based projects

Dockerfile:11
--------------------
   9 |     
  10 |     COPY requirements.txt /app/
  11 | >>> RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
  12 |     
  13 |     COPY speech.py openedai.py say.py *.sh README.md LICENSE /app/
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -r requirements.txt" did not complete successfully: exit code: 1

Fedora Linux arm64 v8

Nvidia GPU RTX: 3060 not being utilized in the docker image

Even though I have a working 3060 with CUDA set up on my main machine, when I run a docker container using the latest openedai-speech image, after the first time, every time the system starts I can see this error/warning in the logs:

[screenshot]

and then the model only uses the CPU, which is very slow. Additionally, from the container exec when I run nvidia-smi:
[screenshot]

and when I try to check the CUDA status in Python I again get the same error:
[screenshot]

CUDA acceleration broken

FROM python:3.11-slim in the non-minimal Dockerfile is not enough to get CUDA support.
Consider this instead: 2.3.0-cuda12.1-cudnn8-runtime, as TTS requires torch>2.1.
Yes, that will make the image huge (+3.5GB), but I don't think there are any slim CUDA images.

CUDNN_STATUS_NOT_SUPPORTED

Everything seems to be fine. Integration with OpenWebUI is working great !
I can see cuda usage when generating tts.

But there is this CUDNN warning on the logs:

server-1  |  > Text splitted to sentences.
server-1  | ['I can help you with that!']
server-1  |  > Processing time: 1.2529590129852295
server-1  |  > Real-time factor: 0.5532631014964016
server-1  | INFO:     172.18.0.1:44926 - "POST /v1/audio/speech HTTP/1.1" 200 OK
server-1  | /usr/local/lib/python3.11/site-packages/torch/nn/modules/conv.py:306: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
server-1  |   return F.conv1d(input, weight, bias, self.stride,
server-1  |  > Text splitted to sentences.

Should this be ignored?
