A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.

License: MIT License


realtimestt's Introduction

RealtimeSTT

Easy-to-use, low-latency speech-to-text library for realtime applications

About the Project

RealtimeSTT listens to the microphone and transcribes voice into text.

Hint: Check out Linguflex, the original project from which RealtimeSTT is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

It's ideal for:

  • Voice Assistants
  • Applications requiring fast and precise speech-to-text conversion
Demo video: RealtimeSTT.mp4

Updates

Latest Version: v0.1.15

See release history.

Hint: Since we use the multiprocessing module now, be sure to include the if __name__ == '__main__': protection in your code to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation of why this is important, visit the official Python documentation on multiprocessing.
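
A minimal sketch of the guarded entry point (a standard Python pattern; the recorder usage mirrors the Quick Start below):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    # The guard is required because RealtimeSTT spawns worker processes;
    # without it, child processes re-import this module and re-run top-level code.
    recorder = AudioToTextRecorder()
    print(recorder.text())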

Features

  • Voice Activity Detection: Automatically detects when you start and stop speaking.
  • Realtime Transcription: Transforms speech to text in real-time.
  • Wake Word Activation: Can activate upon detecting a designated wake word.

Hint: Check out RealtimeTTS, the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.

Tech Stack

This library uses:

  • Voice Activity Detection
    • WebRTCVAD for initial voice activity detection.
    • SileroVAD for more accurate verification.
  • Speech-To-Text
    • faster_whisper for instant (GPU-accelerated) transcription.
  • Wake Word Detection
    • Porcupine for wake word detection.

These components represent the "industry standard" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.

Installation

pip install RealtimeSTT

This will install all the necessary dependencies, including a CPU-only version of PyTorch.

Although it is possible to run RealtimeSTT with a CPU-only installation (use a small model like "tiny" or "base" in this case), you will get a far better experience using:

GPU Support with CUDA (recommended)

Additional steps are needed for a GPU-optimized installation. These steps are recommended for those who require better performance and have a compatible NVIDIA GPU.

Note: To check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.

To use RealtimeSTT with GPU support via CUDA please follow these steps:

  1. Install NVIDIA CUDA Toolkit 11.8:

    • Visit the NVIDIA CUDA Toolkit Archive and select the 11.8 version for your operating system.
    • Download and install the software.

  2. Install NVIDIA cuDNN 8.7.0 for CUDA 11.x:

    • Visit NVIDIA cuDNN Archive.
    • Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
    • Download and install the software.
  3. Install ffmpeg:

    Note: Installation of ffmpeg might not actually be needed to operate RealtimeSTT (thanks to jgilbert2017 for pointing this out).

    You can download an installer for your OS from the ffmpeg Website.

    Or use a package manager:
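
    For example (these are common package-manager commands offered as a hedged suggestion, not steps verified by this project):

    choco install ffmpeg      # Windows (Chocolatey)
    brew install ffmpeg       # macOS (Homebrew)
    sudo apt install ffmpeg   # Ubuntu/Debian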

  4. Install PyTorch with CUDA support:

    pip uninstall torch
    pip install torch==2.2.2+cu118 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
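
    To confirm the CUDA build is active, a quick check using standard PyTorch APIs (a sketch, not an official verification step):

    import torch
    print(torch.cuda.is_available())  # True if PyTorch can see a CUDA GPU
    print(torch.version.cuda)         # should report 11.8 for the cu118 wheels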

Quick Start

Basic usage:

Manual Recording

Start and stop of recording are manually triggered.

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder()
recorder.start()
recorder.stop()
print(recorder.text())

Automatic Recording

Recording based on voice activity detection.

with AudioToTextRecorder() as recorder:
    print(recorder.text())

When running recorder.text in a loop it is recommended to use a callback, allowing the transcription to run asynchronously:

def process_text(text):
    print(text)

while True:
    recorder.text(process_text)

Wakewords

Keyword activation before detecting voice. Write the comma-separated list of your desired activation keywords into the wake_words parameter. You can choose wake words from this list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.

recorder = AudioToTextRecorder(wake_words="jarvis")

print('Say "Jarvis" then speak.')
print(recorder.text())

Callbacks

You can set callback functions to be executed on different events (see Configuration):

def my_start_callback():
    print("Recording started!")

def my_stop_callback():
    print("Recording stopped!")

recorder = AudioToTextRecorder(on_recording_start=my_start_callback,
                               on_recording_stop=my_stop_callback)

Feed chunks

If you don't want to use the local microphone, set the use_microphone parameter to False and provide raw PCM audio chunks in 16-bit mono (sample rate 16000 Hz) with this method:

recorder.feed_audio(audio_chunk)
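
A minimal sketch of feeding chunks from a file (assuming a 16 kHz, mono, 16-bit WAV; the filename is hypothetical, while feed_audio and use_microphone are the documented API):

import wave
from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(use_microphone=False)

    with wave.open('speech.wav', 'rb') as wav_file:  # hypothetical input file
        chunk = wav_file.readframes(1024)
        while chunk:
            recorder.feed_audio(chunk)
            chunk = wav_file.readframes(1024)

    print(recorder.text())
    recorder.shutdown()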

Shutdown

You can shut down the recorder safely by using the context manager protocol:

with AudioToTextRecorder() as recorder:
    [...]

Or you can call the shutdown method manually (if using "with" is not feasible):

recorder.shutdown()

Testing the Library

The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeSTT library.

Test scripts depending on the RealtimeTTS library may require you to enter your Azure service region within the script. When using OpenAI-, Azure- or Elevenlabs-related demo scripts, the API keys should be provided in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY and ELEVENLABS_API_KEY (see RealtimeTTS).

  • simple_test.py

    • Description: A "hello world" styled demonstration of the library's simplest usage.
  • realtimestt_test.py

    • Description: Showcasing live-transcription.
  • wakeword_test.py

    • Description: A demonstration of the wakeword activation.
  • translator.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: Real-time translations into six different languages.
  • openai_voice_interface.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: Wake word activated and voice based user interface to the OpenAI API.
  • advanced_talk.py

    • Dependencies: Run pip install openai keyboard realtimetts.
    • Description: Choose TTS engine and voice before starting AI conversation.
  • minimalistic_talkbot.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: A basic talkbot in 20 lines of code.

The example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.

Configuration

Initialization Parameters for AudioToTextRecorder

When you initialize the AudioToTextRecorder class, you have various options to customize its behavior.

General Parameters

  • model (str, default="tiny"): Model size or path for transcription.

    • Options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
    • Note: If a size is provided, the model will be downloaded from the Hugging Face Hub.
  • language (str, default=""): Language code for transcription. If left empty, the model will try to auto-detect the language. Supported language codes are listed in the Whisper Tokenizer library.

  • compute_type (str, default="default"): Specifies the type of computation to be used for transcription. See Whisper Quantization.

  • input_device_index (int, default=0): Audio Input Device Index to use.

  • gpu_device_index (int, default=0): GPU Device Index to use. The model can also be loaded on multiple GPUs by passing a list of IDs (e.g. [0, 1, 2, 3]).

  • on_recording_start: A callable function triggered when recording starts.

  • on_recording_stop: A callable function triggered when recording ends.

  • on_transcription_start: A callable function triggered when transcription starts.

  • ensure_sentence_starting_uppercase (bool, default=True): Ensures that every sentence detected by the algorithm starts with an uppercase letter.

  • ensure_sentence_ends_with_period (bool, default=True): Ensures that every sentence that doesn't end with punctuation such as "?" or "!" ends with a period.

  • use_microphone (bool, default=True): Usage of local microphone for transcription. Set to False if you want to provide chunks with feed_audio method.

  • spinner (bool, default=True): Provides a spinner animation text with information about the current recorder state.

  • level (int, default=logging.WARNING): Logging level.

  • handle_buffer_overflow (bool, default=True): If set, the system will log a warning when an input overflow occurs during recording and remove the data from the buffer.

  • beam_size (int, default=5): The beam size to use for beam search decoding.

  • initial_prompt (str or iterable of int, default=None): Initial prompt to be fed to the transcription models.

  • suppress_tokens (list of int, default=[-1]): Tokens to be suppressed from the transcription output.

  • on_recorded_chunk: A callback function that is triggered when a chunk of audio is recorded. Submits the chunk data as parameter.

  • debug_mode (bool, default=False): If set, the system prints additional debug information to the console.
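
A hedged example combining several of the parameters above (the values are illustrative, not recommendations):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        model="small.en",  # main transcription model
        language="en",     # skip language auto-detection
        spinner=False,     # disable the console animation
        beam_size=5,       # beam search width for decoding
    )
    print(recorder.text())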

Real-time Transcription Parameters

Note: When enabling realtime transcription a GPU installation is strongly advised. Using realtime transcription may create high GPU loads.

  • enable_realtime_transcription (bool, default=False): Enables or disables real-time transcription of audio. When set to True, the audio will be transcribed continuously as it is being recorded.

  • realtime_model_type (str, default="tiny"): Specifies the size or path of the machine learning model to be used for real-time transcription.

    • Valid options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
  • realtime_processing_pause (float, default=0.2): Specifies the time interval in seconds after a chunk of audio gets transcribed. Lower values will result in more "real-time" (frequent) transcription updates but may increase computational load.

  • on_realtime_transcription_update: A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with the newly transcribed text as its argument.

  • on_realtime_transcription_stabilized: A callback function that is triggered whenever there's an update in the real-time transcription and returns a higher quality, stabilized text as its argument.

  • beam_size_realtime (int, default=3): The beam size to use for real-time transcription beam search decoding.
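
A sketch of enabling real-time transcription with the update callback (parameter names as documented above; the overwrite-style printing is illustrative):

from RealtimeSTT import AudioToTextRecorder

def on_update(text):
    # Called with the newly transcribed (possibly still unstable) text.
    print("\r" + text, end="", flush=True)

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        enable_realtime_transcription=True,
        realtime_model_type="tiny.en",
        realtime_processing_pause=0.2,
        on_realtime_transcription_update=on_update,
    )
    print("\n" + recorder.text())  # final transcription after speech ends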

Voice Activation Parameters

  • silero_sensitivity (float, default=0.6): Sensitivity for Silero's voice activity detection ranging from 0 (least sensitive) to 1 (most sensitive). Default is 0.6.

  • silero_use_onnx (bool, default=False): Enables usage of the pre-trained model from Silero in the ONNX (Open Neural Network Exchange) format instead of the PyTorch format. Default is False. Recommended for faster performance.

  • webrtc_sensitivity (int, default=3): Sensitivity for the WebRTC Voice Activity Detection engine ranging from 0 (least aggressive / most sensitive) to 3 (most aggressive, least sensitive). Default is 3.

  • post_speech_silence_duration (float, default=0.2): Duration in seconds of silence that must follow speech before the recording is considered to be completed. This ensures that any brief pauses during speech don't prematurely end the recording.

  • min_gap_between_recordings (float, default=1.0): Specifies the minimum time interval in seconds that should exist between the end of one recording session and the beginning of another to prevent rapid consecutive recordings.

  • min_length_of_recording (float, default=1.0): Specifies the minimum duration in seconds that a recording session should last to ensure meaningful audio capture, preventing excessively short or fragmented recordings.

  • pre_recording_buffer_duration (float, default=0.2): The time span, in seconds, during which audio is buffered prior to formal recording. This helps counterbalance the latency inherent in speech activity detection, ensuring no initial audio is missed.

  • on_vad_detect_start: A callable function triggered when the system starts to listen for voice activity.

  • on_vad_detect_stop: A callable function triggered when the system stops listening for voice activity.
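
For example, to tolerate longer mid-sentence pauses and keep more pre-speech audio (values borrowed from a user configuration later on this page; treat them as a starting point, not a recommendation):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        silero_sensitivity=0.4,             # less sensitive Silero VAD
        webrtc_sensitivity=2,               # slightly less aggressive WebRTC VAD
        post_speech_silence_duration=0.7,   # allow longer pauses before ending
        pre_recording_buffer_duration=0.5,  # keep more audio from before detection
    )
    print(recorder.text())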

Wake Word Parameters

  • wake_words (str, default=""): Wake words for initiating the recording. Multiple wake words can be provided as a comma-separated string. Supported wake words are: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator

  • wake_words_sensitivity (float, default=0.6): Sensitivity level for wake word detection (0 for least sensitive, 1 for most sensitive).

  • wake_word_activation_delay (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.

  • wake_word_timeout (float, default=5): Duration in seconds after a wake word is recognized. If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.

  • on_wakeword_detected: A callable function triggered when a wake word is detected.

  • on_wakeword_timeout: A callable function triggered when the system goes back to an inactive state because no speech was detected after wake word activation.

  • on_wakeword_detection_start: A callable function triggered when the system starts to listen for wake words.

  • on_wakeword_detection_end: A callable function triggered when the system stops listening for wake words (e.g. because of a timeout or a detected wake word).
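
A sketch wiring several wake word callbacks together (the callback bodies are illustrative):

from RealtimeSTT import AudioToTextRecorder

def on_detected():
    print("Wake word detected, listening...")

def on_timeout():
    print("No speech after the wake word, going back to sleep.")

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        wake_words="jarvis",
        wake_word_timeout=5,
        on_wakeword_detected=on_detected,
        on_wakeword_timeout=on_timeout,
    )
    print(recorder.text())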

Contribution

Contributions are always welcome!

License

MIT

Author

Kolja Beigel
Email: [email protected]
GitHub

realtimestt's People

Contributors

hannesdelbeke, koljab


realtimestt's Issues

Input device: what you hear

Good afternoon, is there any possibility to use a "what you hear" (loopback) device as the input?
I am considering this program to transcribe in real time the entire conversation between several people I can hear. Is that possible?

Multiprocessing issue on macOS

Hi,

Thank you for your great work! I am running 0.1.11 with Python 3.10 on latest macOS, and I am getting this error when running simple_test.py, which is similar to #7 and #29

Could you please share any workaround for this? Great thanks!

Say something... RealTimeSTT: root - ERROR - Unhandled exeption in _recording_worker:
Exception in thread Thread-1 (_recording_worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/RealtimeSTT/audio_recorder.py", line 994, in _recording_worker
    while (self.audio_queue.qsize() >
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()

Pytorch version mismatch

Servus, I followed all the steps and propose a tiny tweak to the Readme.

The regular pip install: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 creates a version mismatch between torch & torchaudio (on a brand new VM, configured from scratch).

RuntimeError: Detected that PyTorch and TorchAudio were compiled with different CUDA versions. PyTorch has CUDA version 11.8 whereas TorchAudio has CUDA version 11.7. Please install the TorchAudio version that matches your PyTorch version.

I recommend pinning the versions to fix the issue (if it appears): pip install torch==2.0.1+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
Source/inspiration.

Code works amazingly well, kudos 👍

I tried the German version and that works very well too (had to mute the logs). You have done something great here Kolja! GG.


Works on both 2060 & 4090, Ubuntu 22.04.

OSError: [Errno -9997] Invalid sample rate

OS: Linux Arch
Audio system: Pipewire (with alsa, pulse audio etc plugins).

It seems it is only possible for PyAudio to open an audio device with Pipewire at its default sample rate, not 16000Hz.

Would it be possible to run RealtimeSTT at a higher frequency than 16000Hz?

Support quantized models to save memory

First, thanks for creating a fantastic project! I was looking for a way to run Whisper or some other speech-to-text model in realtime. I found several potential solutions but this one is clearly the best, especially for implementing custom applications on top.

I noticed that faster-whisper supports quantized models but RealtimeSTT currently doesn't expose that option. With int8 quantization, models take up much less VRAM (or RAM, if run on CPU only). The quality of model output may suffer a little bit, but I think it's still a worthwhile optimization when memory is tight.

I have a laptop with an integrated NVIDIA GeForce MX150 GPU that only has 2GB VRAM. I was able to run the small model without problems (with tiny as the realtime model), but the medium and larger models gave a CUDA out of memory error.

To enable quantization, I tweaked the initialization of WhisperModel here

self.realtime_model_type = faster_whisper.WhisperModel(
    model_size_or_path=self.realtime_model_type,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

and here

model = faster_whisper.WhisperModel(
    model_size_or_path=model_path,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

by adding the parameter compute_type='int8'. This resulted in quantized models and the medium model can now fit on my feeble GPU; sadly, the large-v2 model is still too big.

GPU VRAM requirements as reported by nvidia-smi with and without quantization of the main model (realtime model is always tiny with the same quantization applied as for the main model):

model     default        int8
tiny      542 MiB        246 MiB
base      914 MiB        278 MiB
small     1386 MiB       532 MiB
medium    out of memory  980 MiB
large-v2  out of memory  out of memory

This could be exposed as an additional parameter compute_type for AudioToTextRecorder; or possibly two separate parameters, one for the realtime model and another for the main model. This parameter would then simply be passed as compute_type to the WhisperModel(s).
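
Note: the Configuration section above now documents a compute_type initialization parameter. Assuming it is forwarded to faster-whisper as this issue proposes, usage would look like the following sketch:

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    # 'int8' quantizes the weights to reduce VRAM/RAM usage, possibly at a
    # small cost in output quality (assumption based on this issue).
    recorder = AudioToTextRecorder(model="medium", compute_type="int8")
    print(recorder.text())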

TTS

Impressive work - do you have any insight on applying the same methodology for TTS?

does it support client-server mode realtime STT?

Hi, KoljaB

Thanks for your contribution. I learned a lot, although I'm still a newbie.
I have an application scenario:
a Linux machine with an NVIDIA GPU, which I want to use (as the server) to transcribe audio in real time from a MacBook (as the client).
Does this project support this application scenario?

Regards,
Snow

Scipy missing from requirements.txt?

I installed RealtimeSTT via pip (in a new and otherwise empty virtual environment) and tried to run a simple test script with from RealtimeSTT import AudioToTextRecorder as the only import. I then got a module-missing error for scipy. I installed it via pip and then everything worked fine (via CPU, as noted in the readme).

Does scipy need adding to the requirements.txt files?

Provide ID for data & transcription

Especially on slow devices (for example CPU-only), there is the problem of speech going on while transcription is happening. I want to add a "listen again" function using on_recorded_chunk, saving the chunk and making this file available for listening on the frontend. I want to map transcription and data later on, since transcription is async. To achieve this, I have to have the queue.

Can you please add a "transcription_id" argument to on_recorded_chunk and also to recorder.text?

Thank you.

unable to run script

Hi there,

I've been desperate to try your script after I saw it on reddit (we had a brief chat), but I can't for the life of me figure out what's going on.

I've tried:
Running from the GH repo with pip install realtimestt
Running from the GH repo without pip install realtimestt
running in a different env just using pip install realtimestt
running your test scripts
running the most 'basic' vanilla script

Environment:
MacBook Pro
macOS Ventura Version 13.5.1 (22G90)
Apple M2 Max
Conda Environment (fresh)
ffmpeg installed with Conda
Python 3.11.5
Pip freeze dump:
av==10.0.0
certifi==2023.7.22
charset-normalizer==3.2.0
colorama==0.4.6
coloredlogs==15.0.1
ctranslate2==3.19.0
enum34==1.1.10
faster-whisper==0.8.0
filelock==3.12.4
flatbuffers==23.5.26
fsspec==2023.9.1
halo==0.0.31
huggingface-hub==0.17.1
humanfriendly==10.0
idna==3.4
Jinja2==3.1.2
log-symbols==0.0.14
MarkupSafe==2.1.3
mpmath==1.3.0
networkx==3.1
numpy==1.25.2
onnxruntime==1.15.1
packaging==23.1
protobuf==4.24.3
pvporcupine==1.9.5
PyAudio==0.2.13
PyYAML==6.0.1
requests==2.31.0
six==1.16.0
spinners==0.0.24
sympy==1.12
termcolor==2.3.0
tokenizers==0.13.3
torch==2.0.1
torchaudio==2.0.2
tqdm==4.66.1
typing_extensions==4.7.1
urllib3==2.0.4
webrtcvad==2.0.10

Console dump:
[ctranslate2] [thread 2542417] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
File "/.../test whisper.py", line 4, in
recorder = AudioToTextRecorder(spinner=True, language="en", model="tiny.en", level=logging.WARNING)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 246, in init
self.silero_vad_model, _ = torch.hub.load(

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 555, in load
repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 199, in _get_cache_or_reload
repo_owner, repo_name, ref = _parse_repo_info(github)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 142, in _parse_repo_info
with urlopen(f"https://github.com/{repo_owner}/{repo_name}/tree/main/"):

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 519, in open
response = self._open(req, data)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 496, in _call_chain
result = func(*args)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 1352, in do_open
r = h.getresponse()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 1378, in getresponse
response.begin()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 318, in begin
version, status, reason = self._read_status()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/socket.py", line 706, in readinto
return self._sock.recv_into(b)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/ssl.py", line 1278, in recv_into
return self.read(nbytes, buffer)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/ssl.py", line 1134, in read
return self._sslobj.read(len, buffer)

KeyboardInterrupt
Exception ignored in: <function AudioToTextRecorder.__del__ at 0x1523b23e0>
Traceback (most recent call last):
File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 894, in __del__
self.shutdown()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 397, in shutdown
self.recording_thread.join()

AttributeError: 'AudioToTextRecorder' object has no attribute 'recording_thread'

Would love some help here!

Thanks,

The Captain

Cuda Error

Followed the installation steps to run the repo. When I try running realtimestt_test.py, I get this runtime error:

root - ERROR - Unhandled exeption in _realtime_worker: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for 
execution on the device
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\uber_\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
    self.run()
  File "C:\Users\uber_\AppData\Local\Programs\Python\Python39\lib\threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 1302, in _realtime_worker
    self.realtime_transcription_text = " ".join(
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 1302, in <genexpr>       
    self.realtime_transcription_text = " ".join(
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\faster_whisper\transcribe.py", line 511, in generate_segments 
    encoder_output = self.encode(segment)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\faster_whisper\transcribe.py", line 762, in encode
    return self.model.encode(features, to_cpu=to_cpu)
RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
Traceback (most recent call last):
    recorder.text(process_text)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 882, in text
    self.wait_audio()
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 802, in wait_audio
    if (self.stop_recording_event.wait(timeout=0.02)):

I confirmed that Cuda 11.8 is installed using: nvcc -V and have the cuDNN (8.7.0) files as well. Torch is also available.

input pcm buffer_size issue

In audio_recorder:

BUFFER_SIZE = 512
self.buffer_size = BUFFER_SIZE
def feed_audio(self, chunk):

When I call feed_audio, the input data size from our realtime server is 640/768 bytes (16 kHz, mono). Should I change the buffer_size (512) in the audio_recorder?
Thanks!

browser client example phrases repetition

I'm encountering slow real-time transcription and occasional repetition issues while using the browser client example (I didn't change anything in the script). It seems that the transcription process is significantly delayed, and certain phrases are repeated multiple times, as if the same chunk of text is being transcribed repeatedly.

[Feature request] Abort execution

Hi! I found the recorder.abort() function, and I think it would be great to have something like:

  1. Always listen for the wake word in the background, even while transcribing.
  2. If the wake word is detected, interrupt transcription and start a new session.
  3. It would also be great to have a callback to do something on interruption.

Launches but does not display any text

Just a spinner with the text "speak now". Tried large-v3, small and tiny models, no difference. The mic is working well; I just tested it by recording audio using Python. It's probably trying, because the CPU is heavily loaded, but there's no result.

macOS Sonoma 14.4

There are no errors in the console, just a warning:
[2024-04-24 23:53:30.031] [ctranslate2] [thread 8069444] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

issue with microphone

Dear,

my sound card is working fine, but with your program I tried many changes in my Ubuntu PCI sound configuration and it's not working, giving me these errors:
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave

Please, any help?

macOS

Code:

import ssl

ssl._create_default_https_context = ssl._create_unverified_context
import torch

from RealtimeSTT import AudioToTextRecorder

model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad", verbose=True)

if __name__ == '__main__':
    recorder = AudioToTextRecorder(spinner=False)

    print("Say something...")
    while True:
        print(recorder.text(), end=" ", flush=True)

Error:

RealTimeSTT: root - ERROR - Unhandled exeption in _recording_worker: 
Exception in thread Thread-1 (_recording_worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/hanxirui/workspace/python/DataScience/venv/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 667, in _recording_worker
    while self.audio_queue.qsize() > self.allowed_latency_limit:
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError

Question: verify installation

I just installed RealtimeSTT on windows.

I downloaded and installed cudnn and ffmpeg but they are just archives with binaries.

I didn't add anything to my PATH but the realtimestt_test.py seems to work fine.

How do I verify that my installation is correct (what are cuddn/ffmpeg required by)?

Also, is there any direct way to verify that my GPU is being used?

Thanks

Record blocked while transcribing (no real async possible)

When .text(function) is called, the microphone is blocked and not listening. Speech happening at the same time is not captured and is also not counted in recorder.audio_queue.qsize().

So a basic customer journey example.

  • "Speak now"
  • User Speaks - "Recording"
  • User stops speaking. "Transcribing"
  • Transcribing disappears. Faster-Whisper is trying to transcribe using large-v3
  • User speaks while big transcription is happening Gets ignored
  • Result of first text appears. "Speak now"

I tried enable_realtime_transcription: True and False. Both have the problem.
I am using recorder.text(process_text), which according to the docs is async as soon as I provide a function to .text(). But it appears not to be that async.

Can you please solve this? The queue appears to be buggy, and with a slow GPU/CPU there is guaranteed data loss due to a race condition.

Thank you

Simple test not working

Hi, I'm trying to use this library but it doesn't seem to be working for me. I suspect it is because I installed the GPU variant, but I'm not entirely sure, since the last step (cuDNN) does not come with an installer on Windows. Here is the error I get when running it:

Say something...
Process Process-1:
Traceback (most recent call last):
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 578, in _transcription_worker
transcription = " ".join(seg.text for seg in segments)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 578, in
transcription = " ".join(seg.text for seg in segments)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\faster_whisper\transcribe.py", line 508, in generate_segments
encoder_output = self.encode(segment)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\faster_whisper\transcribe.py", line 767, in encode
return self.model.encode(features, to_cpu=to_cpu)
RuntimeError: Library cublas64_12.dll is not found or cannot be loaded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py", line 314, in _bootstrap
self.run()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 581, in _transcription_worker
except faster_whisper.WhisperError as e:
AttributeError: module 'faster_whisper' has no attribute 'WhisperError'
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 312, in _recv_bytes
nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\Users\rober\RealtimeSTT\tests\realtimestt_test.py", line 6, in
while (True): print(recorder.text(), end=" ", flush=True)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 825, in text
return self.transcribe()
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 777, in transcribe
status, result = self.parent_transcription_pipe.recv()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 250, in recv
buf = self._recv_bytes()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 321, in _recv_bytes
raise EOFError
EOFError

openai.ChatCompletion no longer supported

I ran tests\minimalistic_talkbot.py
RealTimeSTT 0.1.8
RealTimeTTS 0.3.4
openai 1.6.0

I got this error:

RealTimeSTT: root - WARNING - error in play() with engine azure:

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. pip install openai==0.28

A detailed migration guide is available here: openai/openai-python#742

Traceback: Traceback (most recent call last):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\text_to_stream.py", line 308, in play
for sentence in chunk_generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\text_to_stream.py", line 552, in _synthesis_chunk_generator
for chunk in generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\stream2sentence\stream2sentence.py", line 193, in generate_sentences
for char in _generate_characters(generator, log_characters):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\stream2sentence\stream2sentence.py", line 85, in _generate_characters
for chunk in generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\threadsafe_generators.py", line 237, in next
token = next(self.generator)
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\threadsafe_generators.py", line 147, in next
self._current_str = next(self._current_iterator)
File "R:\projects\RealtimeSTT\tests\minimalistic_talkbot.py", line 11, in generate
for chunk in openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, stream=True):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\openai\lib_old_api.py", line 39, in call
raise APIRemovedInV1(symbol=self._symbol)
openai.lib._old_api.APIRemovedInV1:

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. pip install openai==0.28

A detailed migration guide is available here: openai/openai-python#742

Cannot run example browserclient

I get this error when using the example from the browser client folder:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/hieunguyenminh/CODE ALL/TalkToListen/STT-VAD/venv/lib/python3.9/site-packages/RealtimeSTT/audio_recorder.py", line 985, in _recording_worker
    while (self.audio_queue.qsize() >
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
  • I tried Python 3.11 and 3.9 but they both have this. I wonder what version you use that has this qsize() supported?
  • I want to make an app that connects a client and server; what example do you recommend?

Thank you for creating this app, this helps me a lot!

main transcription model path

Hi! I've been trying to find the path where the models are downloaded, and to maybe set a different path.

code from example
model="small.en", language="en", wake_words="jarvis"

No output shown, no logs.

I ran the websocket server on host 0.0.0.0 and connected to it from my machine. It shows some errors, then says "Waiting for clients"; the client connects without problems, but it doesn't show any output.

└─ ... ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround40
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround41
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround50
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround51
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround71
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM dmix
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
[2024-04-15 17:33:11.582] [ctranslate2] [thread 854988] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
[2024-04-15 17:33:12.901] [ctranslate2] [thread 855019] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
└─ OK
waiting for clients
└─ OK. 
waiting for sentence
└─ ... 

In the client I see "vad started", and it is stuck there.

Apple Neural Engine integration?

Thanks for this amazing work.

I have a Mac, and since it has a Neural Engine to leverage, I was wondering if there is any way of integrating that with this module. Would this be a possible feature addition?

Interval error

It's mostly working fine, but I got an error several times when I was using the function AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished).

The error:

Traceback (most recent call last):
  File "C:\ooba-voice\prod\main.py", line 32, in <module>
- transcribing    main()
  File "C:\ooba-voice\prod\main.py", line 27, in main
    recorder.text(on_transcription_finished=on_transcription_finished)
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 543, in text
    threading.Thread(target=on_transcription_finished, args=(self.transcribe(),)).start()
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 506, in transcribe
    self._set_state("transcribing")
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 1004, in _set_state
    self.halo._interval = 50
AttributeError: 'NoneType' object has no attribute '_interval'

What is the problem?

This has nothing to do with the above, but it often stops recording while I'm still speaking. Could you teach me how to solve this problem too?
...and I cannot kill the process completely with Ctrl+C from the Windows 11 cmd.

summary

  • AttributeError: 'NoneType' object has no attribute '_interval' (AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished))
  • Stop recording while still speaking (AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished))
  • Cannot kill a process completely with Ctrl+C from Windows 11 cmd

Shutdown issue

Hello, I have a problem with the shutdown method when using use_microphone=False; it always gets stuck on

    logging.debug('Finishing recording thread')
    if self.recording_thread:
        self.recording_thread.join()

Example code:

if __name__ == '__main__':
    import pyaudio
    import threading
    from RealtimeSTT import AudioToTextRecorder
    import wave
    import time

    import logging


    recorder = None
    recorder_ready = threading.Event()

    recorder_config = {
        'spinner': False,
        'use_microphone': False,
        'model': "tiny.en",
        'language': 'en',
        'silero_sensitivity': 0.4,
        'webrtc_sensitivity': 2,
        'post_speech_silence_duration': 0.7,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0
    }

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    CHUNK = 1024

    REALTIMESTT = True


    def recorder_thread():
        global recorder
        print("Initializing RealtimeSTT...")
        recorder = AudioToTextRecorder(**recorder_config,level=logging.DEBUG)
        print("RealtimeSTT initialized")
        recorder_ready.set()
        while True:
            full_sentence = recorder.text()
            if full_sentence:
                print(f"\rSentence: {full_sentence}")




    recorder_thread = threading.Thread(target=recorder_thread)
    recorder_thread.start()
    recorder_ready.wait()
    with wave.open('Iiterviewing.wav', 'rb') as wav_file:
        assert wav_file.getnchannels() == CHANNELS
        assert wav_file.getsampwidth() == pyaudio.get_sample_size(FORMAT)
        assert wav_file.getframerate() == RATE
        data = wav_file.readframes(CHUNK)
        while data:
            time.sleep(0.1)
            recorder.feed_audio(data)
            data = wav_file.readframes(CHUNK)
    print("before")
    recorder.shutdown()
    print("after")


Add a "on_recorded" function OR fix on_recorded_chunk

on_recorded_chunk more or less ignores the voice activation process and is called as often as the CPU allows per second, providing 1 kB of data without any voice whatsoever.

I don't see any use case where this is helpful. On the other hand, a full data-exporting function at the end of voice activation makes much more sense, since this
a) has real user data instead of 1 kB chunks of white noise, and
b) this real data can, e.g., be sent to an external speech-to-text database or just to a server for later usage / further training.

Or am I thinking about something wrong here?

Do I actually need NVIDIA CUDA 12 rather than 11.8?

I just tried the instructions for implementing GPU support, i.e. installing NVIDIA CUDA Toolkit 11.8 and NVIDIA cuDNN 8.7.0 for CUDA 11.x, specifically as per the readme (which involves some manual file moving and PATH environment updating on Windows, per NVIDIA's instructions at https://docs.nvidia.com/deeplearning/cudnn/installation/windows.html). I restarted since then to ensure PATH updated properly everywhere, confirmed the library can be found in the proper place from my venv via which cublas64_11.dll (I use Git Bash for my terminal in VS Code, but where via cmd works too), and then ran the pytorch uninstall/reinstall pip commands specifying CUDA 11.8.

Then I tried to run a super simple test script:

from RealtimeSTT import AudioToTextRecorder

def process_text(text):
  print(text, end=" ", flush=True)

if __name__ == '__main__':
  with AudioToTextRecorder(
    spinner=False,
    model="tiny.en",
    language="en",
    # enable_realtime_transcription=True,
    realtime_model_type="tiny.en"
  ) as recorder:
    print("Say something...")
    while True:
      recorder.text(process_text)

But got this error:

Exception: Library cublas64_12.dll is not found or cannot be loaded

The stack trace stops in RealtimeSTT, so I manually hunted around in the .venv files in VS Code for a reference to CUDA 12. The only reference matching a regex of cublas(.*)12 in ./.venv/* is the METADATA file for PyTorch, i.e. .venv\Lib\site-packages\torch-2.2.2+cu118.dist-info\METADATA. Specifically:

Metadata-Version: 2.1
Name: torch
Version: 2.2.2+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Keywords: pytorch,machine learning
Classifier: ...
...
Requires-Dist: nvidia-cuda-nvrtc-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 ==8.9.2.26 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 ==12.1.3.1 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 ==11.0.2.54 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 ==10.3.2.106 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 ==11.4.5.107 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 ==12.1.0.106 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 ==2.19.3 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
...

I'm fairly inexperienced at poking around to resolve Python's specific dependency hell 😅 so I'm not sure if this represents the cause of the issue or not.

I tried the uninstall, then pip cache purge, and then the re-install in case a cached wheel was the issue, but I still have the problem.
