Comments (12)
Thank you @trungkienbkhn. I didn't know that it supports only 16000 Hz; I'll try it asap and let you know if I face any issues. And yes, it takes chunks of audio, which is how I want to process a long interview's voice lines.
from faster-whisper.
@arunman1kandan , I think yes, with FW large-v3.
@arunman1kandan , yes. Feel free to open a new issue if you encounter any other problems.
@arunman1kandan , you can try the RTX 3090 as shown in the example in the readme, and refer to the benchmarks mentioned there.
@arunman1kandan , hello. Could you show your code?
I think maybe the data you passed to the FW model is incorrect or too short. You should pass the data as a NumPy ndarray to the FW model.
Below is my example; it's not realtime, but it also uses sounddevice for recording:
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

print("Recording started")
duration = 10
sample_rate = 16000
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()
audio_data = audio_data.squeeze()
print("Recording stopped")

model = WhisperModel("tiny", device="cuda")
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Hope that it's helpful for you.
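If your audio was captured at a different rate (e.g. 44100 Hz), one option is to resample it to the 16 kHz that Whisper expects before passing the ndarray to FW. This is a minimal sketch of my own using scipy, not part of faster-whisper itself; the helper name `to_16k` is just for illustration:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_rate: int) -> np.ndarray:
    """Resample mono float32 audio to the 16 kHz rate Whisper expects.

    resample_poly applies a polyphase filter; the up/down factors are the
    target and source rates reduced by their greatest common divisor.
    """
    g = np.gcd(orig_rate, 16000)
    return resample_poly(audio, 16000 // g, orig_rate // g).astype(np.float32)

# Example: one second of 44.1 kHz audio becomes 16000 samples.
one_sec = np.zeros(44100, dtype=np.float32)
resampled = to_16k(one_sec, 44100)
```

The resampled array can then be passed directly to `model.transcribe(...)` like in the recording example above.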
Thanks for helping, @trungkienbkhn. Sure, below is the code that I use for real-time transcription using faster-whisper:
import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel


class April_Transcriber:
    def __init__(self, model_size="large-v3", sample_rate=44100):
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = np.array([], dtype="float64").reshape(0, 2)
        frames_per_buffer = int(self.sample_rate * 0.1)
        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=2, dtype="float64")
                    sd.wait()
                    recording = np.vstack([recording, chunk])
                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()
        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path, beam_size=5)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            file_path = self.save_temp_audio(recording)
            self.transcribe_audio(file_path)
            print("Press space to record again")


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()
@arunman1kandan , the default sample_rate of the Whisper model is 16000, not 44100. I edited your code as below:
import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel


class April_Transcriber:
    def __init__(self, model_size="large-v3", sample_rate=16000):
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 5
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)
        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()
                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()
        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            self.transcribe_audio(recording)
            print("Press space to record again")


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()
But I think that your idea is not realtime. It's just transcribing a small audio chunk with each press of the spacebar.
You can try this example from the sounddevice module for a realtime implementation.
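As a rough illustration of the streaming idea (this is my own sketch, not the sounddevice example itself): a capture callback pushes fixed-size chunks into a buffer, and a consumer transcribes whenever enough audio has accumulated. The `ChunkBuffer` name and the 3-second window are arbitrary choices for illustration:

```python
import numpy as np

class ChunkBuffer:
    """Accumulate mono float32 chunks; pop a window once enough audio arrives."""

    def __init__(self, sample_rate=16000, window_seconds=3.0):
        self.window = int(sample_rate * window_seconds)
        self.chunks = []
        self.total = 0

    def push(self, chunk: np.ndarray) -> None:
        # In a real app this would be called from the sounddevice
        # InputStream callback, e.g. sd.InputStream(callback=...).
        self.chunks.append(chunk)
        self.total += len(chunk)

    def pop_window(self):
        """Return one window of audio if enough has accumulated, else None."""
        if self.total < self.window:
            return None
        audio = np.concatenate(self.chunks)
        window, rest = audio[: self.window], audio[self.window :]
        self.chunks = [rest] if len(rest) else []
        self.total = len(rest)
        return window

# Simulated use: feed 0.5 s chunks until a 3 s window is ready; the popped
# window is a NumPy ndarray that can go straight to model.transcribe(...).
buf = ChunkBuffer()
for _ in range(6):
    buf.push(np.zeros(8000, dtype=np.float32))
window = buf.pop_window()  # 48000 samples = 3 s at 16 kHz
```

This keeps the microphone callback cheap (just an append) and does the heavy transcription work on the consumer side.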
@trungkienbkhn Thanks mate, it works like a charm; I just checked it out. Also, does the realtime transcription example you provided work in noisy environments too? Can the large-v3 model handle ambient noise?
Alright bro @trungkienbkhn, thanks for your help. Shall I close this issue?
Hello there! @trungkienbkhn I am all good with the model, but the transcription seems slow even for a 5s audio clip like you mentioned earlier. It takes approximately 5-6 seconds to process the audio, and I am not sure if that's normal, since this is the first time I am trying any local model for Speech-to-Text. Here's the code:
import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


class April_Transcriber:
    def __init__(self, model_size="large-v3", sample_rate=16000):
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 4
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)
        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()
                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()
        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            return self.transcribe_audio(recording)


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()
and here's my laptop's specs:
Intel i5 1200H (12 cores and 16 logical processors)
16 GB DDR4 (3200 MHz) RAM
Nvidia 3050 Mobile GPU (dedicated: 4 GB, shared: 8 GB, total: 12 GB)
PS: I also have an Intel Iris GPU.
@arunman1kandan , if you want to reduce transcription time, you can try a smaller model (tiny, small, ...), but the trade-off is that quality will decrease a bit. Another way is to use a high-end GPU to increase computation speed (e.g. A100, V100, ...).
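To see whether a smaller model actually helps on your hardware, you could time the same clip across model sizes. Below is a minimal, generic timing wrapper of my own; the model-comparison loop is left commented out because it assumes faster_whisper, a GPU, and downloaded weights:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical comparison loop (requires faster_whisper and a CUDA GPU):
# from faster_whisper import WhisperModel
# for size in ("tiny", "base", "small", "medium", "large-v3"):
#     model = WhisperModel(size, device="cuda", compute_type="float16")
#     segments, elapsed = timed(lambda: list(model.transcribe(audio_data)[0]))
#     print(size, "%.2fs" % elapsed)
```

Note that the `segments` generator must be fully consumed (e.g. via `list(...)`) for the timing to cover the actual decoding, since transcription in faster-whisper is lazy.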
@trungkienbkhn Sure mate, but is there a baseline GPU that achieves the claimed speeds?