
Comments (12)

arunman1kandan commented on June 11, 2024

Thank you @trungkienbkhn. I didn't know it supports only 16000 Hz; I'll try it ASAP and let you know if I face any issues. And yes, it takes chunks of audio; that is how I want to process a long interview recording.
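
Splitting a long interview recording into fixed-size chunks before transcription can be sketched in a few lines of numpy. This is only an illustration: the 30-second chunk length and the helper name are assumptions, not anything faster-whisper requires.

```python
import numpy as np

def chunk_audio(audio, sample_rate=16000, chunk_seconds=30.0):
    """Split a 1-D audio array into fixed-length chunks (last chunk may be shorter)."""
    samples_per_chunk = int(sample_rate * chunk_seconds)
    return [audio[i:i + samples_per_chunk]
            for i in range(0, len(audio), samples_per_chunk)]

# A 75-second recording at 16 kHz splits into three chunks (30 s, 30 s, 15 s).
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks))  # 3
```

Each chunk can then be passed to `model.transcribe` in turn and the segment texts concatenated.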

from faster-whisper.

trungkienbkhn commented on June 11, 2024

@arunman1kandan, I think so, yes, with FW large-v3.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, yes. Feel free to open a new issue if you encounter any other problems.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, you can try the RTX 3090 as shown in the example in the readme and refer to the benchmarks mentioned there.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, hello. Could you show your code?
I think the data you passed to the FW model may be incorrect or too short. You should pass the data as a numpy ndarray to the FW model.
Below is my example; it's not realtime, but it also uses sounddevice for recording:

import numpy as np
import sounddevice as sd

from faster_whisper import WhisperModel

print("Recording started")
duration = 10
sample_rate = 16000
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()
audio_data = audio_data.squeeze()
print("Recording stopped")

model = WhisperModel("tiny", device="cuda")
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Hope that it's helpful for you.


arunman1kandan commented on June 11, 2024

Thanks for helping, @trungkienbkhn. Sure, below is the code I use for real-time transcription with faster-whisper:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel

class April_Transcriber :
    def __init__(self , model_size = "large-v3" , sample_rate = 44100) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="int8_float16")
        self.is_recording  = False

    def on_press(self , key):
        if key==keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self , key):
        if key==keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False
            
    def record_audio(self):
        recording = np.array([] , dtype="float64").reshape(0 , 2)
        frames_per_buffer = int(self.sample_rate * 0.1)

        with keyboard.Listener(on_press=self.on_press , on_release= self.on_release) as listener : 
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer , samplerate=self.sample_rate , channels=2 , dtype="float64")
                    sd.wait()
                    recording = np.vstack([recording , chunk])
                
                if not self.is_recording and len(recording) > 0 :
                    break
            listener.join()

        return recording
    
    def save_temp_audio(self , recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False , suffix=".wav")
        write(temp_file.name , self.sample_rate , recording)
        return temp_file.name
    
    def transcribe_audio(self , path):
        segments , info = self.model.transcribe(path , beam_size=5)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription+=segment.text + " "
        os.remove(path)
        return full_transcription
    
    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            file_path = self.save_temp_audio(recording)
            self.transcribe_audio(file_path)
            print("Press space to record again")

if __name__ == "__main__":
    transcriber  = April_Transcriber()
    transcriber.run()
    


trungkienbkhn commented on June 11, 2024

@arunman1kandan, the default sample rate of the Whisper model is 16000 Hz, not 44100. I edited your code as below:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel


class April_Transcriber:
    def __init__(self , model_size = "large-v3" , sample_rate=16000) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="int8_float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 5
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)

        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()

                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()

        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            self.transcribe_audio(recording)
            print("Press space to record again")


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()
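
If a microphone records more reliably at 44100 Hz, another option is to record at the native rate and resample to 16 kHz before transcription. A minimal sketch using scipy; the helper name `to_16k` is hypothetical:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio, orig_rate=44100, target_rate=16000):
    """Resample a 1-D signal to the 16 kHz that Whisper expects."""
    g = gcd(target_rate, orig_rate)  # 16000/44100 reduces to 160/441
    resampled = resample_poly(audio, up=target_rate // g, down=orig_rate // g)
    return resampled.astype(np.float32)

one_second = np.zeros(44100, dtype=np.float32)  # 1 s at 44.1 kHz
print(len(to_16k(one_second)))  # 16000, i.e. 1 s at 16 kHz
```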

But I think your approach is not realtime; it just transcribes a small audio chunk on each press of the spacebar.
You can try this example from the sounddevice module for a realtime implementation.
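
The sounddevice pattern referred to above is callback-based: the input stream pushes small blocks onto a queue while a consumer drains the queue and feeds the model. A rough sketch of that plumbing; the stream itself is commented out because it needs a real input device, and the names are illustrative:

```python
import queue

import numpy as np

audio_q = queue.Queue()

def callback(indata, frames, time, status):
    """InputStream callback: copy each incoming block onto the queue."""
    if status:
        print(status)
    audio_q.put(indata.copy())

def drain(q):
    """Concatenate all queued blocks into one 1-D float32 array."""
    blocks = []
    while not q.empty():
        blocks.append(q.get())
    if not blocks:
        return np.empty(0, dtype=np.float32)
    return np.concatenate(blocks).squeeze()

# With an audio device available, the stream would look like:
# import sounddevice as sd
# with sd.InputStream(samplerate=16000, channels=1, dtype="float32", callback=callback):
#     ...  # periodically call drain(audio_q) and pass the array to model.transcribe
```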


arunman1kandan commented on June 11, 2024

@trungkienbkhn Thanks mate, it works like a charm; I just checked it out. Also, does the real-time transcription example you provided work in noisy environments too? Can the large-v3 model deal with ambient noise?
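
On noise: the model itself has no noise gate, so heavy ambient noise can produce spurious text. faster-whisper's `transcribe` accepts `vad_filter=True` to skip non-speech via a VAD, which is usually the right tool. Purely to illustrate the idea, a crude energy gate can zero out near-silent frames; the function name, frame length, and threshold below are illustrative, not part of faster-whisper:

```python
import numpy as np

def energy_gate(audio, sample_rate=16000, frame_ms=20, threshold=0.01):
    """Zero out frames whose RMS energy is below a (tunable) threshold.
    A crude pre-filter; real setups usually use a proper VAD instead."""
    frame_len = int(sample_rate * frame_ms / 1000)
    out = audio.copy()
    for start in range(0, len(audio), frame_len):
        frame = audio[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) < threshold:
            out[start:start + frame_len] = 0.0
    return out

quiet = np.full(1600, 0.001, dtype=np.float32)  # near-silent hiss
loud = np.full(1600, 0.5, dtype=np.float32)     # speech-level signal
gated = energy_gate(np.concatenate([quiet, loud]))
# the quiet half is zeroed, the loud half passes through unchanged
```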


arunman1kandan commented on June 11, 2024

Alright bro @trungkienbkhn, thanks for your help. Shall I close this issue?


arunman1kandan commented on June 11, 2024

Hello there, @trungkienbkhn! The model works fine for me, but transcription seems slow even for the ~5 s audio you mentioned earlier. It takes approximately 5-6 seconds to process the audio; I'm not sure if that's normal, since this is the first time I'm trying a local Speech-To-Text model. Here's the code:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel

os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"


class April_Transcriber:
    def __init__(self , model_size = "large-v3" , sample_rate=16000) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 4
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)

        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()

                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()

        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            return self.transcribe_audio(recording)


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()

and here are my laptop's specs:

Intel i5 1200H (12 cores and 16 logical processors)
16 GB DDR4 (3200 MHz) RAM
Nvidia 3050 Mobile GPU (4 GB dedicated and 8 GB shared memory, 12 GB total)
PS: I also have an Intel Iris GPU.



trungkienbkhn commented on June 11, 2024

@arunman1kandan, if you want to reduce transcription time, you can try a smaller model (tiny, small, ...). The trade-off is that quality will decrease a bit. Another option is a high-end GPU to increase computation speed (e.g. A100, V100).
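
For comparing model sizes or GPUs, a handy metric is the real-time factor (processing time divided by audio duration); below 1.0 means transcription keeps up with real time. The numbers reported above (~5.5 s for a ~4 s clip) work out to roughly 1.4:

```python
import time

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means transcription keeps up with real time."""
    return processing_seconds / audio_seconds

print(round(real_time_factor(5.5, 4.0), 2))  # 1.38

# To measure it, wrap the transcribe call (segments are a lazy generator,
# so they must be consumed for the timing to be meaningful):
# t0 = time.perf_counter()
# segments, info = model.transcribe(audio_data)
# text = " ".join(s.text for s in segments)
# rtf = real_time_factor(time.perf_counter() - t0, duration)
```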


arunman1kandan commented on June 11, 2024

@trungkienbkhn Sure mate, but is there a baseline GPU that reliably achieves the advertised speed?

