
Comments (12)

arunman1kandan commented on June 11, 2024

Thank you @trungkienbkhn. I didn't know it supports only 16000 Hz; I'll try it ASAP and let you know if I face any issues. And yes, it takes chunks of audio; that is how I want to process a long interview recording.
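
Splitting a long interview recording into fixed-size chunks before transcription can be sketched in a few lines of numpy. This is only an illustration: the 30-second chunk length and the helper name are assumptions, not anything faster-whisper requires.

```python
import numpy as np

def chunk_audio(audio, sample_rate=16000, chunk_seconds=30.0):
    """Split a 1-D audio array into fixed-length chunks (last chunk may be shorter)."""
    samples_per_chunk = int(sample_rate * chunk_seconds)
    return [audio[i:i + samples_per_chunk]
            for i in range(0, len(audio), samples_per_chunk)]

# A 75-second recording at 16 kHz splits into three chunks (30 s, 30 s, 15 s).
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks))  # 3
```

Each chunk can then be passed to `model.transcribe` in turn and the segment texts concatenated.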

from faster-whisper.

trungkienbkhn commented on June 11, 2024

@arunman1kandan, I think so, yes, with FW large-v3.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, yes. Feel free to open a new issue if you encounter any other problems.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, you can try the RTX 3090 as shown in the example in the readme and refer to the benchmarks mentioned there.


trungkienbkhn commented on June 11, 2024

@arunman1kandan, hello. Could you show your code?
I think the data you passed to the FW model may be incorrect or too short. You should pass the data as a numpy ndarray to the FW model.
Below is my example; it's not realtime, but it also uses sounddevice for recording:

import numpy as np
import sounddevice as sd

from faster_whisper import WhisperModel

print("Recording started")
duration = 10
sample_rate = 16000
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()
audio_data = audio_data.squeeze()
print("Recording stopped")

model = WhisperModel("tiny", device="cuda")
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Hope that it's helpful for you.


arunman1kandan commented on June 11, 2024

Thanks for helping, @trungkienbkhn. Sure, below is the code I use for real-time transcription with faster-whisper:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel

class April_Transcriber :
    def __init__(self , model_size = "large-v3" , sample_rate = 44100) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="int8_float16")
        self.is_recording  = False

    def on_press(self , key):
        if key==keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self , key):
        if key==keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False
            
    def record_audio(self):
        recording = np.array([] , dtype="float64").reshape(0 , 2)
        frames_per_buffer = int(self.sample_rate * 0.1)

        with keyboard.Listener(on_press=self.on_press , on_release= self.on_release) as listener : 
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer , samplerate=self.sample_rate , channels=2 , dtype="float64")
                    sd.wait()
                    recording = np.vstack([recording , chunk])
                
                if not self.is_recording and len(recording) > 0 :
                    break
            listener.join()

        return recording
    
    def save_temp_audio(self , recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False , suffix=".wav")
        write(temp_file.name , self.sample_rate , recording)
        return temp_file.name
    
    def transcribe_audio(self , path):
        segments , info = self.model.transcribe(path , beam_size=5)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription+=segment.text + " "
        os.remove(path)
        return full_transcription
    
    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            file_path = self.save_temp_audio(recording)
            self.transcribe_audio(file_path)
            print("Press space to record again")

if __name__ == "__main__":
    transcriber  = April_Transcriber()
    transcriber.run()
    


trungkienbkhn commented on June 11, 2024

@arunman1kandan, the default sample rate of the Whisper model is 16000 Hz, not 44100. I edited your code as below:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel


class April_Transcriber:
    def __init__(self , model_size = "large-v3" , sample_rate=16000) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="int8_float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 5
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)

        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()

                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()

        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        print("Detected language '%s' with probability of '%f'" % (info.language, info.language_probability))
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            self.transcribe_audio(recording)
            print("Press space to record again")


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()
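
If a microphone records more reliably at 44100 Hz, another option is to record at the native rate and resample to 16 kHz before transcription. A minimal sketch using scipy; the helper name `to_16k` is hypothetical:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio, orig_rate=44100, target_rate=16000):
    """Resample a 1-D signal to the 16 kHz that Whisper expects."""
    g = gcd(target_rate, orig_rate)  # 16000/44100 reduces to 160/441
    resampled = resample_poly(audio, up=target_rate // g, down=orig_rate // g)
    return resampled.astype(np.float32)

one_second = np.zeros(44100, dtype=np.float32)  # 1 s at 44.1 kHz
print(len(to_16k(one_second)))  # 16000, i.e. 1 s at 16 kHz
```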

But I think your approach is not realtime; it just transcribes a small audio chunk on each press of the spacebar.
You can try this example from the sounddevice module for a realtime implementation.
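
The sounddevice pattern referred to above is callback-based: the input stream pushes small blocks onto a queue while a consumer drains the queue and feeds the model. A rough sketch of that plumbing; the stream itself is commented out because it needs a real input device, and the names are illustrative:

```python
import queue

import numpy as np

audio_q = queue.Queue()

def callback(indata, frames, time, status):
    """InputStream callback: copy each incoming block onto the queue."""
    if status:
        print(status)
    audio_q.put(indata.copy())

def drain(q):
    """Concatenate all queued blocks into one 1-D float32 array."""
    blocks = []
    while not q.empty():
        blocks.append(q.get())
    if not blocks:
        return np.empty(0, dtype=np.float32)
    return np.concatenate(blocks).squeeze()

# With an audio device available, the stream would look like:
# import sounddevice as sd
# with sd.InputStream(samplerate=16000, channels=1, dtype="float32", callback=callback):
#     ...  # periodically call drain(audio_q) and pass the array to model.transcribe
```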


arunman1kandan commented on June 11, 2024

@trungkienbkhn Thanks mate, it works like a charm; I just checked it out. Also, does the real-time transcription example you provided work in noisy environments too? Can the large-v3 model deal with ambient noise?
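
On noise: the model itself has no noise gate, so heavy ambient noise can produce spurious text. faster-whisper's `transcribe` accepts `vad_filter=True` to skip non-speech via a VAD, which is usually the right tool. Purely to illustrate the idea, a crude energy gate can zero out near-silent frames; the function name, frame length, and threshold below are illustrative, not part of faster-whisper:

```python
import numpy as np

def energy_gate(audio, sample_rate=16000, frame_ms=20, threshold=0.01):
    """Zero out frames whose RMS energy is below a (tunable) threshold.
    A crude pre-filter; real setups usually use a proper VAD instead."""
    frame_len = int(sample_rate * frame_ms / 1000)
    out = audio.copy()
    for start in range(0, len(audio), frame_len):
        frame = audio[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) < threshold:
            out[start:start + frame_len] = 0.0
    return out

quiet = np.full(1600, 0.001, dtype=np.float32)  # near-silent hiss
loud = np.full(1600, 0.5, dtype=np.float32)     # speech-level signal
gated = energy_gate(np.concatenate([quiet, loud]))
# the quiet half is zeroed, the loud half passes through unchanged
```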


arunman1kandan commented on June 11, 2024

Alright bro @trungkienbkhn, thanks for your help. Shall I close this issue?


arunman1kandan commented on June 11, 2024

Hello there, @trungkienbkhn! The model works fine for me, but transcription seems slow even for the ~5 s audio you mentioned earlier. It takes approximately 5-6 seconds to process the audio; I'm not sure if that's normal, since this is the first time I'm trying a local Speech-To-Text model. Here's the code:

import sounddevice as sd
import numpy as np
from pynput import keyboard
from scipy.io.wavfile import write
import tempfile
import os
from faster_whisper import WhisperModel

os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"


class April_Transcriber:
    def __init__(self , model_size = "large-v3" , sample_rate=16000) : 
        self.model_size = model_size
        self.sample_rate = sample_rate
        self.model = WhisperModel(model_size , device="cuda" , compute_type="float16")
        self.is_recording = False

    def on_press(self, key):
        if key == keyboard.Key.space:
            if not self.is_recording:
                self.is_recording = True
                print("Go ahead I am listening")

    def on_release(self, key):
        if key == keyboard.Key.space:
            if self.is_recording:
                self.is_recording = False
                print("Processing...")
                return False

    def record_audio(self):
        recording = []
        duration = 4
        # setting duration to 0.1 is too short to detect audio
        frames_per_buffer = int(self.sample_rate * duration)

        with keyboard.Listener(on_press=self.on_press, on_release=self.on_release) as listener:
            while True:
                if self.is_recording:
                    chunk = sd.rec(frames_per_buffer, samplerate=self.sample_rate, channels=1, dtype=np.float32)
                    sd.wait()
                    recording = chunk.squeeze()

                if not self.is_recording and len(recording) > 0:
                    break
            listener.join()

        return recording

    def save_temp_audio(self, recording):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        write(temp_file.name, self.sample_rate, recording)
        return temp_file.name

    def transcribe_audio(self, path):
        segments, info = self.model.transcribe(path)
        full_transcription = ""
        for segment in segments:
            print(segment.text)
            full_transcription += segment.text + " "
        # os.remove(path)
        return full_transcription

    def run(self):
        print("Please hold spacebar to record")
        while True:
            recording = self.record_audio()
            # file_path = self.save_temp_audio(recording)
            return self.transcribe_audio(recording)


if __name__ == "__main__":
    transcriber = April_Transcriber()
    transcriber.run()

and here are my laptop's specs:

Intel i5 1200H (12 cores and 16 logical processors)
16 GB DDR4 (3200 MHz) RAM
Nvidia 3050 Mobile GPU (4 GB dedicated and 8 GB shared memory, 12 GB total)
PS: I also have an Intel Iris GPU.



trungkienbkhn commented on June 11, 2024

@arunman1kandan, if you want to reduce transcription time, you can try a smaller model (tiny, small, ...). The trade-off is that quality will decrease a bit. Another option is a high-end GPU to increase computation speed (e.g. A100, V100).
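
For comparing model sizes or GPUs, a handy metric is the real-time factor (processing time divided by audio duration); below 1.0 means transcription keeps up with real time. The numbers reported above (~5.5 s for a ~4 s clip) work out to roughly 1.4:

```python
import time

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means transcription keeps up with real time."""
    return processing_seconds / audio_seconds

print(round(real_time_factor(5.5, 4.0), 2))  # 1.38

# To measure it, wrap the transcribe call (segments are a lazy generator,
# so they must be consumed for the timing to be meaningful):
# t0 = time.perf_counter()
# segments, info = model.transcribe(audio_data)
# text = " ".join(s.text for s in segments)
# rtf = real_time_factor(time.perf_counter() - t0, duration)
```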


arunman1kandan commented on June 11, 2024

@trungkienbkhn Sure mate, but is there a baseline GPU that reliably achieves the advertised speed?

