Comments (6)
Hey @ENDERFUN2,
Imagine you have a box full of Legos and want to build a spaceship. To be usable, the Legos need to be loose in the box. Occasionally, though, the big box comes with an extra inner box that holds no Legos of its own; it is just an additional layer of packaging.
The sounddevice.rec function in this code is like being handed that Lego box. It can come with the extra, empty layer (the extra dimension). Calling squeeze is like opening the big box and throwing the inner one away, so that all you have left to work with is the Legos, that is, the audio data.
This matters because WhisperModel.transcribe, which is what you build the spaceship with, only works with loose Legos and not with boxes inside boxes. Squeezing removes the extra box so that everything works as intended.
This is how I understand squeeze to work.
from faster-whisper.
@ENDERFUN2, hello. To handle silence in recorded audio, you can try using the vad_filter option.
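A minimal sketch of what turning on the VAD filter can look like. `vad_filter` and `vad_parameters` (with `min_silence_duration_ms`) are keyword arguments of `WhisperModel.transcribe`; the helper name `transcribe_skipping_silence` and the 500 ms default are assumptions made for illustration only.

```python
def transcribe_skipping_silence(model, audio, min_silence_ms=500):
    """Transcribe `audio` with the VAD filter on (hypothetical helper).

    `model` is expected to be a faster_whisper.WhisperModel, and `audio`
    a file path or a 1D float32 NumPy array sampled at 16000 Hz.
    """
    segments, _info = model.transcribe(
        audio,
        vad_filter=True,  # skip stretches the VAD classifies as non-speech
        vad_parameters=dict(min_silence_duration_ms=min_silence_ms),
    )
    # `segments` is a lazy generator; building the list runs the transcription.
    return [(segment.start, segment.end, segment.text) for segment in segments]


# Usage with a real model (not executed here):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("tiny", device="cpu")
#   for start, end, text in transcribe_skipping_silence(model, audio_data):
#       print("[%.2fs -> %.2fs] %s" % (start, end, text))
```

Taking the model as a parameter keeps the helper independent of how the model is loaded, so the same function works for any model size or device.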
To avoid saving the audio to a temporary file, pass the audio data to the FW model as a NumPy ndarray. Here is my example:
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

print("Recording started")
duration = 10  # seconds
sample_rate = 16000  # rate FW expects for raw arrays
# sd.rec returns an array of shape (frames, channels), here (160000, 1)
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()  # block until the recording has finished
audio_data = audio_data.squeeze()  # drop the channel axis: (160000, 1) -> (160000,)
print("Recording stopped")

model = WhisperModel("tiny", device="cpu")
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Also, @ENDERFUN2, make sure you record at a sampling rate of 16000 Hz and nothing else. As @trungkienbkhn said, FW does not support any other sample rate for audio passed in as an array.
Okay, so vad_filter works perfectly. I don't know why I hadn't found it before...
Also, that sample code from @trungkienbkhn turned out to be a game changer. But, as a curious man: why is squeeze required? This is my first serious project in Python, since all my previous ones were in Java or C++, so I don't really understand it. And why does the sample rate have to be set to 16000? When I pass an audio file with a 48000 sample rate, it transcribes the audio 100% perfectly. I would be so grateful for an explanation.
@ENDERFUN2, FYI, you can see this comment to better understand why you should use sample_rate=16000. If I use sr=48000 in my example here, it does not work.
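As far as I understand it, the difference from the 48000 Hz file case is that FW decodes and resamples audio files itself, while a raw array carries no sample-rate information and is taken to be 16000 Hz. If the recording device only supports 48000 Hz, the array has to be resampled before it is passed in. Below is a rough sketch using NumPy only; `downsample_by_averaging` is a hypothetical helper, and averaging groups of samples is a crude stand-in for a proper resampler such as `scipy.signal.resample_poly`.

```python
import numpy as np

TARGET_RATE = 16000  # rate assumed for raw arrays


def downsample_by_averaging(audio, source_rate, target_rate=TARGET_RATE):
    """Crude downsampling for integer ratios, e.g. 48000 -> 16000.

    Each output sample is the mean of `factor` consecutive input samples.
    """
    if source_rate % target_rate != 0:
        raise ValueError("source_rate must be an integer multiple of target_rate")
    factor = source_rate // target_rate
    # Drop trailing samples that do not fill a complete group.
    usable = len(audio) - (len(audio) % factor)
    grouped = audio[:usable].reshape(-1, factor)
    return grouped.mean(axis=1).astype(np.float32)


one_second_48k = np.zeros(48000, dtype=np.float32)  # stand-in for a 48 kHz recording
one_second_16k = downsample_by_averaging(one_second_48k, 48000)
print(one_second_16k.shape)  # (16000,)
```

Recording directly at 16000 Hz, as in the example above, avoids this step entirely and is the simpler choice when the device allows it.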
As for why squeeze() is needed: sd.rec() returns an array with shape (duration * sample_rate, 1) because it records mono audio, which gives a 2D array in which one dimension has size 1. However, FW requires a 1D array as input, so we need squeeze() to reshape it.
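The shape change can be seen without a microphone. In this small sketch a zero-filled array stands in for the result of sd.rec() with channels=1; the variable names are illustrative only.

```python
import numpy as np

sample_rate = 16000
duration = 2  # seconds

# Stand-in for sd.rec(..., channels=1): shape is (frames, channels).
recorded = np.zeros((sample_rate * duration, 1), dtype=np.float32)
print(recorded.shape)  # (32000, 1)

# squeeze() drops every axis of length 1, leaving a flat 1D signal.
mono = recorded.squeeze()
print(mono.shape)  # (32000,)

# More explicit spellings that only ever touch the channel axis.
assert np.array_equal(mono, recorded[:, 0])
assert np.array_equal(mono, recorded.squeeze(axis=1))
```

For someone coming from Java or C++, this is roughly the difference between a `float[n][1]` and a `float[n]`: the same numbers, with one level of nesting removed.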
Well, it now makes a lot of sense. Thank you both, @trungkienbkhn and @arunman1kandan, for your help, although abstracting it to Legos wasn't necessary; I just didn't understand ndarrays. Also, after some refactoring I noticed that my code is one big pile of garbage and that I should restructure it ASAP. Your answers gave me an important insight.