Comments (6)
@George0828Zhang , hello. Thanks for your idea. I created a new PR to implement this.
from faster-whisper.
> @George0828Zhang , hello. Thanks for your idea. I created a new PR to implement this.
Though I appreciate the quick action, this PR did not fully solve the issue. There are still the tokenizer and preprocessor that need to be handled. Here's what I'm using; feel free to add it to the PR, and let's hope it gets merged soon.
```python
import json
import os
from inspect import signature
from typing import List, Optional, Union

import ctranslate2
import tokenizers

from faster_whisper import WhisperModel
from faster_whisper.feature_extractor import FeatureExtractor
from faster_whisper.utils import download_model, get_logger


class MyWhisper(WhisperModel):
    def __init__(
        self,
        model_size_or_path: str,
        device: str = "auto",
        device_index: Union[int, List[int]] = 0,
        compute_type: str = "default",
        cpu_threads: int = 0,
        num_workers: int = 1,
        download_root: Optional[str] = None,
        local_files_only: bool = False,
        files: object = None,
        **kwargs,
    ):
        """Same as WhisperModel, but accepts in-memory model files via `files`."""
        self.logger = get_logger()

        tokenizer_bytes, preprocessor_bytes = None, None
        if files:
            # Loading from memory: pull the tokenizer and preprocessor bytes
            # out of the dict before handing the rest to ctranslate2.
            model_path = model_size_or_path
            tokenizer_bytes = files.pop("tokenizer.json", None)
            preprocessor_bytes = files.pop("preprocessor_config.json", None)
        elif os.path.isdir(model_size_or_path):
            model_path = model_size_or_path
        else:
            model_path = download_model(
                model_size_or_path,
                local_files_only=local_files_only,
                cache_dir=download_root,
            )

        self.model = ctranslate2.models.Whisper(
            model_path,
            device=device,
            device_index=device_index,
            compute_type=compute_type,
            intra_threads=cpu_threads,
            inter_threads=num_workers,
            files=files,
            **kwargs,
        )

        tokenizer_file = os.path.join(model_path, "tokenizer.json")
        if tokenizer_bytes:
            self.hf_tokenizer = tokenizers.Tokenizer.from_buffer(tokenizer_bytes)
        elif os.path.isfile(tokenizer_file):
            self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
        else:
            self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
                "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
            )

        self.feat_kwargs = self._get_feature_kwargs(model_path, preprocessor_bytes)
        self.feature_extractor = FeatureExtractor(**self.feat_kwargs)
        self.num_samples_per_token = self.feature_extractor.hop_length * 2
        self.frames_per_second = (
            self.feature_extractor.sampling_rate // self.feature_extractor.hop_length
        )
        self.tokens_per_second = (
            self.feature_extractor.sampling_rate // self.num_samples_per_token
        )
        self.input_stride = 2
        self.time_precision = 0.02
        self.max_length = 448

    def _get_feature_kwargs(self, model_path, preprocessor_bytes=None) -> dict:
        preprocessor_config_file = os.path.join(model_path, "preprocessor_config.json")
        config = {}
        if preprocessor_bytes or os.path.isfile(preprocessor_config_file):
            try:
                if preprocessor_bytes:
                    config = json.loads(preprocessor_bytes)
                else:
                    with open(preprocessor_config_file, "r", encoding="utf-8") as json_file:
                        config = json.load(json_file)
                # Keep only the keys that FeatureExtractor.__init__ actually accepts.
                valid_keys = signature(FeatureExtractor.__init__).parameters.keys()
                config = {k: v for k, v in config.items() if k in valid_keys}
            except json.JSONDecodeError as e:
                self.logger.warning(
                    "Could not load preprocessor_config.json: %s", str(e)
                )
        return config
```
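The key-filtering step in `_get_feature_kwargs` is what makes an arbitrary `preprocessor_config.json` safe to splat into the constructor: it drops every key the constructor does not accept. A minimal, self-contained sketch of that technique (the `FeatureExtractor` here is a stand-in with an assumed signature, not the real `faster_whisper` class):

```python
import json
from inspect import signature


class FeatureExtractor:
    # Stand-in with a constructor resembling faster_whisper's FeatureExtractor.
    def __init__(self, feature_size=80, sampling_rate=16000, hop_length=160):
        self.feature_size = feature_size
        self.sampling_rate = sampling_rate
        self.hop_length = hop_length


# A preprocessor_config.json typically carries extra keys the constructor
# does not accept (e.g. "feature_extractor_type"); those must be filtered out.
raw = json.dumps({
    "feature_size": 128,
    "sampling_rate": 16000,
    "hop_length": 160,
    "feature_extractor_type": "WhisperFeatureExtractor",  # not a ctor kwarg
})

config = json.loads(raw)
valid_keys = signature(FeatureExtractor.__init__).parameters.keys()
config = {k: v for k, v in config.items() if k in valid_keys}

extractor = FeatureExtractor(**config)
print(extractor.feature_size)  # → 128
```

Without the filter, `FeatureExtractor(**config)` would raise a `TypeError` on the unexpected keyword.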
@George0828Zhang , I think that if you want to handle the tokenizer and preprocessor with other initialization data, you could edit the tokenizer.json and preprocessor_config.json files in your custom FW model after conversion, instead of using the default files.
@trungkienbkhn I'm not "handl[ing] tokenizer and preprocessor with other initialization data", I'm not modifying anything in any way. I'm loading these files from memory (rather than disk), i.e. a dictionary like so:
```python
files = {
    "config.json": open("config.json", "rb").read(),
    "tokenizer.json": open("tokenizer.json", "rb").read(),
    "model.bin": open("model.bin", "rb").read(),
    "vocabulary.txt": open("vocabulary.txt", "rb").read(),
    # preprocessor_config.json is optional
}
```
Naively passing this dict to the underlying `ctranslate2.models.Whisper` DOES NOT WORK.
You might ask: why read the files like this? Why not pass a path, or let whisper download them?
Well, this is specifically for the use case where the service (1) has no public internet access, (2) stores the model files on a NAS, and (3) has limited local storage. The solution is to read the bytes from the NAS over the local network, then load the model from those bytes.
I provided what does work, so it's not really an issue.
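Concretely, the loading step under those constraints reduces to reading each file's bytes from the mounted share into a dict, with nothing written to local storage. A minimal sketch (the temp directory here is a hypothetical stand-in for a NAS mount point such as `/mnt/nas/models/...`, and the file contents are dummy bytes):

```python
import os
import tempfile

# Hypothetical stand-in for a NAS mount point.
nas_dir = tempfile.mkdtemp()
for name in ("config.json", "tokenizer.json", "model.bin", "vocabulary.txt"):
    with open(os.path.join(nas_dir, name), "wb") as f:
        f.write(b"dummy bytes")  # real files would come from the converted model

# Read everything into memory once, over the local network; local disk
# only ever sees the mount, never a copy of the model.
files = {
    name: open(os.path.join(nas_dir, name), "rb").read()
    for name in os.listdir(nas_dir)
}

print(sorted(files))  # the dict shape expected by MyWhisper(files=...)
```

The resulting `files` dict is exactly what the `MyWhisper` constructor above consumes.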
@George0828Zhang , okay I updated my PR.
Since the PR got merged, I'm closing this. Thanks.