Comments (6)

trungkienbkhn commented on June 11, 2024

@George0828Zhang, hello. Thanks for your idea. I created a new PR to implement this.

George0828Zhang commented on June 11, 2024

> @George0828Zhang, hello. Thanks for your idea. I created a new PR to implement this.

Though I appreciate the quick action, this PR did not fully solve the issue: the tokenizer and the preprocessor still need to be handled. Here's what I'm using; feel free to add it to the PR, and let's hope it gets merged soon.

import json
import os
from inspect import signature
from typing import List, Optional, Union

import ctranslate2
import tokenizers

from faster_whisper import WhisperModel
from faster_whisper.feature_extractor import FeatureExtractor
from faster_whisper.utils import download_model, get_logger


class MyWhisper(WhisperModel):
    def __init__(
        self,
        model_size_or_path: str,
        device: str = "auto",
        device_index: Union[int, List[int]] = 0,
        compute_type: str = "default",
        cpu_threads: int = 0,
        num_workers: int = 1,
        download_root: Optional[str] = None,
        local_files_only: bool = False,
        files: object = None,
        **kwargs
    ):
        """
        """
        self.logger = get_logger()

        tokenizer_bytes, preprocessor_bytes = None, None
        if files:
            # Loading from memory: pop the two files consumed by
            # faster-whisper itself; the rest go to ctranslate2 below.
            model_path = model_size_or_path
            tokenizer_bytes = files.pop("tokenizer.json", None)
            preprocessor_bytes = files.pop("preprocessor_config.json", None)
        elif os.path.isdir(model_size_or_path):
            model_path = model_size_or_path
        else:
            model_path = download_model(
                model_size_or_path,
                local_files_only=local_files_only,
                cache_dir=download_root,
            )

        self.model = ctranslate2.models.Whisper(
            model_path,
            device=device,
            device_index=device_index,
            compute_type=compute_type,
            intra_threads=cpu_threads,
            inter_threads=num_workers,
            files=files,
            **kwargs
        )

        # Tokenizer: prefer the in-memory bytes, then an on-disk
        # tokenizer.json, and finally fall back to the Hugging Face Hub.
        tokenizer_file = os.path.join(model_path, "tokenizer.json")
        if tokenizer_bytes:
            self.hf_tokenizer = tokenizers.Tokenizer.from_buffer(tokenizer_bytes)
        elif os.path.isfile(tokenizer_file):
            self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
        else:
            self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
                "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
            )
        self.feat_kwargs = self._get_feature_kwargs(model_path, preprocessor_bytes)
        self.feature_extractor = FeatureExtractor(**self.feat_kwargs)
        self.num_samples_per_token = self.feature_extractor.hop_length * 2
        self.frames_per_second = (
            self.feature_extractor.sampling_rate // self.feature_extractor.hop_length
        )
        self.tokens_per_second = (
            self.feature_extractor.sampling_rate // self.num_samples_per_token
        )
        self.input_stride = 2
        self.time_precision = 0.02
        self.max_length = 448

    def _get_feature_kwargs(self, model_path, preprocessor_bytes=None) -> dict:
        preprocessor_config_file = os.path.join(model_path, "preprocessor_config.json")
        config = {}
        if preprocessor_bytes or os.path.isfile(preprocessor_config_file):
            try:
                if preprocessor_bytes:
                    config = json.loads(preprocessor_bytes)
                else:
                    with open(preprocessor_config_file, "r", encoding="utf-8") as json_file:
                        config = json.load(json_file)
                valid_keys = signature(FeatureExtractor.__init__).parameters.keys()
                config = {k: v for k, v in config.items() if k in valid_keys}
            except json.JSONDecodeError as e:
                self.logger.warning(
                    "Could not load preprocessor_config.json: %s", str(e)
                )

        return config
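
For reference, here's a minimal usage sketch of the class above (it assumes the same imports). The NAS mount point and the audio file name are hypothetical:

model_dir = "/mnt/nas/whisper-large-v3-ct2"  # hypothetical NAS mount
names = [
    "config.json",
    "model.bin",
    "tokenizer.json",
    "vocabulary.txt",
    "preprocessor_config.json",  # optional
]
files = {name: open(os.path.join(model_dir, name), "rb").read() for name in names}

# With `files` given, the first argument acts as an identifier, not a disk path.
model = MyWhisper("large-v3", files=files)
segments, info = model.transcribe("audio.wav")  # hypothetical audio file
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))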

trungkienbkhn commented on June 11, 2024

@George0828Zhang, I think that if you want to handle the tokenizer and preprocessor with other initialization data, you could edit the tokenizer.json and preprocessor_config.json files in your custom FW model after conversion, instead of using the default files.

George0828Zhang commented on June 11, 2024

@trungkienbkhn I'm not "handl[ing] tokenizer and preprocessor with other initialization data"; I'm not modifying anything in any way. I'm loading these files from memory (rather than disk), i.e. from a dictionary like so:

files = {
    "config.json": open("config.json", "rb").read(),
    "tokenizer.json": open("tokenizer.json", "rb").read(),
    "model.bin": open("model.bin", "rb").read(),
    "vocabulary.txt": open("vocabulary.txt", "rb").read(),
    # preprocessor_config.json is optional
}

Naively passing this dict to the underlying ctranslate2.models.Whisper DOES NOT WORK.
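
To spell out the failure mode (my reading of the constructor shown above): ctranslate2.models.Whisper itself does accept a files argument, but the surrounding WhisperModel code still loads the tokenizer and the preprocessor config from a directory on disk, so the in-memory bytes never reach them. Simplified illustration:

# Simplified: the ctranslate2 model can take the bytes...
model = ctranslate2.models.Whisper(model_path, files=files)
# ...but the tokenizer is loaded from a file path, which does not
# exist when everything lives in memory:
tokenizer_file = os.path.join(model_path, "tokenizer.json")
hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)  # fails: nothing on disk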

You might ask: why read the files like this? Why not pass a path, or let faster-whisper download the model?
Well, this is specifically for the use case where the service (1) has no public internet access, (2) stores the model files on a NAS, and (3) has limited local storage. The solution is to read the bytes from the NAS over the local network, then load the model from those bytes.

I provided what does work, so it's not really an issue.

trungkienbkhn commented on June 11, 2024

@George0828Zhang, okay, I updated my PR.

George0828Zhang commented on June 11, 2024

Since the PR got merged, I'm closing this. Thanks.
