
espnet_model_zoo's Introduction

ESPnet Model Zoo


Utilities for managing the pretrained models created by ESPnet. This function is inspired by the Asteroid pretrained model function.

Install

pip install torch
pip install espnet_model_zoo

Python API for inference

model_name in the following sections should be a huggingface_id or one of the tags in table.csv. Alternatively, you can directly provide a Zenodo URL (e.g., https://zenodo.org/record/xxxxxxx/files/hogehoge.zip?download=1).

ASR

import soundfile
from espnet2.bin.asr_inference import Speech2Text
speech2text = Speech2Text.from_pretrained(
    "model_name",
    # Decoding parameters are not included in the model file
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)
# Confirm the sampling rate is equal to that of the training corpus.
# If not, you need to resample the audio data before feeding it to speech2text
speech, rate = soundfile.read("speech.wav")
nbests = speech2text(speech)

text, *_ = nbests[0]
print(text)
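
Each entry of the n-best list is a (text, token, token_int, hyp) tuple, so you can inspect more than the top transcription. A minimal sketch iterating over all hypotheses (the tuple layout follows the return format of Speech2Text):

# Each result is a (text, token, token_int, hyp) tuple
for text, token, token_int, hyp in nbests:
    print(text)   # decoded transcription
    print(token)  # corresponding subword tokens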

TTS

import soundfile
from espnet2.bin.tts_inference import Text2Speech
text2speech = Text2Speech.from_pretrained("model_name")
speech = text2speech("foobar")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")
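
Depending on your espnet2 version, Text2Speech.from_pretrained can also take a separate vocoder via vocoder_tag; a hedged sketch ("vocoder_name" is a placeholder, and you should check that your installed espnet2 supports this argument):

import soundfile
from espnet2.bin.tts_inference import Text2Speech
text2speech = Text2Speech.from_pretrained(
    model_tag="model_name",
    vocoder_tag="vocoder_name",  # placeholder; e.g., a ParallelWaveGAN tag
)
speech = text2speech("foobar")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")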

Speech separation

import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
separate_speech = SeparateSpeech.from_pretrained(
    "model_name",
    # for segment-wise process on long speech
    segment_size=2.4,
    hop_size=0.8,
    normalize_segment_scale=False,
    show_progressbar=True,
    ref_channel=None,
    normalize_output_wav=True,
)
# Confirm the sampling rate is equal to that of the training corpus.
# If not, you need to resample the audio data before feeding it to separate_speech
speech, rate = soundfile.read("long_speech.wav")
waves = separate_speech(speech[None, ...], fs=rate)

This API can process both short and long audio samples. For long inputs, set the segment_size and hop_size arguments (and optionally normalize_segment_scale and show_progressbar) to perform segment-wise speech enhancement/separation on the input speech. Note that segment-wise processing is disabled by default.
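
For short audio, a minimal sketch without the segment-wise options looks like this (segment-wise processing stays disabled because segment_size and hop_size are omitted):

import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
separate_speech = SeparateSpeech.from_pretrained("model_name")
speech, rate = soundfile.read("short_speech.wav")
# Input shape is (batch, n_samples); the output is a list of separated waveforms
waves = separate_speech(speech[None, ...], fs=rate)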

For older ESPnet (<= 0.10.1)

ASR

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text
d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack("model_name"),
    # Decoding parameters are not included in the model file
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)

TTS

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech
d = ModelDownloader()
text2speech = Text2Speech(**d.download_and_unpack("model_name"))

Speech separation

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech
d = ModelDownloader()
separate_speech = SeparateSpeech(
    **d.download_and_unpack("model_name"),
    # for segment-wise process on long speech
    segment_size=2.4,
    hop_size=0.8,
    normalize_segment_scale=False,
    show_progressbar=True,
    ref_channel=None,
    normalize_output_wav=True,
)

Instruction for ModelDownloader

from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader("~/.cache/espnet")  # Specify cachedir
d = ModelDownloader()  # <module_dir> is used as cachedir by default

To obtain a model, you need to give a huggingface_id or a tag, which is listed in table.csv.

>>> d.download_and_unpack("kamo-naoyuki/mini_an4_asr_train_raw_bpe_valid.acc.best")
{"asr_train_config": <config path>, "asr_model_file": <model path>, ...}

If it is a huggingface_id, you can specify the revision by appending @<revision>:

>>> d.download_and_unpack("kamo-naoyuki/mini_an4_asr_train_raw_bpe_valid.acc.best@<revision>")
{"asr_train_config": <config path>, "asr_model_file": <model path>, ...}

Note that if the model already exists in the cache, downloading and unpacking are skipped.

You can also obtain a model by specifying certain conditions.

d.download_and_unpack(task="asr", corpus="wsj")

If multiple models are found under the conditions, the last model is selected. You can also choose a specific one using the "version" option.

d.download_and_unpack(task="asr", corpus="wsj", version=-1)  # Get the last model
d.download_and_unpack(task="asr", corpus="wsj", version=-2)  # Get the previous model

You can also obtain it from the URL directly.

d.download_and_unpack("https://zenodo.org/record/...")

You can also give a local model file to this API.

d.download_and_unpack("./some/where/model.zip")

In this case, the contents are also expanded into the cache directory. However, the model is identified by its file path, so if you move the model somewhere else and unpack it again, it is treated as a different model and the contents are expanded again in another place.

Query model names

You can view the model names in our Zenodo community, https://zenodo.org/communities/espnet/, or using query(). All information is written in table.csv.

d.query("name")

You can also filter them by specifying certain conditions.

d.query("name", task="asr")
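
query() takes the key to return plus optional filter conditions, so you can, for example, list download URLs instead of names (a small sketch; the "url" key mirrors the --key url option of espnet_model_zoo_query below):

d.query("url", task="asr", corpus="wsj")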

Command line tools

  • espnet_model_zoo_query

    # Query model name
    espnet_model_zoo_query task=asr corpus=wsj
    # Show all model names
    espnet_model_zoo_query
    # Query another key
    espnet_model_zoo_query --key url task=asr corpus=wsj
  • espnet_model_zoo_download

    espnet_model_zoo_download <model_name>  # Print the path of the downloaded file
    espnet_model_zoo_download --unpack true <model_name>   # Print the path of unpacked files
  • espnet_model_zoo_upload

    export ACCESS_TOKEN=<access_token>
    espnet_model_zoo_upload \
        --file <packed_model> \
        --title <title> \
        --description <description> \
        --creator_name <your-git-account>

Use a pretrained model in an ESPnet recipe

# e.g. ASR WSJ task
git clone https://github.com/espnet/espnet
cd espnet
pip install -e .
cd egs2/wsj/asr1
./run.sh --skip_data_prep false --skip_train true --download_model kamo-naoyuki/wsj

Register your model

Huggingface

  1. Upload your model using huggingface API

    1. (If you do not have an HF hub account) Go to https://huggingface.co and create an HF account by clicking the sign-up button.
    2. From the new model link in your profile, create a new model repository. Please include a recipe name (e.g., aidatatang_200zh) and model info (e.g., conformer) in the repository name.
    3. In the espnet recipe, execute the following command:
    ./run.sh --stage 15 --skip_upload_hf false --hf_repo sw005320/aidatatang_200zh_conformer
    
    4. Follow the instructions (e.g., type your HF username/password).
    5. If it works successfully, you will see a success message.
  2. Create a Pull Request to modify table.csv

    The model can be registered in table.csv. Then, the model will be tested in the CI. Note that, unlike the Zenodo case, you don't need to add the URL because the huggingface_id itself specifies the model file, so please fill the URL value with https://huggingface.co/.

    e.g. table.csv

    ...
    aidatatang_200zh,asr,sw005320/aidatatang_200zh_conformer,https://huggingface.co/,16000,zh,,,,,true
    
  3. (Administrator does) Increment the third version number of setup.py, e.g. 0.0.3 -> 0.0.4

  4. (Administrator does) Release new version

Zenodo (Obsolete)

  1. Upload your model to Zenodo

    You need to sign up to Zenodo and create an access token to upload models. You can upload your own model freely using the espnet_model_zoo_upload command, but we normally upload a model using recipes.

  2. Create a Pull Request to modify table.csv

    You need to append your record as the last line.

  3. (Administrator does) Increment the third version number of setup.py, e.g. 0.0.3 -> 0.0.4

  4. (Administrator does) Release new version

espnet_model_zoo's People

Contributors

d-keqi, eml914, emrys365, ftshijt, guoaoo, jnishi, kamo-naoyuki, kan-bayashi, lichenda, neillu23, peter-yh-wu, pyf98, sw005320, yuekaizhang, yushiueda


espnet_model_zoo's Issues

Update PYPI

Hi @kamo-naoyuki, could you please publish the latest version of this repo to PyPI? I'm working on some demonstrations for espnet2 and would like to include some of the latest models, which unfortunately are not included in the pip version. Many thanks!

CSJ's pretrained conformer-based ASR model on zenodo

Hi, I recently ran a successful evaluation for Japanese ASR using the pretrained CSJ transformer-based model. I understand that the model zoo now defaults to Hugging Face for the models, and the published pretrained conformer model is no longer on the list. How do I change the run.sh arguments to download the conformer model from Zenodo?

To clarify, I can download the files using the direct link on Zenodo via Python, but I'd like to know whether there's a new way to download via the recipe's run.sh script or the CLI commands.

Uploading ESPnet2 model to Zenodo

Hello, I am trying to upload my ESPnet2 model to Zenodo but I got the issue pasted below.
The cause is that the status_code is 403 (blocked) here.

What should I do to upload the model ?

Thank you !

Traceback (most recent call last):
  File "/jet/home/berrebbi/miniconda3/envs/espnet/bin/espnet_model_zoo_upload", line 8, in <module>
    sys.exit(main())
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/espnet_model_zoo/zenodo_upload.py", line 297, in main
    upload_espnet_model(**kwargs)
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/espnet_model_zoo/zenodo_upload.py", line 230, in upload_espnet_model
    publish=publish,
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/espnet_model_zoo/zenodo_upload.py", line 141, in upload
    r = zenodo.create_deposition()
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/espnet_model_zoo/zenodo_upload.py", line 48, in create_deposition
    raise RuntimeError(r.json()["message"])
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/simplejson/__init__.py", line 525, in loads
    return _default_decoder.decode(s)
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/jet/home/berrebbi/miniconda3/envs/espnet/lib/python3.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Can't load chime4 model

Hello,

I've been trying to use the chime4 pretrained model for ASR.
Unfortunately, I can't load the model.

Using the model name doesn't work; the model isn't found.
Using the direct link to the Zenodo zip file makes it possible to download the model, but the init method returns the following error:
__init__() got an unexpected keyword argument 'ignore_nan_grad'
Is this because the model was built under an outdated version of ESPnet? And if so, what is the easiest way for me to use this model?

Request for a default data folder for fallback

Downloaded models are stored by default in the package folder.
However, when espnet_model_zoo is installed as a system library from a distribution package, that path is read-only.

Proposed solution:
Check for output folder access using os.access(modelcache, os.W_OK).
If it is not writable, default to a different directory, for example $XDG_CACHE_HOME.
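
A minimal sketch of the proposed fallback (the XDG_CACHE_HOME fallback is this proposal's suggestion, not current espnet_model_zoo behavior):

import os
from pathlib import Path

def resolve_cachedir(modelcache: str) -> Path:
    # Use the package directory if writable; otherwise fall back to
    # $XDG_CACHE_HOME/espnet (or ~/.cache/espnet if unset)
    if os.access(modelcache, os.W_OK):
        return Path(modelcache)
    xdg = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return Path(xdg) / "espnet"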

Steps to reproduce the issue:

  1. Install espnet_model_zoo as system package.
  2. Use ESPnet as regular user, not as root.
  3. Then download a model file:
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader()
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")

With this, the following exception is thrown:

https://zenodo.org/record/4003381/files/asr_train_asr_transformer_raw_char_valid.acc.ave.zip?download=1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 156M/156M [00:09<00:00, 17.0MB/s]
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-3-64ba61e9d85d> in <module>
----> 1 wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")

/usr/lib/python3.9/site-packages/espnet_model_zoo/downloader.py in download_and_unpack(self, name, version, quiet, **kwargs)
    288 
    289         # Download the file to an unique path
--> 290         filename = self.download(url, quiet=quiet)
    291 
    292         # Extract files from archived file

/usr/lib/python3.9/site-packages/espnet_model_zoo/downloader.py in download(self, name, version, quiet, **kwargs)
    243         # Download the model file if not existing
    244         if not (outdir / filename).exists():
--> 245             download(url, outdir / filename, quiet=quiet)
    246 
    247             # Write the url for debugging

/usr/lib/python3.9/site-packages/espnet_model_zoo/downloader.py in download(url, output_path, retry, chunk_size, quiet)
     82                             pbar.update(len(chunk))
     83 
---> 84         Path(output_path).parent.mkdir(parents=True, exist_ok=True)
     85         shutil.move(Path(d) / "tmp", output_path)
     86 

/usr/lib/python3.9/pathlib.py in mkdir(self, mode, parents, exist_ok)
   1310         """
   1311         try:
-> 1312             self._accessor.mkdir(self, mode)
   1313         except FileNotFoundError:
   1314             if not parents or self.parent == self:

PermissionError: [Errno 13] Permission denied: '/usr/lib/python3.9/site-packages/espnet_model_zoo/b2d27107e15dd714684f5767ef10d402'

Redundant ljspeech vits models

Hi all, I noticed that these two model tags link to the same download. Is there a pre-trained ljspeech vits model with space/pauses?

kan-bayashi/ljspeech_tts_train_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave
kan-bayashi/ljspeech_vits

I want to add an original model to "table.csv", but it says I don't have permission and I can't push.

This is my first time submitting an issue. I apologize for any inadequacies.

I want to add my original TTS model to "espnet_model_zoo", and when I run the following command, I get the error message shown below at the push stage.

Executed code
~/espnet_model_zoo$ git push origin develop

Error messages displayed
remote: Permission to espnet/espnet_model_zoo.git denied to c44128. fatal: unable to access 'https://github.com/espnet/espnet_model_zoo.git/': The requested URL returned error: 403

I wanted to create a pull request to have it added to "table.csv", but I am told that I do not have write access to the repository.

I am asking this question because I could not solve the problem on my own.
I would appreciate it if you could tell me how to solve this problem.

Problem with very short and noisy audio during inference when providing xvector embeddings

Hello,

I am pretty new to ESPnet and I am attempting to perform inference using the vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave pretrained model.

Steps Taken:

  • I used speechbrain/spkrec-xvect-voxceleb to create speaker embeddings for specific voices.
  • I provided one of these embeddings to the pretrained TTS model.

The problem is that the generated audios are extremely short (0.125 or 0.013 seconds) and sound noisy.

I am using the Python API. I only provided text and spembs fields when calling the Text2Speech class. I also have successfully used the Python API with other pretrained models that do not require speaker embeddings. I am unsure if there are additional arguments or steps required when using this specific model with speaker embeddings.

If more information is needed, I am happy to provide it. Has anyone experienced a similar issue or can provide guidance on how to resolve this?

Thank you for your assistance,

installing on mac silicon py 3.11.4 - sentencepiece building

I can't install espnet_model_zoo because of this sentencepiece build problem. What version of sentencepiece is this using? I can't figure it out, as setup.py doesn't pin a fixed version.

I have successfully run pip install sentencepiece, so it is installed. All I can think is that this requires a different version.

pip install espnet_model_zoo
Collecting espnet_model_zoo
 Using cached espnet_model_zoo-0.1.7-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: pandas in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (2.0.3)
Requirement already satisfied: requests in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (2.31.0)
Requirement already satisfied: tqdm in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (4.66.1)
Requirement already satisfied: numpy in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (1.24.3)
Collecting espnet (from espnet_model_zoo)
 Using cached espnet-202402-py3-none-any.whl.metadata (68 kB)
Requirement already satisfied: huggingface-hub in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (0.22.2)
Requirement already satisfied: filelock in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet_model_zoo) (3.12.2)
Requirement already satisfied: setuptools>=38.5.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (69.5.1)
Requirement already satisfied: packaging in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (23.1)
Collecting configargparse>=1.2.1 (from espnet->espnet_model_zoo)
 Using cached ConfigArgParse-1.7-py3-none-any.whl.metadata (23 kB)
Collecting typeguard==2.13.3 (from espnet->espnet_model_zoo)
 Using cached typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Collecting humanfriendly (from espnet->espnet_model_zoo)
 Using cached humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Requirement already satisfied: scipy>=1.4.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (1.11.2)
Collecting librosa==0.9.2 (from espnet->espnet_model_zoo)
 Using cached librosa-0.9.2-py3-none-any.whl.metadata (8.2 kB)
Requirement already satisfied: jamo==0.4.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (0.4.1)
Requirement already satisfied: PyYAML>=5.1.2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (6.0.1)
Requirement already satisfied: soundfile>=0.10.2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (0.12.1)
Collecting h5py>=2.10.0 (from espnet->espnet_model_zoo)
 Using cached h5py-3.11.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.5 kB)
Collecting kaldiio>=2.18.0 (from espnet->espnet_model_zoo)
 Using cached kaldiio-2.18.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: torch>=1.11.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (2.1.0)
Collecting torch-complex (from espnet->espnet_model_zoo)
 Using cached torch_complex-0.4.3-py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: nltk>=3.4.5 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (3.8.1)
Collecting numpy (from espnet_model_zoo)
 Using cached numpy-1.23.5-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.3 kB)
Requirement already satisfied: protobuf in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (4.24.1)
Collecting hydra-core (from espnet->espnet_model_zoo)
 Using cached hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting opt-einsum (from espnet->espnet_model_zoo)
 Using cached opt_einsum-3.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting sentencepiece==0.1.97 (from espnet->espnet_model_zoo)
 Using cached sentencepiece-0.1.97.tar.gz (524 kB)
 Preparing metadata (setup.py) ... done
Collecting ctc-segmentation>=1.6.6 (from espnet->espnet_model_zoo)
 Using cached ctc_segmentation-1.7.4-cp311-cp311-macosx_13_0_arm64.whl
Collecting pyworld>=0.3.4 (from espnet->espnet_model_zoo)
 Using cached pyworld-0.3.4-cp311-cp311-macosx_13_0_arm64.whl
Collecting pypinyin<=0.44.0 (from espnet->espnet_model_zoo)
 Using cached pypinyin-0.44.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting espnet-tts-frontend (from espnet->espnet_model_zoo)
 Using cached espnet_tts_frontend-0.0.3-py3-none-any.whl.metadata (3.4 kB)
Collecting ci-sdr (from espnet->espnet_model_zoo)
 Using cached ci_sdr-0.0.2-py3-none-any.whl
Collecting fast-bss-eval==0.1.3 (from espnet->espnet_model_zoo)
 Using cached fast_bss_eval-0.1.3-py3-none-any.whl
Requirement already satisfied: asteroid-filterbanks==0.4.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet->espnet_model_zoo) (0.4.0)
Collecting editdistance (from espnet->espnet_model_zoo)
 Using cached editdistance-0.8.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (3.9 kB)
Collecting importlib-metadata<5.0 (from espnet->espnet_model_zoo)
 Using cached importlib_metadata-4.13.0-py3-none-any.whl.metadata (4.9 kB)
Requirement already satisfied: typing-extensions in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from asteroid-filterbanks==0.4.0->espnet->espnet_model_zoo) (4.9.0)
Requirement already satisfied: audioread>=2.1.9 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (3.0.0)
Requirement already satisfied: scikit-learn>=0.19.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (1.3.0)
Requirement already satisfied: joblib>=0.14 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (1.3.2)
Requirement already satisfied: decorator>=4.0.10 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (4.4.2)
Collecting resampy>=0.2.2 (from librosa==0.9.2->espnet->espnet_model_zoo)
 Using cached resampy-0.4.3-py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: numba>=0.45.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (0.57.0)
Requirement already satisfied: pooch>=1.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from librosa==0.9.2->espnet->espnet_model_zoo) (1.6.0)
Requirement already satisfied: fsspec>=2023.5.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from huggingface-hub->espnet_model_zoo) (2023.6.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from pandas->espnet_model_zoo) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from pandas->espnet_model_zoo) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from pandas->espnet_model_zoo) (2023.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from requests->espnet_model_zoo) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from requests->espnet_model_zoo) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from requests->espnet_model_zoo) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from requests->espnet_model_zoo) (2023.7.22)
Requirement already satisfied: Cython in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from ctc-segmentation>=1.6.6->espnet->espnet_model_zoo) (0.29.30)
Collecting zipp>=0.5 (from importlib-metadata<5.0->espnet->espnet_model_zoo)
 Using cached zipp-3.18.1-py3-none-any.whl.metadata (3.5 kB)
Requirement already satisfied: click in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from nltk>=3.4.5->espnet->espnet_model_zoo) (8.1.7)
Requirement already satisfied: regex>=2021.8.3 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from nltk>=3.4.5->espnet->espnet_model_zoo) (2023.8.8)
Requirement already satisfied: six>=1.5 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas->espnet_model_zoo) (1.16.0)
Requirement already satisfied: cffi>=1.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from soundfile>=0.10.2->espnet->espnet_model_zoo) (1.15.1)
Requirement already satisfied: sympy in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from torch>=1.11.0->espnet->espnet_model_zoo) (1.12)
Requirement already satisfied: networkx in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from torch>=1.11.0->espnet->espnet_model_zoo) (2.8.8)
Requirement already satisfied: jinja2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from torch>=1.11.0->espnet->espnet_model_zoo) (3.1.2)
Requirement already satisfied: einops in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from ci-sdr->espnet->espnet_model_zoo) (0.6.1)
Collecting unidecode>=1.0.22 (from espnet-tts-frontend->espnet->espnet_model_zoo)
 Using cached Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: inflect>=1.0.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from espnet-tts-frontend->espnet->espnet_model_zoo) (5.6.0)
Collecting jaconv (from espnet-tts-frontend->espnet->espnet_model_zoo)
 Using cached jaconv-0.3.4-py3-none-any.whl
Collecting g2p-en (from espnet-tts-frontend->espnet->espnet_model_zoo)
 Using cached g2p_en-2.1.0-py3-none-any.whl.metadata (4.5 kB)
Requirement already satisfied: omegaconf<2.4,>=2.2 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from hydra-core->espnet->espnet_model_zoo) (2.3.0)
Requirement already satisfied: antlr4-python3-runtime==4.9.* in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from hydra-core->espnet->espnet_model_zoo) (4.9.3)
Requirement already satisfied: pycparser in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from cffi>=1.0->soundfile>=0.10.2->espnet->espnet_model_zoo) (2.21)
Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from numba>=0.45.1->librosa==0.9.2->espnet->espnet_model_zoo) (0.40.1)
Requirement already satisfied: appdirs>=1.3.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from pooch>=1.0->librosa==0.9.2->espnet->espnet_model_zoo) (1.4.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from scikit-learn>=0.19.1->librosa==0.9.2->espnet->espnet_model_zoo) (3.2.0)
Collecting distance>=0.1.3 (from g2p-en->espnet-tts-frontend->espnet->espnet_model_zoo)
 Using cached Distance-0.1.3-py3-none-any.whl
Requirement already satisfied: MarkupSafe>=2.0 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from jinja2->torch>=1.11.0->espnet->espnet_model_zoo) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages (from sympy->torch>=1.11.0->espnet->espnet_model_zoo) (1.3.0)
Using cached espnet_model_zoo-0.1.7-py3-none-any.whl (19 kB)
Using cached espnet-202402-py3-none-any.whl (1.8 MB)
Using cached librosa-0.9.2-py3-none-any.whl (214 kB)
Using cached typeguard-2.13.3-py3-none-any.whl (17 kB)
Using cached numpy-1.23.5-cp311-cp311-macosx_11_0_arm64.whl (13.3 MB)
Using cached ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Using cached h5py-3.11.0-cp311-cp311-macosx_11_0_arm64.whl (2.9 MB)
Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Using cached kaldiio-2.18.0-py3-none-any.whl (28 kB)
Using cached pypinyin-0.44.0-py2.py3-none-any.whl (1.3 MB)
Using cached editdistance-0.8.1-cp311-cp311-macosx_11_0_arm64.whl (79 kB)
Using cached espnet_tts_frontend-0.0.3-py3-none-any.whl (11 kB)
Using cached humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Using cached hydra_core-1.3.2-py3-none-any.whl (154 kB)
Using cached opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Using cached torch_complex-0.4.3-py3-none-any.whl (9.1 kB)
Using cached resampy-0.4.3-py3-none-any.whl (3.1 MB)
Using cached Unidecode-1.3.8-py3-none-any.whl (235 kB)
Using cached zipp-3.18.1-py3-none-any.whl (8.2 kB)
Using cached g2p_en-2.1.0-py3-none-any.whl (3.1 MB)
Building wheels for collected packages: sentencepiece
 Building wheel for sentencepiece (setup.py) ... error
 error: subprocess-exited-with-error
 
 × python setup.py bdist_wheel did not run successfully.
 │ exit code: 1
 ╰─> [88 lines of output]
     running bdist_wheel
     running build
     running build_py
     creating build
     creating build/lib.macosx-13.4-arm64-cpython-311
     creating build/lib.macosx-13.4-arm64-cpython-311/sentencepiece
     copying src/sentencepiece/__init__.py -> build/lib.macosx-13.4-arm64-cpython-311/sentencepiece
     copying src/sentencepiece/_version.py -> build/lib.macosx-13.4-arm64-cpython-311/sentencepiece
     copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.macosx-13.4-arm64-cpython-311/sentencepiece
     copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.macosx-13.4-arm64-cpython-311/sentencepiece
     running build_ext
     Package sentencepiece was not found in the pkg-config search path.
     Perhaps you should add the directory containing `sentencepiece.pc'
     to the PKG_CONFIG_PATH environment variable
     No package 'sentencepiece' found
     Cloning into 'sentencepiece'...
     Note: switching to '58f256cf6f01bb86e6fa634a5cc560de5bd1667d'.
     
     You are in 'detached HEAD' state. You can look around, make experimental
     changes and commit them, and you can discard any commits you make in this
     state without impacting any branches by switching back to a branch.
     
     If you want to create a new branch to retain commits you create, you may
     do so (now or later) by using -c with the switch command. Example:
     
       git switch -c <new-branch-name>
     
     Or undo this operation with:
     
       git switch -
     
     Turn off this advice by setting config variable advice.detachedHead to false
     
     ./build_bundled.sh: line 19: cmake: command not found
     ./build_bundled.sh: line 20: nproc: command not found
     ./build_bundled.sh: line 20: cmake: command not found
     Traceback (most recent call last):
       File "<string>", line 2, in <module>
       File "<pip-setuptools-caller>", line 34, in <module>
       File "/private/var/folders/y8/jgtt0gmx5z54mbjd0zx_hb8m0000gn/T/pip-install-2zqoh_nv/sentencepiece_ad33f7f96f2a4cc08ae00c731de66ee4/setup.py", line 136, in <module>
         setup(
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/__init__.py", line 104, in setup
         return distutils.core.setup(**attrs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
         return run_commands(dist)
                ^^^^^^^^^^^^^^^^^^
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
         dist.run_commands()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
         self.run_command(cmd)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/dist.py", line 967, in run_command
         super().run_command(command)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 368, in run
         self.run_command("build")
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
         self.distribution.run_command(command)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/dist.py", line 967, in run_command
         super().run_command(command)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 132, in run
         self.run_command(cmd_name)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
         self.distribution.run_command(command)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/dist.py", line 967, in run_command
         super().run_command(command)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 91, in run
         _build_ext.run(self)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
         _build_ext.build_ext.run(self)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
         self.build_extensions()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
         _build_ext.build_ext.build_extensions(self)
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 479, in build_extensions
         self._build_extensions_serial()
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 505, in _build_extensions_serial
         self.build_extension(ext)
       File "/private/var/folders/y8/jgtt0gmx5z54mbjd0zx_hb8m0000gn/T/pip-install-2zqoh_nv/sentencepiece_ad33f7f96f2a4cc08ae00c731de66ee4/setup.py", line 89, in build_extension
         subprocess.check_call(['./build_bundled.sh', __version__])
       File "/Users/willwade/.pyenv/versions/3.11.4/lib/python3.11/subprocess.py", line 413, in check_call
         raise CalledProcessError(retcode, cmd)
     subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 127.
     [end of output]
 
 note: This error originates from a subprocess, and is likely not a problem with pip.
 ERROR: Failed building wheel for sentencepiece
 Running setup.py clean for sentencepiece
Failed to build sentencepiece
ERROR: Could not build wheels for sentencepiece, which is required to install pyproject.toml-based projects

Missing getitem on huggingface page:

Hi, I tried a model from Huggingface (https://huggingface.co/espnet/simpleoier_librispeech_asr_train_asr_conformer7_wavlm_large_raw_en_bpe5000_sp) and copied the code from the "Use in ESPnet" button.
The example was broken; I had to change

text, *_ = model(speech)

to

text, *_ = model(speech)[0]

According to the readme of espnet_model_zoo, the user has to index the result first (i.e., take [0]).
I don't know, how to fix that. Could you fix the example on huggingface?

Here are the examples from Hugging Face and GitHub showing the mismatch in the expected output of Speech2Text: [screenshots omitted]

FileNotFoundError

When I try to run inference, something goes wrong related to the pretrained model.
Where should I place the ".lock" file?
FileNotFoundError: [Errno 2] No such file or directory: 'D:\anaconda3\envs\style\lib\site-packages\espnet_model_zoo\79ec90b8bd3dbaba9b8d75d7f7e53392\asr_train_asr_conformer5_raw_bpe5000_frontend_confn_fft512_frontend_confhop_length256_scheduler_confwarmup_steps25000_batch_bins140000000_optim_conflr0.0015_initnone_sp_valid.acc.ave.zip.lock'

Error when testing ASR

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text
d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack("Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave"),
    # Decoding parameters are not included in the model file
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)

Error message:

Traceback (most recent call last):
  File "test_asr.py", line 5, in <module>
    speech2text = Speech2Text(
  File "/home/ming-y/anaconda3/envs/espnet/lib/python3.8/site-packages/espnet2/bin/asr_inference.py", line 73, in __init__
    asr_model, asr_train_args = ASRTask.build_model_from_file(
  File "/home/ming-y/anaconda3/envs/espnet/lib/python3.8/site-packages/espnet2/tasks/abs_task.py", line 1834, in build_model_from_file
    model = cls.build_model(args)
  File "/home/ming-y/anaconda3/envs/espnet/lib/python3.8/site-packages/espnet2/tasks/asr.py", line 388, in build_model
    encoder_class = encoder_choices.get_class(args.encoder)
  File "/home/ming-y/anaconda3/envs/espnet/lib/python3.8/site-packages/espnet2/train/class_choices.py", line 75, in get_class
    raise ValueError(
ValueError: --encoder must be one of ('conformer', 'transformer', 'vgg_rnn', 'rnn'): --encoder contextual_block_transformer

I don't know why.

Unpacking a local model

Hi everyone!

I am trying to use espnet_zoo for a locally trained model.

Everything works fine when I use a pre-trained model from Zenodo's table.csv.

However, when I change the model's path to my local trained model (the zip file generated from ESPnet 2) I get the error:

RuntimeError: /home/ubuntu/espnet/tools/venv/lib/python3.7/site-packages/espnet_model_zoo/0abcf46495c3333043c8e3679ea9a844/exp/<my_models_name>/396epoch.pth is a zip archive (did you mean to use torch.jit.load()?)

Environment:

  • Python: 3.7.3 (default, Mar 27 2019, 22:11:17)
  • GCC: [GCC 7.3.0]
  • pytorch: 1.0.1.post2
  • OS: Linux 18.04

In this Python GitHub issue, where they get the same error, they say it is a bug, although it comes from a different command.

Has anyone hit this issue when trying to use a local model with the zoo?

Thank you in advance!

Huggingface downloader / cache, offline mode

It would be nice to have an option for the downloader so that it doesn't pull from Hugging Face via git if the model is already downloaded and cached on disk. Right now, even if the model is cached, a query to Hugging Face is made. Also, if there is no internet or no connection to Hugging Face, there seems to be no timeout and the downloader hangs.
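
One possible workaround at the huggingface_hub level, rather than in espnet_model_zoo itself, is to resolve the model from the local cache only (local_files_only is a real huggingface_hub option; whether espnet_model_zoo exposes it is not confirmed here):

from huggingface_hub import snapshot_download

# Raises an error instead of hanging when the model is not already cached
local_dir = snapshot_download(
    "kamo-naoyuki/mini_an4_asr_train_raw_bpe_valid.acc.best",
    local_files_only=True,
)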

Packing a trained model

Hi everyone,

I am trying to use the ESPnet zoo with a model I had already trained using the regular ESPnet before ESPnet2 came out.

First, I made it work using a downloaded model I chose from table.csv.

I couldn't find documentation on how to use a pretrained model so I contacted @sw005320, who directed me to this section of the asr.sh script.

My plan now is to create a new script that would only run espnet2.bin.pack asr with the following arguments:

--lm_train_config <PATH/TO/lm.yaml>
--lm_file <PATH/TO/rnnlm.model.best>
--asr_train_config <PATH/TO/train.yaml>
--asr_model_file <PATH/TO/model.acc.best>
--option <PATH/TO/train_clean_unigram_${nbpe}.model>  # instead of ${bpemodel}
--outpath <PATH/TO/packed_model.zip>

I got a couple of questions:

  1. I am still missing the feature normalization file, which in the asr.sh script is given as:
    --option ${asr_stats_dir}/train/feats_stats.npz
    However, when I trained the model I only generated the file cmvn.ark. How can I pass it as an argument to the Python script?
  2. There are other options I am not giving:
${lm_exp}/perplexity_test/ppl
${lm_exp}/images
"${asr_exp}"/RESULTS.md
"${asr_exp}"/images

Are those needed?

  3. Would this work?

Thank you very much,
Daniel

Using an original model trained in espnet1

Thank you for always providing good tools and continuous support.
I have a good model trained with espnet1 and would like to use it in the ESPnet model zoo. Can I use a model trained with espnet1 in the model zoo?
I am trying to reproduce the same results with espnet2 but so far have not been successful.

How to get the decoding result scores from Speech2Text

Hi,

Thanks for the work. I am trying to use the pre-trained model, but I don't know how to get the decoding score for the corresponding decoding results.

nbests = speech2text(speech)

text, *_ = nbests[0]

print(text)

The code above only prints text. I would like to get decoding confidence as well.

I checked the Speech2Text class:

for hyp in nbest_hyps:
    assert isinstance(hyp, Hypothesis), type(hyp)

    # remove sos/eos and get results
    token_int = hyp.yseq[1:-1].tolist()

    # remove blank symbol id, which is assumed to be 0
    token_int = list(filter(lambda x: x != 0, token_int))

    # Change integer-ids to tokens
    token = self.converter.ids2tokens(token_int)

    if self.tokenizer is not None:
        text = self.tokenizer.tokens2text(token)
    else:
        text = None
    results.append((text, token, token_int, hyp))

assert check_return_type(results)
return results

From the code above I conjecture that the confidence should be obtained from the "hyp", but it is not clear to me how
to parse "hyp" to get the score.
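
If hyp is the standard espnet2 beam-search Hypothesis, a sketch for reading the scores could look like this (the field names assume espnet.nets.beam_search.Hypothesis; verify against your installed version):

nbests = speech2text(speech)
for text, token, token_int, hyp in nbests:
    # hyp.score is the accumulated log-probability of the hypothesis;
    # hyp.scores holds the per-scorer breakdown (e.g., decoder, ctc, lm)
    print(text, float(hyp.score), {k: float(v) for k, v in hyp.scores.items()})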

'Speech2Text' has no attribute 'from_pretrained'

I installed espnet_model_zoo and torch successfully but get the following error:

speech2text = Speech2Text.from_pretrained(
AttributeError: type object 'Speech2Text' has no attribute 'from_pretrained'

Examples of speech tensor shape for gst models?

I'm trying to figure out how to use models from here in the espnet2 colab demo: https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb

I'm using these values for tag, vocoder_tag, etc:

fs, lang = 24000, "English"
tag = "kan-bayashi/vctk_tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best"
vocoder_tag = "ljspeech_multi_band_melgan.v2"

And trying to get it to run by setting a random value for speech

x = "This is my favorite sentence!"

speech = torch.randn(512, 80) # this is wrong

# synthesis
with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x, speech=speech)
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")

# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))

But I get the error:

RuntimeError: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input (1,.,.) = ...

I feel like this is because the shape of speech is wrong. Are there any examples of how to compute a proper value for it?

I see in the code that it expects shape (Lmax, idim), and I can see that odim is 80 when I look at the value of text2speech.tts.odim.

On the other hand, perhaps I've misunderstood something obvious about how to approach this. :)

Resource Error with a learnt model.

Hi, thanks for developing this very useful, user-friendly library. I tried to run STT with the following code. However, I wasn't able to finish due to a resource error: I found that memory usage increased to 200GB with the htop command.

Environment

  • CentOS-7
  • python 3.7.9
  • pytorch 1.7.1
  • cudatoolkit 10.2
  • espnet 0.9.6
  • espnet-model-zoo 0.0.0a20
  • 4 gpus
  • 80 cpus
  • 300GB RAM
#!/usr/env/python
import soundfile as sf
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
info = d.download_and_unpack(task='asr', corpus='csj')
speech2text = Speech2Text(
    **info)

wav, fs = sf.read("./A02F0116.wav") # csj wav file
text, token, *_ = speech2text(wav)[0]
print(text)

ModuleNotFoundError: No module named 'espnet_model_zoo.downloader'; 'espnet_model_zoo' is not a package

I have created a new conda env and installed espnet_model_zoo.

I ran this command:

from espnet_model_zoo.downloader import ModelDownloader

and got this error:

ERROR:root:espnet_model_zoo is not installed. Please install via pip install -U espnet_model_zoo.
Traceback (most recent call last):
File "", line 1, in
File "/home/knit/espnet/my_scripts/espnet_model_zoo.py", line 8, in
speech2text = Speech2Text.from_pretrained(
File "/home/knit/anaconda3/envs/tf2onnx/lib/python3.8/site-packages/espnet2/bin/asr_inference.py", line 358, in from_pretrained
from espnet_model_zoo.downloader import ModelDownloader
ModuleNotFoundError: No module named 'espnet_model_zoo.downloader'; 'espnet_model_zoo' is not a package

librosa.util.exceptions.ParameterError: Window size mismatch: 512 != 400 when using streaming transformer model

I've successfully trained a streaming transformer for German with 13000 hours of data and end-to-end punctuation. See https://huggingface.co/speechcatcher/speechcatcher_german_espnet_streaming_transformer_13k_train_size_m_raw_de_bpe1024 . Espnet2 with asr.sh is really nice, thanks for that!

Following https://github.com/espnet/notebook/blob/master/espnet2_streaming_asr_demo.ipynb, inference also works on the training machine (Linux with Cuda). When I was trying to do inference with my model on a Mac mini (M1, arm) I was getting this error though:


  File "/Users/me/projects/speechcatcher/speechcatcher.py", line 66, in <module>
    recognize("test1.wav")
  File "/Users/me/projects/speechcatcher/speechcatcher.py", line 52, in recognize
    results = speech2text(speech=speech[i*sim_chunk_length:(i+1)*sim_chunk_length], is_final=False)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/bin/asr_inference_streaming.py", line 310, in __call__
    feats, feats_lengths, self.frontend_states = self.apply_frontend(
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/bin/asr_inference_streaming.py", line 253, in apply_frontend
    feats, feats_lengths = self.asr_model._extract_feats(**batch)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/asr/espnet_model.py", line 407, in _extract_feats
    feats, feats_lengths = self.frontend(speech, speech_lengths)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/asr/frontend/default.py", line 87, in forward
    input_stft, feats_lens = self._compute_stft(input, input_lengths)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/asr/frontend/default.py", line 122, in _compute_stft
    input_stft, feats_lens = self.stft(input, input_lengths)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/espnet2/layers/stft.py", line 139, in forward
    stft = librosa.stft(input[i].numpy(), **stft_kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/librosa/util/decorators.py", line 88, in inner_f
    return f(*args, **kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/librosa/core/spectrum.py", line 204, in stft
    fft_window = get_window(window, win_length, fftbins=True)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/librosa/util/decorators.py", line 88, in inner_f
    return f(*args, **kwargs)
  File "/Users/me/projects/speechcatcher/speechcatcher_env/lib/python3.10/site-packages/librosa/filters.py", line 1191, in get_window
    raise ParameterError(
librosa.util.exceptions.ParameterError: Window size mismatch: 512 != 400

I've used the frontend conf from https://github.com/espnet/espnet/blob/master/egs2/jsut/asr1/conf/tuning/train_asr_conformer.yaml#L3-L8 to generate the features on the fly, assuming I'm getting the standard 25ms / 10ms hop filterbank features with this.

In espnet2/layers/stft.py:

    # NOTE(kamo):
    #   The default behaviour of torch.stft is compatible with librosa.stft
    #   about padding and scaling.
    #   Note that it's different from scipy.signal.stft

    # For the compatibility of ARM devices, which do not support
    # torch.stft() due to the lack of MKL.
    if input.is_cuda or torch.backends.mkl.is_available():
        # use torch.stft ...
    else:
        # use librosa ...

This basically claims that both implementations are compatible. On my training machine it must have used torch.stft, and on the Mac mini librosa. It seems they are not 100% interchangeable implementations of STFT after all. librosa can't do a window size of 400 with an FFT of 512; there is a comment in the code that says:

Raises
    ------
    ParameterError
        If `window` is supplied as a vector of length != `n_fft`,
        or is otherwise mis-specified.

Do I need to train the model with a window size of 512 to make it compatible with librosa?
