
chinese_speech_pretrain

Introduction

We use the 10,000 hours of Chinese data in the WenetSpeech [1] train_l set as unsupervised pre-training data. The data mainly come from YouTube and Podcast and cover a wide range of recording conditions, background noises, and speaking styles; the domains span 10 major scenarios, including audiobooks, commentary, documentaries, TV series, interviews, news, reading, speeches, variety shows, and others. We trained wav2vec 2.0 [3] and HuBERT [4] models with the Fairseq toolkit [2], following the model configurations in [3, 4]; each pre-trained model comes in a BASE and a LARGE size. For the BASE models we used 8 A100 GPUs with gradient accumulation of 8, simulating training on 64 GPUs. For the LARGE models we used 16 A100 GPUs with gradient accumulation of 8, simulating training on 128 GPUs.
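
As a quick sanity check on the simulated GPU counts, here is an illustrative calculation only (in fairseq, gradient accumulation corresponds to the update_freq option; the exact training configs are not reproduced here):

# effective data parallelism = number of GPUs x gradient accumulation steps
base_gpus, base_accum = 8, 8      # BASE models
large_gpus, large_accum = 16, 8   # LARGE models
print(base_gpus * base_accum)     # 64  -> equivalent to 64 GPUs without accumulation
print(large_gpus * large_accum)   # 128 -> equivalent to 128 GPUs without accumulation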

Model Download

For convenience, the fairseq checkpoints are also provided in the huggingface model hub, e.g. chinese-wav2vec2-base-fairseq-ckpt.pt in chinese-wav2vec2-base.

Model | Pre-training data | fairseq checkpoint (Baidu Pan) | huggingface & fairseq checkpoint
chinese-wav2vec2-base | WenetSpeech train_l | chinese-wav2vec2-base (extraction code: d2hq) | chinese-wav2vec2-base
chinese-wav2vec2-large | WenetSpeech train_l | chinese-wav2vec2-large (extraction code: 7p8r) | chinese-wav2vec2-large
chinese-hubert-base | WenetSpeech train_l | chinese-hubert-base (extraction code: xjiy) | chinese-hubert-base
chinese-hubert-large | WenetSpeech train_l | chinese-hubert-large (extraction code: hhn7) | chinese-hubert-large
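
For example, the fairseq-format checkpoint can be fetched programmatically from the huggingface hub; a minimal sketch, assuming the huggingface_hub package and the TencentGameMate/chinese-wav2vec2-base repo (adapt repo_id and filename for the other models):

from huggingface_hub import hf_hub_download

# download the fairseq-format checkpoint of chinese-wav2vec2-base to the local cache
ckpt_path = hf_hub_download(
    repo_id="TencentGameMate/chinese-wav2vec2-base",
    filename="chinese-wav2vec2-base-fairseq-ckpt.pt",
)
print(ckpt_path)  # can be used as model_path in the fairseq example below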

Downstream Task: Chinese ASR

To verify the effectiveness of the pre-trained models on the downstream ASR task, we follow the Conformer [8] recipe of the ESPnet [5,6,7] toolkit: the pre-trained model serves as a feature extractor, a weighted sum of its hidden-layer representations of the input speech is computed, and this representation replaces the conventional FBank features as input to the Conformer ASR model.
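
The weighted sum over hidden layers can be illustrated with a short sketch. This is an illustrative, assumption-laden example using learnable per-layer scalar weights; the actual ESPnet/S3PRL featurizer may differ in details:

import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    # Learn one scalar weight per layer and mix the stacked layer outputs
    # into a single feature sequence that replaces the FBank features.
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, num_frames, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * hidden_states).sum(dim=0)

# usage with dummy layer outputs (e.g. 12 transformer layers + CNN output, dim 768)
layers = torch.randn(13, 1, 100, 768)
features = WeightedLayerSum(num_layers=13)(layers)  # (1, 100, 768)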

Results on the Aishell Dataset

We use the 178-hour Aishell training set as supervised data and compare the character error rate (CER) obtained with FBank features, wav2vec 2.0 BASE/LARGE features, and HuBERT BASE/LARGE features. We additionally report the result of a model trained on the 10,000-hour WenetSpeech train_l set, evaluated on the Aishell test set. Training uses speed perturbation (0.9x, 1.0x, 1.1x) and SpecAugment data augmentation; decoding uses beam search with a Transformer-based language model for rescoring. Results are shown in the table below:

Input feature | Training data | Dev | Test
FBank [6] | 178h | 4.4 | 4.7
FBank [1] | 10000h | / | 3.9
wav2vec 2.0 BASE | 178h | 4.2 | 4.7
wav2vec 2.0 LARGE | 178h | 3.8 | 4.1
HuBERT BASE | 178h | 4.1 | 4.3
HuBERT LARGE | 178h | 3.1 | 3.3

Results on WenetSpeech

We use the 100-hour WenetSpeech train_s set as supervised data and compare the character error rate (CER) obtained with FBank features, wav2vec 2.0 features, and HuBERT features. In addition, we report FBank-based models trained on the 1000-hour train_m set and the 10,000-hour train_l set. Training uses neither speed perturbation nor SpecAugment; decoding uses beam search without language model rescoring. Results are shown in the table below:

Input feature | Training data | Dev | Test_Net | Test_Meeting
FBank | 100h | 17.4 | 22.6 | 32.7
FBank | 1000h | 11.6 | 14.6 | 22.4
FBank | 10000h | 9.7 | 8.9 | 15.9
wav2vec 2.0 BASE | 100h | 13.1 | 16.1 | 25.5
wav2vec 2.0 LARGE | 100h | 11.7 | 13.8 | 25.5
HuBERT BASE | 100h | 12.6 | 14.7 | 21.3
HuBERT LARGE | 100h | 10.0 | 10.2 | 14.5

Model Usage

# This model does not have a tokenizer, as it was pre-trained on audio alone.
# To use this model for speech recognition, a tokenizer should be created and
# the model should be fine-tuned on labeled text data.

# python package
# transformers==4.16.2

# fairseq usage
import torch
import torch.nn.functional as F
import soundfile as sf
from fairseq import checkpoint_utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = ""  # path to the fairseq checkpoint (.pt)
wav_path = ""    # path to an input wav file (16 kHz)

def postprocess(feats, normalize=False):
    if feats.dim() == 2:
        feats = feats.mean(-1)

    assert feats.dim() == 1, feats.dim()

    if normalize:
        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
    return feats

print("loading model(s) from {}".format(model_path))
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    [model_path],
    suffix="",
)
print("loaded model(s) from {}".format(model_path))
print(f"normalize: {saved_cfg.task.normalize}")


model = models[0]
model = model.to(device)
model = model.half()
model.eval()

wav, sr = sf.read(wav_path)
feat = torch.from_numpy(wav).float()
feat = postprocess(feat, normalize=saved_cfg.task.normalize)
feats = feat.view(1, -1)
padding_mask = (
    torch.BoolTensor(feats.shape).fill_(False)
)
inputs = {
    "source": feats.half().to(device),
    "padding_mask": padding_mask.to(device),
}

with torch.no_grad():
    logits = model.extract_features(**inputs)
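
# A hedged sketch for reading the frame-level representations out of the return
# value above: in fairseq, wav2vec 2.0's extract_features returns a dict with
# the features under "x", while HuBERT returns a (features, padding_mask) tuple.
features = logits["x"] if isinstance(logits, dict) else logits[0]
print(features.shape)  # expected (batch, num_frames, hidden_dim)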


# huggingface usage

import torch
import soundfile as sf

from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    Wav2Vec2Model,
)
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = ""  # path or huggingface model id, e.g. TencentGameMate/chinese-wav2vec2-base
wav_path = ""    # path to an input wav file (16 kHz)
mask_prob = 0.0
mask_length = 10

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
model = Wav2Vec2Model.from_pretrained(model_path)

# for pretrain: Wav2Vec2ForPreTraining
# model = Wav2Vec2ForPreTraining.from_pretrained(model_path)

model = model.to(device)
model = model.half()
model.eval()

wav, sr = sf.read(wav_path)
input_values = feature_extractor(wav, sampling_rate=sr, return_tensors="pt").input_values
input_values = input_values.half()
input_values = input_values.to(device)

# for Wav2Vec2ForPreTraining
# batch_size, raw_sequence_length = input_values.shape
# sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
# mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=mask_prob, mask_length=mask_length)
# mask_time_indices = torch.tensor(mask_time_indices, device=input_values.device, dtype=torch.long)

with torch.no_grad():
    outputs = model(input_values)
    last_hidden_state = outputs.last_hidden_state

    # for Wav2Vec2ForPreTraining
    # outputs = model(input_values, mask_time_indices=mask_time_indices, output_hidden_states=True)
    # last_hidden_state = outputs.hidden_states[-1]
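
# To get the per-layer representations used for the weighted-sum features
# described above, the hidden states can be requested explicitly. A minimal
# sketch, assuming the transformers Wav2Vec2Model API used in this example:
with torch.no_grad():
    outputs = model(input_values, output_hidden_states=True)
# hidden_states is a tuple with one tensor per transformer layer (plus the CNN
# output), each of shape (batch, num_frames, hidden_dim)
all_layers = torch.stack(outputs.hidden_states, dim=0)
print(all_layers.shape)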

Everyone is welcome to use the Chinese speech pre-trained models we provide for research, and to join us in exploring the application of speech pre-trained models to Chinese and many related scenarios.





Projects Using Our Models

The following projects use our models:

Project | Project link
GPT-SoVITS | GPT-SoVITS

Citing This Project

@misc{TencentGameMate,
  title = {chinese_speech_pretrain},
  author = {Pengcheng Guo and Shixing Liu},
  year = {2022},
  url = {https://github.com/TencentGameMate/chinese_speech_pretrain},
}

References

[1] Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenhen Zeng, Di Wu, and Zhendong Peng, "WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition," in Proc. ICASSP, 2022.

[2] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proc. NAACL, 2019.

[3] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, 2020.

[4] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.

[5] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. Interspeech, 2018, pp. 2207–2211.

[6] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang, "Recent developments on ESPnet toolkit boosted by Conformer," in Proc. ICASSP, 2021.

[7] Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, and Shinji Watanabe, "An exploration of self-supervised pretrained representations for end-to-end speech recognition," in Proc. ASRU, 2021.

[8] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented Transformer for speech recognition," in Proc. Interspeech, 2020, pp. 5036–5040.


chinese_speech_pretrain's Issues

How to test?

How can I visualize the results of the provided pre-trained models on Chinese speech-to-text? I'm a beginner, please advise.

How to extract audio features

I just tried the example code — the fairseq snippet from the README, with print(logits) added after the extract_features call.

The returned output looks like this: [screenshot]

I'm not sure whether I just need to take this data out and reshape it.

About the vocab

Hi, the downloaded model does not provide a vocab.json, and the vocab_size in config.json is 32, which looks like the size of an English vocabulary.

Is there a more detailed tutorial?

For example, step-by-step examples of running a specific task and the results to expect. That would attract more users and help the project grow faster.

The k-means model used for HuBERT

Hi, thank you for providing the Chinese speech pre-trained models.
Could you also release the k-means model used for HuBERT quantization, as well as details on which features were used for the k-means clustering? Thanks.

Is there any usage?

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import librosa

wav, sr = librosa.load('/workspaces/speech/download/pengz/wav/PENGZ/PENGZPENGZ00005.wav')

processor = Wav2Vec2Processor.from_pretrained("wav2vec2/chinese-wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("wav2vec2/chinese-wav2vec2-base")

inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)

Am I using it correctly? After I ran the above code, it gave me an error about 'Can't load tokenizer'.

Add WavLM

Hello,

This work is excellent. Do you have any plans to add a WavLM model to this repository?

Error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

This happens when calling feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("hubert-large/chinese-hubert-large.pt")

Fine-tuning directly with CTC performs very poorly

I fine-tuned on Common Voice with huggingface's Wav2Vec2ForCTC; the model is loaded as follows:

model = Wav2Vec2ForCTC.from_pretrained(
    model_path,
    ctc_loss_reduction='mean',
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer))
model.freeze_feature_extractor()

The loss drops from 160 straight down to 4 and then stops decreasing. At prediction time, for an input utterance every row of the output logits is identical, i.e. the model only ever outputs the same character, as shown in the screenshots (train loss, eval loss, logits, pred_id).
No matter how I tune it, I cannot get it to work.
However, fine-tuning facebook/wav2vec2-large-xlsr-53 on Common Voice does work. [screenshot]

Reading the k-means parameters

Hi, how should model.mdl and dict.km.txt from the k-means model be loaded and used? Please advise.

Model compression

Even the BASE model is quite demanding on hardware (considering the sequence length). Is there a smaller model available, or have you experimented with compressing the models?

Problem about time shape

I have tried this code to extract phonetic features before the last linear layer:

import torch
import torch.nn.functional as F
import soundfile as sf
import librosa
import pandas as pd
import numpy as np
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    Wav2Vec2Model,
)
mask_prob=0.0
mask_length=10

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = Wav2Vec2ForPreTraining.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = model.to('cuda')
model = model.eval()
newmodel = torch.nn.Sequential(*(list(model.children())[:-4]))
# print(newmodel)
newmodel.to('cuda')
newmodel.eval()

def phonetic_embedding(path):
    wav, sr = librosa.load(path)
    input_values = feature_extractor(wav, return_tensors="pt",sampling_rate = 16000).input_values
    input_values = input_values.to('cuda')
    with torch.no_grad():
        outputs = newmodel(input_values)
        last_hidden_state = outputs.last_hidden_state
    x = last_hidden_state.squeeze(0).detach().cpu().numpy()
    return x

print(phonetic_embedding("/home/tuht/mandarin_acoustic/000100126.WAV").shape)

and the output:

(644, 768)

My audio 000100126.WAV has a duration of 9.36 s at a sample rate of 16000.
As far as I know, the frame shift is 20 ms, so the expected output shape should be 9.36/0.02 - 1 =>

(467, 768)

I don't know why there is this difference.
Can you explain it? @pengchengguo @LiuShixing
Thank you

Fine-tune with my own dataset, WER is 1

I fine-tuned according to this blog: https://huggingface.co/blog/fine-tune-wav2vec2-english, but this problem arises. Here is the trainer state file:
{
"best_metric": null,
"best_model_checkpoint": null,
"epoch": 18.9873417721519,
"global_step": 1500,
"is_hyper_param_search": false,
"is_local_process_zero": true,
"is_world_process_zero": true,
"log_history": [
{
"epoch": 6.33,
"learning_rate": 3.49e-05,
"loss": 14.8414,
"step": 500
},
{
"epoch": 6.33,
"eval_loss": NaN,
"eval_runtime": 36.5218,
"eval_samples_per_second": 49.286,
"eval_steps_per_second": 2.054,
"eval_wer": 0.9999660072064722,
"step": 500
},
{
"epoch": 12.66,
"learning_rate": 8.34e-05,
"loss": 0.0,
"step": 1000
},
{
"epoch": 12.66,
"eval_loss": NaN,
"eval_runtime": 36.5778,
"eval_samples_per_second": 49.21,
"eval_steps_per_second": 2.05,
"eval_wer": 0.9999660072064722,
"step": 1000
},
{
"epoch": 18.99,
"learning_rate": 7.562043795620438e-05,
"loss": 0.0,
"step": 1500
},
{
"epoch": 18.99,
"eval_loss": NaN,
"eval_runtime": 36.5273,
"eval_samples_per_second": 49.278,
"eval_steps_per_second": 2.053,
"eval_wer": 0.9999660072064722,
"step": 1500
}
],
"max_steps": 2370,
"num_train_epochs": 30,
"total_flos": 4.074416087326559e+18,
"trial_name": null,
"trial_params": null
}

What could be the problem?

How to train on WenetSpeech with fairseq?

Could you upload the scripts and config files used to train on WenetSpeech with fairseq? I would like to reproduce the process, but I'm not sure what to set for some parameters, e.g. the number of k-means clusters and the dataset max_tokens.

ASR fine-tuning convergence speed

First of all, thank you very much for open-sourcing the Chinese pre-trained wav2vec 2.0! When fine-tuning for ASR, around which epoch does training start to converge, and how much data did you use?

Failed to load pretrained model from huggingface

Hi,

I am trying to load the pretrained model (Hubert_large) in an ESPnet setup, but I failed.

The steps are listed here:

  1. git clone https://huggingface.co/TencentGameMate/chinese-hubert-large hub/s3prl_cache/chinese-hubert-large
  2. Then modify asr_train_config:
num_workers: 8
batch_type: numel
batch_bins: 4000000
accum_grad: 4
max_epoch: 50
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10
freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: hubert_local
        upstream_model_config: "hub/s3prl_cache/chinese-hubert-large/config.json"
        upstream_ckpt: "hub/s3prl_cache/chinese-hubert-large/chinese-hubert-large-fairseq-ckpt.pt"
    multilayer_feature: true

preencoder: linear
preencoder_conf:
    input_size: 1024  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true

  3. Error message:
Traceback (most recent call last):
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/teinhonglo/espnets/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/teinhonglo/espnets/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/abs_task.py", line 1115, in main_worker
    model = cls.build_model(args=args)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/asr.py", line 415, in build_model
    frontend = frontend_class(**args.frontend_conf)
  File "/home/teinhonglo/espnets/espnet/espnet2/asr/frontend/s3prl.py", line 47, in __init__
    self.upstream, self.featurizer = self._get_upstream(frontend_conf)
  File "/home/teinhonglo/espnets/espnet/espnet2/asr/frontend/s3prl.py", line 68, in _get_upstream
    s3prl_upstream = torch.hub.load(
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/hub.py", line 404, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/hub.py", line 433, in _load_local
    model = entry(*args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/hubert/hubconf.py", line 27, in hubert_local
    return _UpstreamExpert(ckpt, *args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/interfaces.py", line 30, in __call__
    instance = super().__call__(*args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/hubert/expert.py", line 42, in __init__
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 421, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 315, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
# Accounting: time=17 threads=1

Any suggestions?
Thanks in advance.

Hello, how should fine-tuning be done?

Thank you very much for the models. I got the following error when fine-tuning:
Can't load tokenizer for 'TencentGameMate/chinese-wav2vec2-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'TencentGameMate/chinese-wav2vec2-large' is the correct path to a directory containing all relevant tokenizer files.

fairseq and huggingface outputs differ

Following the code in your README, I tested the same audio with both the fairseq and huggingface models. The output dimensions are the same, but the feature values differ. What could be the reason?

The model is missing task_cfg, model_cfg, model_weight, and dictionaries_symbols

Hello, I'm very glad to see this open-source Chinese HuBERT pre-trained model; it is very valuable for learning. Since I lack experience with ASR training, when I tried fine-tuning with the config files in the espnet folder and chinese-hubert-base.pt, I got stuck at stage 10 of asr.sh.

The concrete error is raised in convert.py:
File "/home/oem/workspace/s3prl-master/s3prl/upstream/hubert/convert.py", line 47, in load_converted_model
raise ValueError(
ValueError: /home/oem/workspace/hubert-base/espnet-master/egs2/aishell/asr1/checkpoint_best.pt is not a valid checkpoint since the required key: task_cfg is missing

I checked myself: chinese-hubert-base.pt does not contain the keys task_cfg, model_cfg, model_weight, or dictionaries_symbols, which is what triggers the error in convert.py.
So I wonder whether this is because upstream_model_config was left empty in the train_asr_conformer_hubert-base.yaml config, or whether there is something else I missed. I would really appreciate a reply.

Using it under espnet/egs2/aishell/asr1/ raises TypeError: wav2vec2_custom() missing 1 required positional argument: 'ckpt'. How can this be solved? Thank you!

The conf/train_asr_conformer_w2v2-base.yaml file was modified as follows:

# network architecture

# pretrained model related

freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wav2vec2_local
        upstream_model_config:
        upstream_ckpt: ./chinese-wav2vec2-base.pt
        download_dir: ./hub
    multilayer_feature: true

preencoder: linear
preencoder_conf:
    input_size: 768  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80
Here, ./chinese-wav2vec2-base.pt after upstream_ckpt: is the model you have already trained. Is this modification correct?
Thank you very much!
