
chinese_speech_pretrain

Introduction

We use the 10,000 hours of Chinese data in the WenetSpeech [1] train_l set as unsupervised pre-training data. The data mainly come from YouTube and Podcast and cover a wide range of recording conditions, background noises, and speaking styles; the domains span 10 major scenarios, including audiobooks, commentary, documentaries, TV series, interviews, news, reading, speeches, variety shows, and others. We trained wav2vec 2.0 [3] and HuBERT [4] models with the Fairseq toolkit [2], following the model configurations in [3, 4]; each pre-trained model comes in a BASE and a LARGE size. For the BASE models we used 8 A100 GPUs with gradient accumulation of 8, simulating training on 64 GPUs. For the LARGE models we used 16 A100 GPUs with gradient accumulation of 8, simulating training on 128 GPUs.
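
As a quick sanity check on the simulated GPU counts, here is an illustrative calculation only (in fairseq, gradient accumulation corresponds to the update_freq option; the exact training configs are not reproduced here):

# effective data parallelism = number of GPUs x gradient accumulation steps
base_gpus, base_accum = 8, 8      # BASE models
large_gpus, large_accum = 16, 8   # LARGE models
print(base_gpus * base_accum)     # 64  -> equivalent to 64 GPUs without accumulation
print(large_gpus * large_accum)   # 128 -> equivalent to 128 GPUs without accumulation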

Model Download

For convenience, the fairseq checkpoints are also provided in the huggingface model hub, e.g. chinese-wav2vec2-base-fairseq-ckpt.pt in chinese-wav2vec2-base.

Model | Pre-training data | fairseq checkpoint (Baidu Pan) | huggingface & fairseq checkpoint
chinese-wav2vec2-base | WenetSpeech train_l | chinese-wav2vec2-base (extraction code: d2hq) | chinese-wav2vec2-base
chinese-wav2vec2-large | WenetSpeech train_l | chinese-wav2vec2-large (extraction code: 7p8r) | chinese-wav2vec2-large
chinese-hubert-base | WenetSpeech train_l | chinese-hubert-base (extraction code: xjiy) | chinese-hubert-base
chinese-hubert-large | WenetSpeech train_l | chinese-hubert-large (extraction code: hhn7) | chinese-hubert-large
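
For example, the fairseq-format checkpoint can be fetched programmatically from the huggingface hub; a minimal sketch, assuming the huggingface_hub package and the TencentGameMate/chinese-wav2vec2-base repo (adapt repo_id and filename for the other models):

from huggingface_hub import hf_hub_download

# download the fairseq-format checkpoint of chinese-wav2vec2-base to the local cache
ckpt_path = hf_hub_download(
    repo_id="TencentGameMate/chinese-wav2vec2-base",
    filename="chinese-wav2vec2-base-fairseq-ckpt.pt",
)
print(ckpt_path)  # can be used as model_path in the fairseq example below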

Downstream Task: Chinese ASR

To verify the effectiveness of the pre-trained models on the downstream ASR task, we follow the Conformer [8] recipe of the ESPnet [5,6,7] toolkit: the pre-trained model serves as a feature extractor, a weighted sum of its hidden-layer representations of the input speech is computed, and this representation replaces the conventional FBank features as input to the Conformer ASR model.
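
The weighted sum over hidden layers can be illustrated with a short sketch. This is an illustrative, assumption-laden example using learnable per-layer scalar weights; the actual ESPnet/S3PRL featurizer may differ in details:

import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    # Learn one scalar weight per layer and mix the stacked layer outputs
    # into a single feature sequence that replaces the FBank features.
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, num_frames, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * hidden_states).sum(dim=0)

# usage with dummy layer outputs (e.g. 12 transformer layers + CNN output, dim 768)
layers = torch.randn(13, 1, 100, 768)
features = WeightedLayerSum(num_layers=13)(layers)  # (1, 100, 768)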

Results on the Aishell Dataset

We use the 178-hour Aishell training set as supervised data and compare the character error rate (CER) obtained with FBank features, wav2vec 2.0 BASE/LARGE features, and HuBERT BASE/LARGE features. We additionally report the result of a model trained on the 10,000-hour WenetSpeech train_l set, evaluated on the Aishell test set. Training uses speed perturbation (0.9x, 1.0x, 1.1x) and SpecAugment data augmentation; decoding uses beam search with a Transformer-based language model for rescoring. Results are shown in the table below:

Input feature | Training data | Dev | Test
FBank [6] | 178h | 4.4 | 4.7
FBank [1] | 10000h | / | 3.9
wav2vec 2.0 BASE | 178h | 4.2 | 4.7
wav2vec 2.0 LARGE | 178h | 3.8 | 4.1
HuBERT BASE | 178h | 4.1 | 4.3
HuBERT LARGE | 178h | 3.1 | 3.3

Results on WenetSpeech

We use the 100-hour WenetSpeech train_s set as supervised data and compare the character error rate (CER) obtained with FBank features, wav2vec 2.0 features, and HuBERT features. In addition, we report FBank-based models trained on the 1000-hour train_m set and the 10,000-hour train_l set. Training uses neither speed perturbation nor SpecAugment; decoding uses beam search without language model rescoring. Results are shown in the table below:

Input feature | Training data | Dev | Test_Net | Test_Meeting
FBank | 100h | 17.4 | 22.6 | 32.7
FBank | 1000h | 11.6 | 14.6 | 22.4
FBank | 10000h | 9.7 | 8.9 | 15.9
wav2vec 2.0 BASE | 100h | 13.1 | 16.1 | 25.5
wav2vec 2.0 LARGE | 100h | 11.7 | 13.8 | 25.5
HuBERT BASE | 100h | 12.6 | 14.7 | 21.3
HuBERT LARGE | 100h | 10.0 | 10.2 | 14.5

Model Usage

# This model does not have a tokenizer, as it was pre-trained on audio alone.
# To use this model for speech recognition, a tokenizer should be created and
# the model should be fine-tuned on labeled text data.

# python package
# transformers==4.16.2

# fairseq usage
import torch
import torch.nn.functional as F
import soundfile as sf
from fairseq import checkpoint_utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = ""  # path to the fairseq checkpoint (.pt)
wav_path = ""    # path to an input wav file (16 kHz)

def postprocess(feats, normalize=False):
    if feats.dim() == 2:
        feats = feats.mean(-1)

    assert feats.dim() == 1, feats.dim()

    if normalize:
        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
    return feats

print("loading model(s) from {}".format(model_path))
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    [model_path],
    suffix="",
)
print("loaded model(s) from {}".format(model_path))
print(f"normalize: {saved_cfg.task.normalize}")


model = models[0]
model = model.to(device)
model = model.half()
model.eval()

wav, sr = sf.read(wav_path)
feat = torch.from_numpy(wav).float()
feat = postprocess(feat, normalize=saved_cfg.task.normalize)
feats = feat.view(1, -1)
padding_mask = (
    torch.BoolTensor(feats.shape).fill_(False)
)
inputs = {
    "source": feats.half().to(device),
    "padding_mask": padding_mask.to(device),
}

with torch.no_grad():
    logits = model.extract_features(**inputs)
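
# A hedged sketch for reading the frame-level representations out of the return
# value above: in fairseq, wav2vec 2.0's extract_features returns a dict with
# the features under "x", while HuBERT returns a (features, padding_mask) tuple.
features = logits["x"] if isinstance(logits, dict) else logits[0]
print(features.shape)  # expected (batch, num_frames, hidden_dim)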


# huggingface usage

import torch
import soundfile as sf

from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    Wav2Vec2Model,
)
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = ""  # path or huggingface model id, e.g. TencentGameMate/chinese-wav2vec2-base
wav_path = ""    # path to an input wav file (16 kHz)
mask_prob = 0.0
mask_length = 10

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
model = Wav2Vec2Model.from_pretrained(model_path)

# for pretrain: Wav2Vec2ForPreTraining
# model = Wav2Vec2ForPreTraining.from_pretrained(model_path)

model = model.to(device)
model = model.half()
model.eval()

wav, sr = sf.read(wav_path)
input_values = feature_extractor(wav, sampling_rate=sr, return_tensors="pt").input_values
input_values = input_values.half()
input_values = input_values.to(device)

# for Wav2Vec2ForPreTraining
# batch_size, raw_sequence_length = input_values.shape
# sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
# mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=mask_prob, mask_length=mask_length)
# mask_time_indices = torch.tensor(mask_time_indices, device=input_values.device, dtype=torch.long)

with torch.no_grad():
    outputs = model(input_values)
    last_hidden_state = outputs.last_hidden_state

    # for Wav2Vec2ForPreTraining
    # outputs = model(input_values, mask_time_indices=mask_time_indices, output_hidden_states=True)
    # last_hidden_state = outputs.hidden_states[-1]
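
# To get the per-layer representations used for the weighted-sum features
# described above, the hidden states can be requested explicitly. A minimal
# sketch, assuming the transformers Wav2Vec2Model API used in this example:
with torch.no_grad():
    outputs = model(input_values, output_hidden_states=True)
# hidden_states is a tuple with one tensor per transformer layer (plus the CNN
# output), each of shape (batch, num_frames, hidden_dim)
all_layers = torch.stack(outputs.hidden_states, dim=0)
print(all_layers.shape)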

Everyone is welcome to use the Chinese speech pre-trained models we provide for research, and to join us in exploring the application of speech pre-trained models to Chinese and many related scenarios.





Projects Using Our Models

The following projects use our models:

Project | Project link
GPT-SoVITS | GPT-SoVITS

Citing This Project

@misc{TencentGameMate,
  title = {chinese_speech_pretrain},
  author = {Pengcheng Guo and Shixing Liu},
  year = {2022},
  url = {https://github.com/TencentGameMate/chinese_speech_pretrain},
}

References

[1] Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenhen Zeng, Di Wu, and Zhendong Peng, "WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition," in Proc. ICASSP, 2022.

[2] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proc. NAACL, 2019.

[3] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, 2020.

[4] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.

[5] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. Interspeech, 2018, pp. 2207–2211.

[6] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang, "Recent developments on ESPnet toolkit boosted by Conformer," in Proc. ICASSP, 2021.

[7] Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, and Shinji Watanabe, "An exploration of self-supervised pretrained representations for end-to-end speech recognition," in Proc. ASRU, 2021.

[8] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented Transformer for speech recognition," in Proc. Interspeech, 2020, pp. 5036–5040.


chinese_speech_pretrain's Issues

How to test?

How can I visualize the results of the provided pre-trained models on Chinese speech-to-text? I'm a beginner, please advise.

How to extract audio features

I just tried the example code — the fairseq snippet from the README, with print(logits) added after the extract_features call.

The returned output looks like this: [screenshot]

I'm not sure whether I just need to take this data out and reshape it.

About the vocab

Hi, the downloaded model does not provide a vocab.json, and the vocab_size in config.json is 32, which looks like the size of an English vocabulary.

Is there a more detailed tutorial?

For example, step-by-step examples of running a specific task and the results to expect. That would attract more users and help the project grow faster.

The k-means model used for HuBERT

Hi, thank you for providing the Chinese speech pre-trained models.
Could you also release the k-means model used for HuBERT quantization, as well as details on which features were used for the k-means clustering? Thanks.

Is there any usage?

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import librosa

wav, sr = librosa.load('/workspaces/speech/download/pengz/wav/PENGZ/PENGZPENGZ00005.wav')

processor = Wav2Vec2Processor.from_pretrained("wav2vec2/chinese-wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("wav2vec2/chinese-wav2vec2-base")

inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)

Am I using it correctly? After I ran the above code, it gave me an error about 'Can't load tokenizer'.

Add WavLM

Hello,

This work is excellent. Do you have any plans to add a WavLM model to this repository?

Error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

This happens when calling feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("hubert-large/chinese-hubert-large.pt")

Fine-tuning directly with CTC performs very poorly

I fine-tuned on Common Voice with huggingface's Wav2Vec2ForCTC; the model is loaded as follows:

model = Wav2Vec2ForCTC.from_pretrained(
    model_path,
    ctc_loss_reduction='mean',
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer))
model.freeze_feature_extractor()

The loss drops from 160 straight down to 4 and then stops decreasing. At prediction time, for an input utterance every row of the output logits is identical, i.e. the model only ever outputs the same character, as shown in the screenshots (train loss, eval loss, logits, pred_id).
No matter how I tune it, I cannot get it to work.
However, fine-tuning facebook/wav2vec2-large-xlsr-53 on Common Voice does work. [screenshot]

Reading the k-means parameters

Hi, how should model.mdl and dict.km.txt from the k-means model be loaded and used? Please advise.

Model compression

Even the BASE model is quite demanding on hardware (considering the sequence length). Is there a smaller model available, or have you experimented with compressing the models?

Problem about time shape

I have tried this code to extract phonetic features before the last linear layer:

import torch
import torch.nn.functional as F
import soundfile as sf
import librosa
import pandas as pd
import numpy as np
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    Wav2Vec2Model,
)
mask_prob=0.0
mask_length=10

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = Wav2Vec2ForPreTraining.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = model.to('cuda')
model = model.eval()
newmodel = torch.nn.Sequential(*(list(model.children())[:-4]))
# print(newmodel)
newmodel.to('cuda')
newmodel.eval()

def phonetic_embedding(path):
    wav, sr = librosa.load(path)
    input_values = feature_extractor(wav, return_tensors="pt",sampling_rate = 16000).input_values
    input_values = input_values.to('cuda')
    with torch.no_grad():
        outputs = newmodel(input_values)
        last_hidden_state = outputs.last_hidden_state
    x = last_hidden_state.squeeze(0).detach().cpu().numpy()
    return x

print(phonetic_embedding("/home/tuht/mandarin_acoustic/000100126.WAV").shape)

and the output:

(644, 768)

My audio 000100126.WAV has a duration of 9.36 s at a sample rate of 16000.
As far as I know, the frame shift is 20 ms, so the expected output shape should be 9.36/0.02 - 1 =>

(467, 768)

I don't know why there is this difference.
Can you explain it? @pengchengguo @LiuShixing
Thank you

Fine-tune with my own dataset, WER is 1

I fine-tuned according to this blog: https://huggingface.co/blog/fine-tune-wav2vec2-english, but this problem arises. Here is the trainer state file:
{
"best_metric": null,
"best_model_checkpoint": null,
"epoch": 18.9873417721519,
"global_step": 1500,
"is_hyper_param_search": false,
"is_local_process_zero": true,
"is_world_process_zero": true,
"log_history": [
{
"epoch": 6.33,
"learning_rate": 3.49e-05,
"loss": 14.8414,
"step": 500
},
{
"epoch": 6.33,
"eval_loss": NaN,
"eval_runtime": 36.5218,
"eval_samples_per_second": 49.286,
"eval_steps_per_second": 2.054,
"eval_wer": 0.9999660072064722,
"step": 500
},
{
"epoch": 12.66,
"learning_rate": 8.34e-05,
"loss": 0.0,
"step": 1000
},
{
"epoch": 12.66,
"eval_loss": NaN,
"eval_runtime": 36.5778,
"eval_samples_per_second": 49.21,
"eval_steps_per_second": 2.05,
"eval_wer": 0.9999660072064722,
"step": 1000
},
{
"epoch": 18.99,
"learning_rate": 7.562043795620438e-05,
"loss": 0.0,
"step": 1500
},
{
"epoch": 18.99,
"eval_loss": NaN,
"eval_runtime": 36.5273,
"eval_samples_per_second": 49.278,
"eval_steps_per_second": 2.053,
"eval_wer": 0.9999660072064722,
"step": 1500
}
],
"max_steps": 2370,
"num_train_epochs": 30,
"total_flos": 4.074416087326559e+18,
"trial_name": null,
"trial_params": null
}

What could be the problem?

How to train on WenetSpeech with fairseq?

Could you upload the scripts and config files used to train on WenetSpeech with fairseq? I would like to reproduce the process, but I'm not sure what to set for some parameters, e.g. the number of k-means clusters and the dataset max_tokens.

ASR fine-tuning convergence speed

First of all, thank you very much for open-sourcing the Chinese pre-trained wav2vec 2.0! When fine-tuning for ASR, around which epoch does training start to converge, and how much data did you use?

Failed to load pretrained model from huggingface

Hi,

I am trying to load the pretrained model (Hubert_large) in an ESPnet setup, but I failed.

The steps are listed here:

  1. git clone https://huggingface.co/TencentGameMate/chinese-hubert-large hub/s3prl_cache/chinese-hubert-large
  2. Then modify asr_train_config:
num_workers: 8
batch_type: numel
batch_bins: 4000000
accum_grad: 4
max_epoch: 50
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10
freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: hubert_local
        upstream_model_config: "hub/s3prl_cache/chinese-hubert-large/config.json"
        upstream_ckpt: "hub/s3prl_cache/chinese-hubert-large/chinese-hubert-large-fairseq-ckpt.pt"
    multilayer_feature: true

preencoder: linear
preencoder_conf:
    input_size: 1024  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true

  3. Error message:
Traceback (most recent call last):
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/teinhonglo/espnets/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/teinhonglo/espnets/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/abs_task.py", line 1115, in main_worker
    model = cls.build_model(args=args)
  File "/home/teinhonglo/espnets/espnet/espnet2/tasks/asr.py", line 415, in build_model
    frontend = frontend_class(**args.frontend_conf)
  File "/home/teinhonglo/espnets/espnet/espnet2/asr/frontend/s3prl.py", line 47, in __init__
    self.upstream, self.featurizer = self._get_upstream(frontend_conf)
  File "/home/teinhonglo/espnets/espnet/espnet2/asr/frontend/s3prl.py", line 68, in _get_upstream
    s3prl_upstream = torch.hub.load(
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/hub.py", line 404, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/hub.py", line 433, in _load_local
    model = entry(*args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/hubert/hubconf.py", line 27, in hubert_local
    return _UpstreamExpert(ckpt, *args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/interfaces.py", line 30, in __call__
    instance = super().__call__(*args, **kwargs)
  File "/home/teinhonglo/espnets/espnet/tools/s3prl/s3prl/upstream/hubert/expert.py", line 42, in __init__
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 421, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 315, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
# Accounting: time=17 threads=1

Any suggestions?
Thanks in advance.

Hello, how should fine-tuning be done?

Thank you very much for the models. I got the following error when fine-tuning:
Can't load tokenizer for 'TencentGameMate/chinese-wav2vec2-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'TencentGameMate/chinese-wav2vec2-large' is the correct path to a directory containing all relevant tokenizer files.

fairseq and huggingface outputs differ

Following the code in your README, I tested the same audio with both the fairseq and huggingface models. The output dimensions are the same, but the feature values differ. What could be the reason?

The model is missing task_cfg, model_cfg, model_weight, and dictionaries_symbols

Hello, I'm very glad to see this open-source Chinese HuBERT pre-trained model; it is very valuable for learning. Since I lack experience with ASR training, when I tried fine-tuning with the config files in the espnet folder and chinese-hubert-base.pt, I got stuck at stage 10 of asr.sh.

The concrete error is raised in convert.py:
File "/home/oem/workspace/s3prl-master/s3prl/upstream/hubert/convert.py", line 47, in load_converted_model
raise ValueError(
ValueError: /home/oem/workspace/hubert-base/espnet-master/egs2/aishell/asr1/checkpoint_best.pt is not a valid checkpoint since the required key: task_cfg is missing

I checked myself: chinese-hubert-base.pt does not contain the keys task_cfg, model_cfg, model_weight, or dictionaries_symbols, which is what triggers the error in convert.py.
So I wonder whether this is because upstream_model_config was left empty in the train_asr_conformer_hubert-base.yaml config, or whether there is something else I missed. I would really appreciate a reply.

Using it under espnet/egs2/aishell/asr1/ raises TypeError: wav2vec2_custom() missing 1 required positional argument: 'ckpt'. How can this be solved? Thank you!

The conf/train_asr_conformer_w2v2-base.yaml file was modified as follows:

# network architecture

# pretrained model related

freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wav2vec2_local
        upstream_model_config:
        upstream_ckpt: ./chinese-wav2vec2-base.pt
        download_dir: ./hub
    multilayer_feature: true

preencoder: linear
preencoder_conf:
    input_size: 768  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80
Here, ./chinese-wav2vec2-base.pt after upstream_ckpt: is the model you have already trained. Is this modification correct?
Thank you very much!
