

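nl8590687 commented on May 22, 2024

This is for interleaving the two datasets, THCHS30 and ST-CMDS, evenly into the dataset list according to their relative sizes. "bili" means "ratio" (比例, bǐlì). The two datasets are in a 1:10 ratio: THCHS30 has 10,000 utterances and ST-CMDS has 100,000.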
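A minimal sketch of interleaving by ratio, assuming the goal is to follow each THCHS30 item with the next 10 ST-CMDS items. The helper `interleave_by_ratio` and the item names are hypothetical illustrations, not ASRT's actual code.

```python
def interleave_by_ratio(small, large, ratio):
    """Hypothetical helper: after each item of `small`, insert the next `ratio` items of `large`."""
    merged = []
    j = 0
    for item in small:
        merged.append(item)
        merged.extend(large[j:j + ratio])
        j += ratio
    merged.extend(large[j:])  # append any leftover items from `large`
    return merged

# Toy stand-ins for the two datasets, keeping the 1:10 ratio.
thchs30 = [f"thchs30_{i}" for i in range(3)]
st_cmds = [f"st_cmds_{i}" for i in range(30)]
merged = interleave_by_ratio(thchs30, st_cmds, ratio=10)
print(len(merged))  # 33
```

Spreading the smaller set evenly this way means any contiguous slice of the list (for example a batch) contains roughly the same mix of the two sources.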

from asrt_speechrecognition.

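BNAadministrator3 commented on May 22, 2024

Thanks.
While reviewing the code this afternoon I ran into another question, about the data_generator method in readdata24.py,
specifically this statement:
input_length.append(data_input.shape[0] // 8 + data_input.shape[0] % 8)
Why is the effective length of the input divided by 8? And why are the floor-division result and the remainder added together?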


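nl8590687 commented on May 22, 2024

It is divided by 8 because the model uses pooling: three pooling layers, and 2^3 = 8.
When the length is not evenly divisible by 8, the division rounds down, so the remainder modulo 8 is added back.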
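The arithmetic can be checked directly. Assuming each of the three pooling layers has pool size 2 and rounds down (as "valid"-style pooling does), T frames shrink to T // 8, because floor-dividing by 2 three times equals a single floor division by 8. The sketch compares that with the expression quoted in the question; `pooled_length` and `asrt_input_length` are hypothetical names.

```python
def pooled_length(t, num_pools=3, pool_size=2):
    """Frames left after `num_pools` pooling layers that each floor-divide by `pool_size`."""
    for _ in range(num_pools):
        t //= pool_size
    return t

def asrt_input_length(t):
    """The expression quoted above from data_generator in readdata24.py."""
    return t // 8 + t % 8

for t in (800, 803, 1599):
    print(t, pooled_length(t), asrt_input_length(t))
# 800 100 100
# 803 100 103
# 1599 199 206
```

Since CTC's input_length should not exceed the number of time steps the network actually outputs (1600 // 8 = 200 when features are padded to 1600 frames), values like 206 for long inputs may be worth checking for.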


BNAadministrator3 commented on May 22, 2024

In SpeechModel25.py, the TestModel method calls self.Predict, but I get an error saying the method does not exist. How can I fix this?


BNAadministrator3 commented on May 22, 2024

Also, I found a Predict method in SpeechModel24.py. Can it be moved over to SpeechModel25 and used there? The Predict method in SpeechModel24 contains this line:
base_pred = base_pred[:,:,:]
What does it mean?


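nl8590687 commented on May 22, 2024

TestModel in SpeechModel25.py does call self.Predict(), and the method exists, so it should not report that the method is missing. Have you modified the code?

base_pred = base_pred[:,:,:] is there to make it easy to crop the audio features. Cropping is not used at the moment, which is why it is written this way; when it is needed, it would be written as: base_pred = base_pred[:,2:,:]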
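To see what the two slices do, here is a small NumPy sketch; the same slice syntax also works on TensorFlow tensors. It assumes the prediction has shape (batch, time_steps, num_classes), and the tiny array is hypothetical. `[:,:,:]` keeps everything (in NumPy it returns a view of the same data), while `[:,2:,:]` drops the first two time steps.

```python
import numpy as np

# Hypothetical prediction: 1 utterance, 6 time steps, 4 output classes.
base_pred = np.arange(24).reshape(1, 6, 4)

same = base_pred[:, :, :]     # no cropping: identical shape and contents
cropped = base_pred[:, 2:, :]  # drop the first 2 time steps on axis 1

print(same.shape)     # (1, 6, 4)
print(cropped.shape)  # (1, 4, 4)
```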


BNAadministrator3 commented on May 22, 2024

Understood. Also, I don't understand these two lines in the Predict method:
r1=K.get_value(r[0][0])
r1=r1[0]

What do they mean? Shouldn't self.Predict return a pinyin sequence? Or does it return only a single pinyin?


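nl8590687 commented on May 22, 2024

Of course it returns a pinyin sequence. It is just that the value TensorFlow returns after the computation has that nested structure; print it and you will see. In r[0][0], the last [0] selects the decoded sequence with the highest probability; the later the index, the lower the probability of that sequence.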
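According to the documentation of `tf.keras.backend.ctc_decode`, it returns a tuple whose first element is a list of decoded sequences: a single entry when `greedy=True`, otherwise the `top_paths` most probable paths. Each entry is a dense (batch, time_steps) array in which positions past the end of the decoded sequence hold -1. So `r[0]` is that list, `r[0][0]` is the best path, `K.get_value` turns it into a NumPy array, and `r1[0]` is the first utterance in the batch. The pure-NumPy greedy decoder below is a hypothetical stand-in that mimics this nesting so the indexing can be run without TensorFlow. It assumes the blank is the last class, as in Keras, and its second return value is a simplified score, not TensorFlow's exact log-probability.

```python
import numpy as np

def greedy_ctc_decode(y_pred):
    """Hypothetical greedy CTC decoder mimicking the nesting of ctc_decode(greedy=True).

    y_pred: (batch, time_steps, num_classes) per-frame probabilities; blank = last class.
    Returns ([dense], scores), where dense is (batch, time_steps) padded with -1.
    """
    batch, steps, num_classes = y_pred.shape
    blank = num_classes - 1
    best = y_pred.argmax(axis=-1)                       # most probable class per frame
    scores = np.log(y_pred.max(axis=-1)).sum(axis=1)    # simplified per-utterance score
    dense = np.full((batch, steps), -1, dtype=np.int64)
    for b in range(batch):
        decoded, prev = [], None
        for c in best[b]:
            if c != prev and c != blank:  # merge repeated labels, then drop blanks
                decoded.append(c)
            prev = c
        dense[b, :len(decoded)] = decoded
    return [dense], scores

# Hypothetical output: 1 utterance, 5 frames, 4 classes (class 3 is the blank).
y_pred = np.array([[[0.7, 0.1, 0.1, 0.1],
                    [0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7],
                    [0.1, 0.8, 0.05, 0.05],
                    [0.1, 0.1, 0.1, 0.7]]])
r = greedy_ctc_decode(y_pred)
r1 = r[0][0]  # r[0]: list of decoded paths; [0]: the most probable one
r1 = r1[0]    # first utterance in the batch
print([int(c) for c in r1 if c != -1])  # [0, 1]
```

The resulting label indices would then be mapped to pinyin symbols, which is why the function ultimately yields a sequence rather than a single pinyin.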


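BNAadministrator3 commented on May 22, 2024

Question: how many epochs, and how much training time, does it take to reach the 80% speech-to-phoneme accuracy mentioned in the documentation?
So far I have trained on the training set for about 10 epochs. By epoch I mean a standard epoch that iterates over all the data in the training set (110,000 utterances); with batch_size = 16, each epoch is a little over 6,000 iterations. How many more epochs do I need?
I'm training on a single P100 GPU, and one epoch takes about 2 hours.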


nl8590687 commented on May 22, 2024

10 epochs is probably not enough; somewhere around 50-100 should be about right.
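A rough estimate of what 50-100 epochs means in wall-clock time, using only the numbers quoted in the question and assuming the time per epoch stays constant:

```python
import math

num_utterances = 110_000  # training-set size quoted in the question
batch_size = 16
hours_per_epoch = 2.0     # single P100, as reported in the question

steps_per_epoch = math.ceil(num_utterances / batch_size)

def training_hours(epochs):
    """Estimated wall-clock hours, assuming constant time per epoch."""
    return epochs * hours_per_epoch

print(steps_per_epoch)                          # 6875
print(training_hours(50), training_hours(100))  # 100.0 200.0
```

The 6,875 steps match the "a little over 6,000 iterations" in the question, and 50-100 epochs works out to roughly 100-200 GPU hours on that setup.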


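BNAadministrator3 commented on May 22, 2024

The test output is empty. When I call the TestModel method in SpeechModel25.py, the output sequence is always empty (an empty list: []), no matter how many steps I train. In fact, only at the very beginning, before any training, does calling TestModel directly give a non-empty sequence; once training starts, the predicted output is empty. What could be the cause?
Also, your model definition in CreateModel uses two model objects: model and model_data, and training is done on model. When model is trained, are the weights in model_data updated as well? My understanding is that model is used to compute the loss and model_data is used for prediction, but this doesn't seem to be the usual way of writing Keras code.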


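nl8590687 commented on May 22, 2024

Please check the loss value and the test output printed during training. If the loss is very large and the error rate measured during training is very high, then it is normal that nothing is recognized. Please make sure you have configured everything correctly following the steps I described in the README and the wiki.
The two models do update their weights together, because both models are saved at the same time. Everything I found online about using CTC with Keras is written this way; when I'm not using CTC, I don't write it like this either.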
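For background, in Keras' functional API two `Model` objects built from the same layer instances share those layers' weights, so training one also changes what the other predicts. The plain-Python sketch below only illustrates that sharing idea without TensorFlow: `Layer` and `Model` here are hypothetical stand-ins, not the Keras classes.

```python
class Layer:
    """Hypothetical stand-in for a layer that owns its weight."""
    def __init__(self, w):
        self.w = w

    def __call__(self, x):
        return self.w * x

class Model:
    """Hypothetical stand-in for a model: a chain of callables, some of them shared layers."""
    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

shared = Layer(w=1.0)
model_data = Model([shared])                # used for prediction
model = Model([shared, lambda y: y ** 2])   # same layer plus an extra "loss" head

print(model_data.predict(2.0))  # 2.0
shared.w -= 0.5                 # pretend a training step on `model` updated the layer
print(model_data.predict(2.0))  # 1.0, because model_data sees the same layer object
```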


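BNAadministrator3 commented on May 22, 2024

The loss keeps going down, but the error rate never does, because the prediction output is always an empty list: []. I have confirmed that the configuration is correct; otherwise the loss would not be going down either. What should I check next?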


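nl8590687 commented on May 22, 2024

What has the loss dropped to? It needs to get down to around 70 before there is any output, and to around 20 before the accuracy is reasonably high.
How long have you been training? My own training runs have sometimes taken over 200 hours, and that was on a GPU.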


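BNAadministrator3 commented on May 22, 2024

  1. Could you provide your trained model for me to try? Judging from test_mspeech.py, it should be the file "m24\speech_model24_e_0_step_42500_model".
  2. Please see the screenshot.
    What I mean is: no matter what the loss has dropped to, in principle every prediction should produce some output, and the only question should be whether it is right or wrong. Right now, every time I call TestModel, the predicted output for all four sequences is empty, as shown in the screenshot, so the error rate is 100% every time.
  3. Before training, calling TestModel gives non-empty predictions; after training, the predictions are empty instead. If possible, please share your trained model weights, since each training run takes me a long time.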


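nl8590687 commented on May 22, 2024

I do have trained models; please download them from the releases section of my GitHub repository.

releases

You may not be very familiar with CTC: when the confidence is not high enough, CTC outputs nothing at all (an empty sequence). Looking at your screenshot, your batch_size is definitely too small. Please try setting batch_size to 16 or 32.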
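Why "not confident enough" becomes an empty list: CTC networks, especially early in training, often put most of the probability on the blank label at every frame. Under greedy (best-path) decoding, blank frames are dropped, so if blank wins everywhere the result is []. The sketch assumes greedy decoding and that the blank is the last class (as in Keras' `ctc_decode`); the per-frame probabilities are hypothetical.

```python
import numpy as np

def greedy_ctc(probs, blank):
    """Greedy CTC decoding of one utterance: argmax per frame, merge repeats, drop blanks."""
    out, prev = [], None
    for c in probs.argmax(axis=-1):
        if c != prev and c != blank:
            out.append(int(c))
        prev = c
    return out

blank = 3  # assumption: 4 classes, blank is the last one

# Class 1 gains some mass at frame 2, but blank still wins every frame.
low_conf = np.array([[0.1, 0.1, 0.1, 0.7],
                     [0.1, 0.35, 0.1, 0.45],
                     [0.1, 0.1, 0.1, 0.7]])
# Same shape, but class 1 now beats blank at frame 2.
confident = np.array([[0.1, 0.1, 0.1, 0.7],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])

print(greedy_ctc(low_conf, blank))   # []
print(greedy_ctc(confident, blank))  # [1]
```

This is consistent with the symptom in the question: an untrained network produces essentially random (non-empty) labels, while a partially trained one may first learn to emit mostly blanks before individual labels become confident enough to win.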


BNAadministrator3 commented on May 22, 2024

During code review I have another question:
The GetFrequencyFeature3 function in file_wav.py under general_function contains this line:
range0_end=int(len(wavsignal[0])/fs*1000-time_window)//10
I understand this computes the number of iterations of the FFT loop, but why divide by 10?


nl8590687 commented on May 22, 2024

Because the time window is shifted by 10 ms at each step. You can look up material online on how spectrograms are generated.
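Unpacking the expression: `len(wavsignal[0]) / fs * 1000` is the signal duration in milliseconds, subtracting `time_window` (in ms) leaves the span over which the window's start can move, and dividing by the 10 ms shift gives the number of window positions. The sketch re-computes it; the 16 kHz sample rate and the 25 ms window are assumptions for illustration, and `frame_count` is a hypothetical name.

```python
def frame_count(num_samples, fs=16000, time_window=25, hop_ms=10):
    """Re-computes the quoted range0_end: (duration in ms - window in ms) // hop in ms."""
    duration_ms = num_samples / fs * 1000
    return int(duration_ms - time_window) // hop_ms

fs = 16000
print(frame_count(16000))  # 1.0 s of audio -> 97
print(frame_count(51200))  # 3.2 s of audio -> 317

# The same window and hop expressed in samples at the assumed 16 kHz:
print(fs * 25 // 1000, fs * 10 // 1000)  # 400 160
```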


BNAadministrator3 commented on May 22, 2024

Also, when the data features are padded, every sample is padded to the same length of 1600. How was this value of 1600 chosen, and can I change the length?


nl8590687 commented on May 22, 2024

Yes, you can.
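Some context for the number: with a 10 ms shift, 1600 frames cover about 16 seconds of audio, and 1600 is divisible by 8, so three 2x poolings leave exactly 200 output steps. Below is a minimal zero-padding sketch. The helper `pad_features`, the 200-dimensional features, and raising an error on overly long input are assumptions for illustration, not ASRT's code. If you change 1600, the model's input shape (and anything derived from the pooled length) presumably has to change to match.

```python
import numpy as np

def pad_features(feat, max_len=1600):
    """Zero-pad a (frames, dims) feature matrix to max_len frames; reject longer inputs."""
    frames, dims = feat.shape
    if frames > max_len:
        raise ValueError(f"{frames} frames exceeds max_len={max_len}")
    out = np.zeros((max_len, dims), dtype=feat.dtype)
    out[:frames] = feat
    return out

feat = np.ones((97, 200), dtype=np.float32)  # hypothetical: ~1 s of audio, 200 feature bins
padded = pad_features(feat)
print(padded.shape)     # (1600, 200)
print(1600 * 10 / 1000)  # 16.0 seconds of audio at a 10 ms shift
print(1600 // 8)        # 200 steps after three 2x poolings
```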


BNAadministrator3 commented on May 22, 2024

Why does the data_generator function in readdata return a 'labels' variable? As far as I can see, this variable is all zeros. Can it be removed?


nl8590687 commented on May 22, 2024

Yes. After all, there are many ways to write the code, and it doesn't have to be written this way.
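For context, a common Keras CTC pattern (for example the image-OCR example that ships with Keras) computes the CTC loss inside the model and compiles with a loss that simply returns the model's output and ignores `y_true`. The all-zero array is then only a placeholder target for that output. The sketch below shows the shape of such a generator; the input names, the 64-label maximum, and the helpers are hypothetical and do not reproduce ASRT's readdata code.

```python
import numpy as np

def ctc_passthrough_loss(y_true, y_pred):
    """The model's 'ctc' output already holds the per-sample CTC loss, so y_true is ignored."""
    return y_pred

def data_generator_sketch(batch_size=2, max_len=1600, feat_dim=200, label_len=64):
    """Hypothetical generator yielding (inputs, dummy targets) for a CTC-loss model."""
    while True:
        inputs = {
            "the_input": np.zeros((batch_size, max_len, feat_dim, 1), dtype=np.float32),
            "the_labels": np.zeros((batch_size, label_len), dtype=np.float32),
            "input_length": np.full((batch_size, 1), max_len // 8, dtype=np.int64),
            "label_length": np.full((batch_size, 1), label_len, dtype=np.int64),
        }
        dummy_targets = {"ctc": np.zeros(batch_size, dtype=np.float32)}  # the all-zero 'labels'
        yield inputs, dummy_targets

x, y = next(data_generator_sketch())
print(y["ctc"])  # [0. 0.]
```

Since the loss never reads these zeros, their only role is to satisfy a training call that expects a target per output. Removing them therefore goes hand in hand with how the loss is attached to the model.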

