ashwinijk / pyhtk
Automatically exported from code.google.com/p/pyhtk
-----------------------------------------------------------------------------------
-- PYHTK                                                                         --
-- A Python package for building GMM-HMM models for speech recognition using HTK --
--                                                                               --
-- Initial code written by: Daniel Gillick ([email protected])                  --
--                                                                               --
-- Code (.py files) licensed under the New BSD License                           --
-- (http://www.opensource.org/licenses/bsd-license.php)                          --
-----------------------------------------------------------------------------------

To create a model:

1. Use make_setup.py or your own script to create a setup file. Look at the
   examples in the Setups directory. Each line consists of an audio file, its
   transcription, and a config file used to process the audio.

2. Create a config file, using the examples in the Configs directory as
   templates. Configs/si84.config is a training config, while
   Configs/nov92.config is a testing config. See model.py to understand in more
   detail what each variable in the config file means.

3. Put a pronunciation dictionary in the Common directory. The CMU dictionary
   is available here:
   https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.7a
   Make sure your config file references this file.

4. Make sure the project dependencies are set up properly:
   - Python 2.5+
   - HTK 3.4
   - SRILM
   - sph2pipe
   You should be able to run these tools from the command line, so make sure
   they're in your path.

5. Run model.py to build a model. For example:
   python model.py Configs/si84.config

6. Test your model. For example:
   python test.py -g 8 -i 6 Configs/si84.config Configs/nov92.config
     > gives WER: 13.55
   python test.py -g 8 -i 6 -m Configs/si84.config Configs/nov92.config
     > gives WER: 12.65

Note that by default, the testing code ignores over 100 of the test utterances
(330 -> 216) because they contain at least one word that wasn't in the training
data. The WSJ corpus ships with a standard 5k dictionary and LM. Using these,
the WER is 8.59 with 8 MLE-trained Gaussians, and 7.81 using MPE.
--------------------------
Copyright (c) 2011, Daniel Gillick
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.
* Neither the name of the International Computer Science Institute nor the
  names of its contributors may be used to endorse or promote products derived
  from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL DANIEL GILLICK BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Add code to modify a CMU dict to work with HTK; once this is done, remove the
dict provided in Common/.
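A conversion along these lines might be sketched as below. It assumes the usual cmudict layout (entries like "HELLO(2)  HH EH0 L OW1"), strips comments and variant markers, removes stress digits, lowercases phones (to match the sil/sp-style phone names used elsewhere in this project), and appends sp to each pronunciation. The function name and output format are assumptions, not the project's actual converter.

```python
import re

def cmu_to_htk(lines):
    """Convert cmudict entries to HTK-style dictionary lines:
    drop comments, strip variant markers like (2) from words,
    lowercase phones, remove stress digits, and append 'sp'."""
    out = []
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue  # skip comments and blank lines
        word, pron = line.split(None, 1)
        word = re.sub(r"\(\d+\)$", "", word)  # HELLO(2) -> HELLO
        phones = [re.sub(r"\d", "", p).lower() for p in pron.split()]
        out.append("%s %s sp" % (word, " ".join(phones)))
    return out

print(cmu_to_htk([";;; a comment",
                  "HELLO  HH AH0 L OW1",
                  "HELLO(2)  HH EH0 L OW1"]))
```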
Original issue reported on code.google.com by [email protected]
on 6 Oct 2011 at 7:04
Need better information in the logs:
- use alignment to get amount of time in each phone
- diagnostics for disc. training
- config options to keep more intermediate files
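The first item above (per-phone time from alignments) could be sketched as follows, assuming HVite-style MLF alignment output where each label line gives start and end times in 100 ns units followed by the phone; the field layout and example values are assumptions:

```python
from collections import defaultdict

def phone_durations(mlf_lines):
    """Sum the time (in seconds) spent in each phone, given lines from an
    HTK alignment MLF: '<start> <end> <phone> ...' with times in 100ns units."""
    totals = defaultdict(float)
    for line in mlf_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            start, end = int(fields[0]), int(fields[1])
            totals[fields[2]] += (end - start) / 1e7  # 100ns units -> seconds
    return dict(totals)

# Hypothetical alignment fragment
lines = [
    '"*/sample.rec"',
    "0 1500000 sil -153.2",
    "1500000 2400000 ah -88.1",
    "2400000 3000000 sil -51.0",
    ".",
]
print(phone_durations(lines))
```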
Original issue reported on code.google.com by [email protected]
on 11 Oct 2011 at 5:38
Here is what I think we should use for the phone marking using HVite. The '-n
32 -m' is the main difference from the HDecode.mod command.
Note that unlike HDecode, HVite needs the training dictionary that has sp/sil
on each entry, but you also need to add entries for <s> and </s>, which I did
in the dict on the command line:
HVite -A -D -V -T 9 -n 32 -m -w -q tvaldm -z lat -X lat \
  -C /n/shokuji/da/swegmann/work/lats/hvite/config.hvite \
  -H exp/si84/0/Xword/HMM-8-6/MMF \
  -t 200.0 -s 15.0 -p 0.0 \
  -S /n/shokuji/da/swegmann/work/lats/hvite/mfc.list \
  -l /n/shokuji/da/swegmann/work/lats/hvite/den \
  -L exp/si84/0/MMI/Denom/Lat_prune/404/ \
  /n/shokuji/da/swegmann/work/lats/hvite/dict \
  exp/si84/0/tied.list
I don't think that this is super expensive to run either, since most of the
compute is in the actual recognition, which HDecode still does. Maybe you could
just link in the word/pruned lats and then just run the phone marking? It
should be super fast...
Original issue reported on code.google.com by [email protected]
on 11 Oct 2011 at 1:05
I believe that standard AM training as it's done at Cambridge within the last 6
years goes something like this:
Flat start monos
Mix-up monos
Mixdown monos
Estimate untied xwrd
Tie xwrd
Mix-up xwrd
Re-estimate xwrd using two-model re-estimation with the mixdown monos & last xwrd
Re-align to get interword sil & alternate prons
Re-estimate xwrd using two-model re-estimation with the mixdown monos and last xwrd
(repeat a bunch of times)
To train the monophones you use the same recipe as before to do the flat start.
From the flat start you mix up to 16 components using the same schedule as
before (2 4 6 8 10 12 14 16), and just as before you ask for twice as many sil
comps as non-sil, e.g., at 8:
MU 16 {(sil,sp).state[2-4].mix}
MU 8 {*.state[2-4].mix}
Amusingly this works, because first you mix the sils up to 16 and next you go
over all the models asking for 8. Since sil already has 16, the second command
doesn't do anything to it.
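The mix-up schedule above could be scripted; a minimal sketch (the helper and .hed file names are assumptions) that emits the HHEd commands for one step, asking for twice as many sil components:

```python
def mixup_hed(n):
    """Return the HHEd mix-up commands for a schedule step with n non-sil
    components; sil/sp get twice as many, as described above."""
    return ("MU %d {(sil,sp).state[2-4].mix}\n"
            "MU %d {*.state[2-4].mix}\n") % (2 * n, n)

# one hypothetical mixup.<n>.hed per step of the schedule
for n in [2, 4, 6, 8, 10, 12, 14, 16]:
    with open("mixup.%d.hed" % n, "w") as f:
        f.write(mixup_hed(n))
```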
To train the xwrd models we need to mix the non-sil models down to one comp and
sil down to 12.
The hed command is (mixdown.hed)
MD 12 {(sil,sp).state[2-4].mix}
MD 1 {(ah, ax, all the rest of the non-sil phones).state[2-4].mix}
Maybe it's better to do it line by line, with one command per non-sil phone:
MD 12 {(sil,sp).state[2-4].mix}
MD 1 {ah.state[2-4].mix}
MD 1 {ax.state[2-4].mix}
...
Either way, you run
HHEd -A -D -T 1 -H mono_mix/MMF -w mono_seed/MMF mixdown.hed mono.list
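Generating mixdown.hed from the monophone list could be sketched like this (the function name is an assumption, and mono.list is assumed to hold one phone per line):

```python
def mixdown_hed(phones, sil_comps=12):
    """Build the HHEd mix-down commands: sil/sp down to sil_comps
    components, every other phone down to a single component."""
    lines = ["MD %d {(sil,sp).state[2-4].mix}" % sil_comps]
    for p in phones:
        if p not in ("sil", "sp"):
            lines.append("MD 1 {%s.state[2-4].mix}" % p)
    return "\n".join(lines) + "\n"

# phones would normally be read from mono.list
print(mixdown_hed(["ah", "ax", "sil", "sp"]))
```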
Now that you have these models you proceed exactly as before, with triphone
cloning, untied estimation, tying, etc.
Another good thing to do, after state tying when you have better models, is to
re-estimate the variance floor. So in the first mixup you re-estimate the
variance using this mixup.2.hed:
LS final-unimodal-tied-triphone-dir/stats
FA 0.1
MU 4 {(sil,sp).state[2-4].mix}
MU 2 {*.state[2-4].mix}
Just to be clear: you only do this once, in the first mixup for the tied
triphones. This will make the varfloor bigger than the initial estimate using
the total variance of the data, i.e. conservative, which is a good thing.
Re-estimating the xwrd models from seed monos and extant xwrd models:
You use the mixdown monos in the same way as before (no need to redo them), and
clone as before (maybe recycle these). But in the first pass of estimation for
the untied triphone models you use two-model re-estimation. The HERest command
looks exactly the same as it did before, but you add:
ALIGNMODELMMF = previous 16 comp xwrd/MMF
ALIGNHMMLIST = previous xwrd model list
This uses the previous big models to get the BW alignment. You only do it in
this one pass. After you have this you proceed as before, including
re-estimating the var floor in the first mix-up:
LS final-unimodal-tied-triphone-dir/stats
FA 0.1
MU 4 {(sil,sp).state[2-4].mix}
MU 2 {*.state[2-4].mix}
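The two extra config lines for the two-model pass could be emitted by the build code; a small sketch, where the function name and the HMM-16-6 directory are hypothetical placeholders for wherever the previous 16-comp xword models live:

```python
def write_two_model_config(path, align_mmf, align_list):
    """Write the HERest config additions for two-model re-estimation:
    the previous big xword models supply the Baum-Welch alignment."""
    with open(path, "w") as f:
        f.write("ALIGNMODELMMF = %s\n" % align_mmf)
        f.write("ALIGNHMMLIST = %s\n" % align_list)

# hypothetical locations of the previous 16-comp xword models
write_two_model_config("twomodel.config",
                       "exp/si84/0/Xword/HMM-16-6/MMF",
                       "exp/si84/0/tied.list")
```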
I'm pretty certain that Cambridge only does one pass of BW on the untied xword
models. I'm not certain what I told you before...
Original issue reported on code.google.com by [email protected]
on 11 Oct 2011 at 1:04
- Make config setup more modular (especially discriminative training)
Original issue reported on code.google.com by [email protected]
on 6 Oct 2011 at 7:02
Section 3.7 of the HTK book has a pretty good explanation of how to do this.
The seed models for the process are fully trained mixture models.
First you need to create a "base class"; this says which components are
associated with which transforms. To start, we'll use just one transformation,
so this file is trivial. Call it "global" and maybe put it in a directory
called misc/trans. The only thing dangerous about this is the 32-comp max. We
should probably set this at build time to be the max ncomps in the mixture
models:
~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-32]}
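Since the component cap should track the largest mixture, the base class file could be generated rather than hard-coded. A minimal sketch (the function name is an assumption; the file contents mirror the "global" base class above):

```python
def write_global_baseclass(path, max_ncomps):
    """Write a one-class HTK base class covering all components, with the
    mix range sized to the largest mixture in the model set."""
    with open(path, "w") as f:
        f.write('~b "global"\n')
        f.write("<MMFIDMASK> *\n")
        f.write("<PARAMETERS> MIXBASE\n")
        f.write("<NUMCLASSES> 1\n")
        f.write("<CLASS> 1 {*.state[2-4].mix[1-%d]}\n" % max_ncomps)

write_global_baseclass("global", 32)
```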
To estimate a diagonalizing transform (semi-tied covariance in HTK parlance)
you use HERest almost exactly as you would for one pass of BW. To the usual BW
config file add the lines
HADAPT:TRANSKIND = SEMIT
HADAPT:USEBIAS = FALSE
HADAPT:BASECLASS = global
HADAPT:SPLITTHRESH = 0.0
HADAPT:MAXXFORMITER = 100
HADAPT:MAXSEMITIEDITER = 20
HADAPT:TRACE = 61
HMODEL:TRACE = 512
HADAPT:SEMITIED2INPUTXFORM = TRUE
In both the scatter and gather steps add
-J misc/trans -u stw
to the usual HERest commands (I'm assuming we don't normally set -u; if we do,
change it to this). The resulting MMF has the transformation saved within it.
After this, you run 6 passes of BW using the usual commands and config. I'm not
certain what the directory structure should look like. Maybe have a Diag dir
parallel to the Mono, Xword, etc. dirs, and inside that the diag estimation
takes place in HMM-16-0 and the BW passes in HMM-16-1, ...? Recognition with
HDecode works exactly as before too.
Original issue reported on code.google.com by [email protected]
on 11 Oct 2011 at 1:03
Currently, we have a number of specific configs in the Common/ directory; these
can be created on the fly according to options specified in the master config.
Original issue reported on code.google.com by [email protected]
on 6 Oct 2011 at 7:03