pyhtk's Introduction

-----------------------------------------------------------------------------------
-- PYHTK                                                                         --
-- A Python package for building GMM-HMM models for speech recognition using HTK --
--                                                                               --
-- Initial code written by: Daniel Gillick ([email protected])                  --
--                                                                               --
-- Code (.py files) licensed under the New BSD License                           --
--    (http://www.opensource.org/licenses/bsd-license.php)                       --
-----------------------------------------------------------------------------------

To create a model:

1. Use make_setup.py or your own script to create a setup file. Look at the
examples in the Setups directory. Each line consists of an audio file, its
transcription, and a config file used to process the audio.
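As a sketch of what make_setup.py produces (the space-separated layout is an assumption — check the examples in Setups/ for the exact field format):

```python
import os


def write_setup(entries, path):
    """Write a pyhtk-style setup file: one line per utterance, listing the
    audio file, its transcription, and the HTK config used to process the
    audio.  The space-separated layout is an assumption -- check the
    examples in Setups/ for the exact format."""
    with open(path, "w") as f:
        for audio, transcription, config in entries:
            f.write("%s %s %s\n" % (audio, transcription, config))
```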

2. Create a config file, using the examples in the Configs directory as
templates. Configs/si84.config is a training config, while Configs/nov92.config
is a testing config. See model.py to understand in more detail what each
variable in the config file means.

3. Put a pronunciation dictionary in the Common directory. The CMU dictionary is
available here:
https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.7a
Make sure your config file references this file.

4. Make sure the project dependencies are set up properly.
  - Python 2.5+
  - HTK 3.4
  - SRILM
  - sph2pipe

You should be able to run these tools from the command line, so make sure they're
in your path.
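A quick way to verify this (a Python 3 sketch; the binary names listed are the usual HTK/SRILM executables and may differ on your install):

```python
import shutil


def missing_tools(tools=("HERest", "HHEd", "HVite", "HDecode",
                         "ngram-count", "sph2pipe")):
    """Return the required command-line tools that are not on PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: %s" % ", ".join(missing))
```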

5. Run model.py to build a model. For example:

python model.py Configs/si84.config

6. Test your model. For example:

python test.py -g 8 -i 6 Configs/si84.config Configs/nov92.config
> gives WER: 13.55

python test.py -g 8 -i 6 -m Configs/si84.config Configs/nov92.config
> gives WER: 12.65

Note that by default, the testing code ignores over 100 of the 330 test
utterances (leaving 216) because they contain at least one word that wasn't in
the training data.
The WSJ corpus ships with a standard 5k dictionary and LM. Using these, the
WER is 8.59 with 8 MLE-trained Gaussians, and 7.81 using MPE.





--------------------------
Copyright (c) 2011, Daniel Gillick
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of the International Computer Science Institute nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL DANIEL GILLICK BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

pyhtk's Issues

fix cmu dict

Add code to modify a cmu dict to work with HTK; once this is done, remove the 
dict provided in Common/
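A sketch of the usual cleanup (not pyhtk's own code; exactly which of these steps HTK needs depends on your phone set): drop the `;;;` comment lines, strip the `(2)`/`(3)` variant suffixes (an HTK dictionary just repeats the word for alternate pronunciations), and remove the 0/1/2 stress digits from the phones.

```python
import re


def cmu_to_htk(lines):
    """Convert CMU-dict entries to HTK-dictionary lines: skip comments,
    strip (N) variant suffixes from words, and drop stress digits."""
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, phones = line.split(None, 1)
        word = re.sub(r"\(\d+\)$", "", word)   # ABANDON(2) -> ABANDON
        phones = re.sub(r"\d", "", phones)     # AH0 -> AH, AE1 -> AE
        out.append("%s  %s" % (word, " ".join(phones.split())))
    return out
```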

Original issue reported on code.google.com by [email protected] on 6 Oct 2011 at 7:04

Add better logging

Need better information in the logs:
- use alignment to get amount of time in each phone
- diagnostics for disc. training
- config options to keep more intermediate files

Original issue reported on code.google.com by [email protected] on 11 Oct 2011 at 5:38

Include option to use HVite for lattices in disc. training

Here is what I think we should use for the phone marking with HVite.  The '-n 
32 -m' is the main difference from the HDecode.mod command.

Note that unlike HDecode, HVite needs the training dictionary that has sp/sil 
on each entry, but you also need to add entries for <s> and </s>, which I did 
in the dict used in the command line below.

HVite -A -D -V -T 9 -n 32 -m -w -q tvaldm -z lat -X lat -C 
/n/shokuji/da/swegmann/work/lats/hvite/config.hvite -H 
exp/si84/0/Xword/HMM-8-6/MMF -t 200.0 -s 15.0 -p 0.0 -S 
/n/shokuji/da/swegmann/work/lats/hvite/mfc.list -l 
/n/shokuji/da/swegmann/work/lats/hvite/den -L 
exp/si84/0/MMI/Denom/Lat_prune/404/ /n/shokuji/da/swegmann/work/lats/hvite/dict 
exp/si84/0/tied.list

I don't think this is super expensive to run either, since most of the 
compute is in the actual recognition, which HDecode still does.  Maybe you 
could just link in the word/pruned lats and then run only the phone marking?  
It should be super fast...

Original issue reported on code.google.com by [email protected] on 11 Oct 2011 at 1:05

Add improved training recipe

I believe that standard AM training as it's done at Cambridge within the last 6 
years goes something like this:

Flat start monos
Mix-up monos
Mixdown monos
Estimate untied xwrd
Tie xwrd
Mix-up xwrd
Re-estimate xwrd using two-model re-estimation with the mixdown monos & last xwrd
Re-align to get interword sil & alternate prons
Re-estimate xwrd using two-model re-estimation with the mixdown monos & last xwrd
(repeat a bunch of times)

To train the monophones you use the same recipe as before to do the flat start.

From the flat start you mix up to 16 components using the same schedule as 
before (2 4 6 8 10 12 14 16), and just as before you ask for twice as many sil 
comps as non-sil, e.g., at 8:

MU 16 {(sil,sp).state[2-4].mix}
MU 8 {*.state[2-4].mix}

Amusingly this works: first you mix the sils up to 16, and then you go over 
all the models asking for 8.  Since sil already has 16, the second command 
doesn't change it.

To train the xwrd models we need to mix the non-sil models down to one comp and 
sil down to 12.

The HHEd edit script (mixdown.hed) is:

MD 12 {(sil,sp).state[2-4].mix}
MD 1 {(ah, ax, all the rest of the non-sil phones).state[2-4].mix}

Maybe it's better to do it line by line, one command per non-sil phone:

MD 12 {(sil,sp).state[2-4].mix}
MD 1 {ah.state[2-4].mix}
MD 1 {ax.state[2-4].mix}
...

Either way, you run

HHEd -A -D -T 1 -H mono_mix/MMF  -w mono_seed/MMF mixdown.hed mono.list
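The line-by-line variant of mixdown.hed could be generated from the monophone list with a short script like this (a sketch; the phone names would come from your own mono.list):

```python
def write_mixdown_hed(phones, path, sil_comps=12, sil_phones=("sil", "sp")):
    """Write a mixdown.hed as described above: mix sil/sp down to
    sil_comps components and every non-silence phone down to 1, with one
    MD command per phone."""
    with open(path, "w") as f:
        f.write("MD %d {(%s).state[2-4].mix}\n" % (sil_comps,
                                                   ",".join(sil_phones)))
        for p in phones:
            if p not in sil_phones:
                f.write("MD 1 {%s.state[2-4].mix}\n" % p)
```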


Now that you have these models you proceed exactly as before, with triphone 
cloning, untied estimation, tying, etc.

Another good thing to do, once state tying is done and you have better models, 
is to re-estimate the variance floor.  So in the first mixup you re-estimate 
the variance floor using this mixup.2.hed:

LS final-unimodal-tied-triphone-dir/stats
FA 0.1
MU 4 {(sil,sp).state[2-4].mix}
MU 2 {*.state[2-4].mix}

Just to be clear: you only do this once, in the first mixup for the tied 
triphones.  This will make the varfloor bigger than the initial estimate using 
the total variance of the data, i.e. conservative, which is a good thing.

Re-estimating the xwrd models from seed monos and existing xwrd models:
You use the mixdown monos in the same way as before (no need to redo them) and 
clone as before (maybe recycle these).  But in the first pass of estimation 
for the untied triphone models you use two-model re-estimation.  The HERest 
command looks exactly the same as it did before, but you add:
ALIGNMODELMMF = previous 16 comp xwrd/MMF
ALIGNHMMLIST  = previous xwrd model list

This uses the previous big models to get the BW alignment.  You only do it in 
this one pass.  After that you proceed as before, including re-estimating the 
variance floor in the first mixup:
LS final-unimodal-tied-triphone-dir/stats
FA 0.1
MU 4 {(sil,sp).state[2-4].mix}
MU 2 {*.state[2-4].mix}

I'm pretty certain that Cambridge only does one pass of BW on the untied xwrd 
models.  I'm not certain what I told you before...

Original issue reported on code.google.com by [email protected] on 11 Oct 2011 at 1:04

Config files

- Make config setup more modular (especially discriminative training)


Original issue reported on code.google.com by [email protected] on 6 Oct 2011 at 7:02

Add diagonalization transform code

Section 3.7 of the HTK book has a pretty good explanation of how to do this.  
The seed models for the process are fully trained mixture models.

First you need to create a "base class", which says which components are 
associated with which transforms.  To start, we'll use just one 
transformation, so this file is trivial; call it "global" and maybe put it in 
a directory called misc/trans.  The only dangerous thing about this is the 
32-comp max.  Should probably set this at build time to be the max ncomps in 
the mixture models:

~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-32]}
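Generating the file at build time with the component cap taken from the models might look like this (a sketch; max_comps would be computed from the actual mixture models rather than hard-coded):

```python
def write_base_class(path, max_comps=32):
    """Write the one-class base-class file shown above, with the mix
    upper bound parameterized instead of hard-coded to 32."""
    with open(path, "w") as f:
        f.write('~b "global"\n')
        f.write("<MMFIDMASK> *\n")
        f.write("<PARAMETERS> MIXBASE\n")
        f.write("<NUMCLASSES> 1\n")
        f.write("<CLASS> 1 {*.state[2-4].mix[1-%d]}\n" % max_comps)
```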

To estimate a diagonalizing transform (semi-tied covariance in HTK parlance) 
you use HERest almost exactly as you would for one pass of BW.  To the usual BW 
config file add the lines

HADAPT:TRANSKIND = SEMIT
HADAPT:USEBIAS = FALSE
HADAPT:BASECLASS = global
HADAPT:SPLITTHRESH = 0.0
HADAPT:MAXXFORMITER = 100
HADAPT:MAXSEMITIEDITER = 20
HADAPT:TRACE = 61
HMODEL:TRACE = 512
HADAPT:SEMITIED2INPUTXFORM = TRUE

In both the scatter and gather steps add

-J misc/trans -u stw

to the usual HERest commands (I'm assuming we don't normally set -u; if we do, 
change it to this).  The resulting MMF has the transformation saved within it.  
After this, you run 6 passes of BW using the usual commands and config.  Not 
certain what the directory structure should look like.  Maybe have a Diag dir 
parallel to the Mono, Xword, etc. dirs, with the diag estimation taking place 
in HMM-16-0 and the BW passes in HMM-16-1, ...?  Recognition with HDecode 
works exactly as before too.

Original issue reported on code.google.com by [email protected] on 11 Oct 2011 at 1:03

On-the-fly HTK configs

Currently, we have a number of specific configs in the Common/ directory; 
these could be created on the fly according to options specified in the master 
config.

Original issue reported on code.google.com by [email protected] on 6 Oct 2011 at 7:03
