shiro's Introduction

SHIRO

Phoneme-to-Speech Alignment Toolkit based on liblrhsmm

Proudly crafted in C and Lua. Licensed under GPLv3.

Introduction

SHIRO is a set of tools based on HSMMs (Hidden Semi-Markov Models) for aligning phoneme transcriptions with speech recordings, as well as for training phoneme-to-speech alignment models.

Gathering hours of speech data aligned with phoneme transcriptions is, in most approaches to date, an important prerequisite for training speech recognizers and synthesizers. Typically this task is automated by an operation called forced alignment using hidden Markov models; in particular, the HTK software bundle has been the standard baseline method for both speech recognition and alignment since the mid-90s.

SHIRO presents a lightweight alternative to HTK under a more permissive license. It is like a stripped-down version of HTK that only does phoneme-to-speech alignment, but is equipped with HSMMs and written from scratch in a few thousand lines of rock-solid C code (plus a bit of Lua).

A little bit of history

SHIRO is a sister project of liblrhsmm, whose first version was developed over the summer of 2015. SHIRO was initially part of liblrhsmm and was later merged into Moresampler. Before being turned into a standalone toolkit, SHIRO supported flat-start training only, which is how it got its name (shiro means "white" in Japanese).

HSMM Primer

It is good to have a grasp of some basic concepts of Hidden Semi-Markov Models when working with SHIRO.

One way to understand HSMMs is through a Mario analogy. Imagine a Super Mario level with a flat map and a row of question blocks along the top.

Let's say each of the blocks contains a different hidden item. It could be a coin; it could be a mushroom. The items hidden in the first few blocks are more likely to be coins, and those in the final few blocks are more likely to be mushrooms.

Each time, Mario walks to the right by some random number of steps, then jumps and hits one of the blocks, which releases its item.

Now the question: Mario has walked through the map from left to right, and we are given the items the blocks released (in their original order). Can we infer at which positions Mario jumped?

This is the typical kind of problem HSMMs deal with: we are essentially aligning a sequence of items with a sequence of possible jump positions.

In the context of phoneme-to-speech alignment, Mario is hopping through a sequence of phonemes, each with some unknown duration, and as he passes through a phoneme, a sound wave (of that phoneme being pronounced) is emitted. We know which phonemes we have, and we have the entire sound file. The problem is to locate the beginning and end of each phoneme.

The HSMM terminology for describing such a problem is: each hopping interval is a hidden state. During a state, an output is emitted according to a probability distribution associated with that state. The duration of a state is also governed by a probability distribution. There are two things we can do:

  1. Inference. Given an output sequence and a state sequence, determine the most probable times at which each state begins and ends.
  2. Training. Given an output sequence, a state sequence, and the associated time sequence, find the probability distributions governing state durations and output emissions.

Speech, as a continuous process, has to be chopped into short pieces to fit the HSMM paradigm. This is done in the feature extraction stage, where the input speech is analyzed and features are extracted every 5 or 10 milliseconds. The features are condensed data describing what the input sounds like at a particular time. Also, in practice the mapping from phonemes to states is not one-to-one, because many phonemes have a richer temporal structure than a single state can model; we usually assign 3 to 5 states to each phoneme.
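The two operations above can be written down concretely. In a standard HSMM formulation (a general sketch; liblrhsmm's exact parameterization may differ), the likelihood of an observation sequence o_1..o_T under a known state sequence q_1..q_N with hidden durations d_1..d_N is

$$p(o_{1:T}, d_{1:N} \mid q_{1:N}) = \prod_{i=1}^{N} \Bigg[ p(d_i \mid q_i) \prod_{t=b_i}^{b_i+d_i-1} p(o_t \mid q_i) \Bigg], \qquad b_i = 1 + \sum_{j<i} d_j.$$

Inference searches for the durations d_1..d_N that maximize this quantity (yielding the state boundaries b_i); training estimates the duration distributions p(d | q) and emission distributions p(o | q) from data.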

Components

SHIRO consists of the following tools:

| Tool | Description | Input(s) | Output(s) |
|------|-------------|----------|-----------|
| shiro-mkhsmm | model creation tool | model config. | model |
| shiro-init | model initialization tool | model, segmentation | model |
| shiro-rest | model re-estimation (a.k.a. training) tool | model, segmentation | model |
| shiro-align | aligner (using a trained model) | model, segmentation | segmentation (updated) |
| shiro-untie | tool for untying monophone models | model, segmentation | model, segmentation |
| shiro-wav2raw | utility for converting .wav files into float binary blobs | .wav file | .raw file |
| shiro-xxcc | simple cepstral coefficients extractor | .raw file | parameter file |
| shiro-fextr.lua | feature extractor wrapper | directory | parameter files |
| shiro-mkpm.lua | utility for phonemap creation | phoneset | phonemap |
| shiro-pm2md.lua | utility for creating a model definition from a phonemap | phonemap | model def. |
| shiro-mkseg.lua | utility for creating a segmentation file from a .csv table | .csv file | segmentation |
| shiro-seg2lab.lua | utility for converting a segmentation file into Audacity labels | segmentation | Audacity label files |
| shiro-lab2seg.lua | utility for converting Audacity labels into a segmentation file | Audacity label files, .csv index | segmentation |
| shiro-wavsplit.lua | Lua script for utterance-level segmentation | .wav file | segmentation, Audacity label file, model |

Run any of these tools with the -h option for usage information.

Building

ciglet and liblrhsmm are the only library dependencies. You also need lua (version 5.1 or above) or luajit. No third-party Lua library is needed besides those already included in external/.

  • cd into ciglet and run make single-file. This creates ciglet.h and ciglet.c under ciglet/single-file/. Copy and rename this directory to shiro/external/ciglet.
  • Put liblrhsmm under shiro/external/ and run make from shiro/external/liblrhsmm/.
  • Finally, run make from shiro/ (the full sequence is sketched below).
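Putting the steps together, the whole build might look like this (a sketch assuming shiro/ and ciglet/ are checked out side by side):

cd ciglet
make single-file                              # creates ciglet/single-file/ciglet.{h,c}
cp -r single-file ../shiro/external/ciglet    # copy and rename to shiro/external/ciglet
cd ../shiro/external/liblrhsmm
make                                          # build the HSMM library first
cd ../..
make                                          # build SHIRO itself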

For reference, the directory structure should look like this:

  • shiro/external/
    • ciglet/
      • ciglet.h
      • ciglet.c
    • liblrhsmm/
      • a bunch of .c and .h files
      • Makefile, LICENSE, readme.md, etc.
      • external/, test/, build/
    • cJSON/
    • dkjson.lua, getopt.lua, etc.

Getting Started

The following sections include examples based on the CMU Arctic speech database.

Create model and (Arpabet) phoneme definitions for American English

The entire framework is in fact language-oblivious, because the mapping between phonemes and features is data-driven. To use SHIRO on a language of your choice, simply replace arpabet-phoneset.csv with a list of phonemes for that language.

lua shiro-mkpm.lua examples/arpabet-phoneset.csv \
  -s 3 -S 3 > phonemap.json
lua shiro-pm2md.lua phonemap.json \
  -d 12 > modeldef.json
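For orientation, the generated phonemap maps each phoneme to a list of states; judging from the fragments shown under Advanced Topics below, each state appears to carry a duration-distribution index (dur) and per-stream output-distribution indices (out). A hand-written sketch of one entry under that assumption (the actual indices are assigned by shiro-mkpm.lua):

    "ah":{
      "states":[{
          "dur":0,
          "out":[0,0,0]
        },{
          "dur":1,
          "out":[1,1,1]
        },{
          "dur":2,
          "out":[2,2,2]
        }]
    }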

Align phonemes and speech using a trained model

First step: feature extraction. Input waves are downsampled to a 16000 Hz sample rate, and 12th-order MFCCs with first- and second-order delta features are extracted.

lua shiro-fextr.lua index.csv \
  -d "../cmu_us_bdl_arctic/orig/" \
  -x ./extractors/extractor-xxcc-mfcc12-da-16k -r 16000

Second step: create a dummy segmentation from the index file.

lua shiro-mkseg.lua index.csv \
  -m phonemap.json \
  -d "../cmu_us_bdl_arctic/orig/" \
  -e .param -n 36 -L sil -R sil > unaligned.json

Third step: since the search space for an HSMM is an order of magnitude larger than for an HMM, it is more efficient to start from an HMM-based forced alignment, then refine the alignment using the HSMM in a pruned search space. When running HSMM training, SHIRO applies such pruning by default. You may need to widen the search space a bit (-p 10 -d 50) to avoid alignment errors caused by over-narrow pruning, although this will make it run slower. A rule of thumb for choosing p is to multiply the average number of states in a file by 0.1. For example, if an audio file contains on average 30 phonemes and each phoneme has 5 states, p should be 30 * 5 * 0.1 = 15. If you're doing alignment straight from an HSMM, the factor should be around 0.2 instead (30 * 5 * 0.2 = 30 in the same example).

./shiro-align \
  -m trained-model.hsmm \
  -s unaligned.json \
  -g > initial-alignment.json
./shiro-align \
  -m trained-model.hsmm \
  -s initial-alignment.json \
  -p 10 -d 50 > refined-alignment.json

Final step: convert the refined segmentation into label files.

lua shiro-seg2lab.lua refined-alignment.json -t 0.005

.txt label files will be created under ../cmu_us_bdl_arctic/orig/.

Train a model given speech and phoneme transcription

(Assuming feature extraction has been done.)

First step: create an empty model.

./shiro-mkhsmm -c modeldef.json > empty.hsmm

Second step: initialize the model (flat-start initialization, where every state starts from the global statistics of the data rather than from an existing alignment).

lua shiro-mkseg.lua index.csv \
  -m phonemap.json \
  -d "../cmu_us_bdl_arctic/orig/" \
  -e .param -n 36 -L sil -R sil > unaligned-segmentation.json
./shiro-init \
  -m empty.hsmm \
  -s unaligned-segmentation.json \
  -FT > flat.hsmm

Third step: bootstrap/pre-train using the HMM training algorithm and update the alignment accordingly.

./shiro-rest \
  -m flat.hsmm \
  -s unaligned-segmentation.json \
  -n 5 -g > markovian.hsmm
./shiro-align \
  -m markovian.hsmm \
  -s unaligned-segmentation.json \
  -g > markovian-segmentation.json

Final step: train the model using the HSMM training algorithm.

./shiro-rest \
  -m markovian.hsmm \
  -s markovian-segmentation.json \
  -n 5 -p 10 -d 50 > trained.hsmm

Using SPTK in place of shiro-xxcc

SHIRO's feature files are binary-compatible with the float blobs generated by SPTK, which allows the user to experiment with a plethora of feature types that shiro-xxcc does not support. An example of extracting MFCCs with SPTK is given in extractors/extractor-sptk-mfcc12-da-16k.lua:

return function (try_execute, path, rawfile)
  local mfccfile = path .. ".mfcc"
  local paramfile = path .. ".param"
  -- frame the raw samples (512-sample window, 80-sample shift) and
  -- compute 12th-order MFCCs at 16 kHz
  try_execute("frame -l 512 -p 80 \"" .. rawfile .. "\" | " ..
    "mfcc -l 512 -m 12 -s 16 > \"" .. mfccfile .. "\"")
  -- append first- and second-order delta features
  try_execute("delta -l 12 -d -0.5 0 0.5 -d 0.25 0 -0.5 0 0.25 \"" ..
    mfccfile .. "\" > \"" .. paramfile .. "\"")
end

Any Lua file that takes the raw file and produces a .param file will work.
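That is, an extractor is simply a Lua file that returns a function of (try_execute, path, rawfile). A minimal hypothetical skeleton, where my-feature-tool stands in for whatever command-line program computes your features:

-- sketch of a custom extractor; "my-feature-tool" is a hypothetical placeholder
return function (try_execute, path, rawfile)
  local paramfile = path .. ".param"
  -- read the float blob, write the parameter file SHIRO expects
  try_execute("my-feature-tool \"" .. rawfile .. "\" > \"" .. paramfile .. "\"")
end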

Note: parameters generated by shiro-xxcc are not guaranteed to match the results from SPTK, even under the same configuration.

Advanced Topics

Skippable phonemes

On certain occasions there can be slight mismatches between the speech and its phoneme transcription. One of the most common cases is the insertion of pauses between words or phrases. To correct this mismatch, we can add a pause phoneme ("pau" in Arpabet, for example) at every word and phrase boundary, and make such phonemes skippable by specifying a skip probability between 0 and 1 in the phonemap:

    ...
    "pau":{
      "pskip":0.5,
      "states":[{
          "dur":0,
          "out":[0,0,0]
        },{
          "dur":1,
          "out":[1,1,1]
        },{
          "dur":2,
          "out":[2,2,2]
        },{

Then shiro-mkseg.lua will add a skip transition across all the states of phoneme "pau" wherever it appears in the segmentation file.

Alternative intra-phoneme topologies

The states within a phoneme can also be skipped via a topology specification in the phonemap, for example:

    ...
    "pau":{
      "topology":"type-b",
      "states":[{
          "dur":0,
          "out":[0,0,0]
        },{
          "dur":1,
          "out":[1,1,1]
        },{
          "dur":2,
          "out":[2,2,2]
        },{

The default topology is type-a, which has no skips at all and works well most of the time.

Other options include:

  • type-b
  • type-c
  • skip-boundary

DAEM training

DAEM ("DorAEMon", i.e. Deterministic Annealing Expectation-Maximization) is a modified version of the standard HSMM training algorithm. In DAEM training, the log probabilities are scaled by a temperature coefficient that gradually increases from 0 to 1 over the iterations. It has been reported in the literature that DAEM improves the accuracy of flat-start-trained HMM speech recognition systems.

To enable DAEM for shiro-rest, simply add the -D option. The displayed log-likelihood will be adjusted for the temperature.
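In the usual formulation (a general sketch matching the description above, not taken verbatim from liblrhsmm), the tempered probabilities are

$$\log \tilde{p}(\cdot) = \beta \log p(\cdot), \qquad 0 < \beta \le 1,$$

where the temperature coefficient β starts near 0 (heavily smoothed posteriors) and rises to 1 (standard EM) as training proceeds, helping a flat-started model avoid poor local optima.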


shiro's Issues

lua5.2: shiro-fextr.lua:54: module '/home/___/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-dae-16k.lua' not found:

Hi,

Building on Linux, I'm encountering a problem running SHIRO.

I've tried adding .lua to the extractor path as well, but I get the same error.

lua5.2 shiro-fextr.lua ~/Downloads/UTAU/Resonance_Harmony_Arpasing_English/Base_B3/index.csv -d ~/Downloads/UTAU/Resonance_Harmony_Arpasing_English/Base_B3/ -x ~/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k -r 16000
lua5.2: shiro-fextr.lua:54: module '/home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k' not found:
no field package.preload['/home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k']
no file '/usr/local/share/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.lua'
no file '/usr/local/share/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k/init.lua'
no file '/usr/local/lib/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.lua'
no file '/usr/local/lib/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k/init.lua'
no file '/usr/share/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.lua'
no file '/usr/share/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k/init.lua'
no file './/home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.lua'
no file '/usr/local/lib/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.so'
no file '/usr/lib/x86_64-linux-gnu/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.so'
no file '/usr/lib/lua/5.2//home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.so'
no file '/usr/local/lib/lua/5.2/loadall.so'
no file './/home/myname/Prog/audio/SHIRO/extractors/extractor-xxcc-mfcc12-da-16k.so'
stack traceback:
[C]: in function 'require'
shiro-fextr.lua:54: in main chunk
[C]: in ?

content of index.csv

Hi,
lua shiro-fextr.lua index.csv -d "../cmu_us_bdl_arctic/orig/" -x ./extractors/extractor-xxcc-mfcc12-da-16k -r 16000
Can you tell me what the content of the index.csv file is? It is one of the input arguments for speech-phoneme alignment.
Also, what path should be provided for the -d argument?

Thanks

Supported audio/index length?

Hello,

I am building a dataset to train with and need to ask a few questions before proceeding.

What is the maximum supported/suggested audio length? Is several minutes alright, or should the audio be limited to about ~20 seconds or so? Likewise, is there a reasonable limit to the length of the index?

Thank you.

When loading the model, a null pointer is always returned

Hello, first of all thanks for the nice framework.

The extraction of MFCCs with first- and second-order delta features works well. After that, when I load the model (.hsmm), I get this error:

Error: failed to load model from blah blah

Some model files (e.g. empty.hsmm) don't trigger the error. I also made test.txt and test.hsmm files and changed the path to point at them, to check the fopen call in hsmm = load_model(optarg) in shiro-rest.c, but it also fails. fopen reports success (checking with perror prints 'Success'), and a custom C file I wrote can read any .hsmm or test.txt file, but loading only fails in your shiro-rest.c code.

I can't resolve this situation; how can I fix this problem?


Phonetic stress without creating new phonemes?

I am looking to use SHIRO to label speech with the stresses in place. Does SHIRO support this without treating stressed variants as unique phonemes?

If not, would it be OK to request this as a feature? Being able to do something like ah durfloor 0.4 aka ah0 aka ah1, so as not to waste data but still output the stress in the final label, would be very useful.

Thank you.
