Speech Decoding Engines Part 1. Juicer, the WFST recognizer

ASR today is quite diverse. In 1998 there was only the HTK package plus a few in-house toolkits (CMUSphinx, for example, was released only in 2000); now there are a dozen very interesting recognizers released to the public under open source licenses. Today we start a series of reviews about them.

So, the first one is Juicer, the WFST recognizer from IDIAP, the University of Edinburgh and the University of Sheffield.

Weighted finite state transducers (WFSTs) are a very popular trend in modern ASR, with famous adopters like Google, the IBM Watson center and many others. Basically, the idea is that you convert everything into the same format, which gives you not just a unified representation but also more advanced operations: merging (composition) to build a shared search space, and reduction (determinization and minimization) to make that search space smaller. The format also makes it easy to integrate other components; for example, you don't need to care about g2p separately anymore, since it can be done automatically by a transducer. For WFSTs themselves, I found a good tutorial by Mohri, Pereira and Riley, "Speech Recognition with Weighted Finite-State Transducers".
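
To make the "everything in one format" idea a bit more concrete, here is a minimal sketch of a two-word pronunciation lexicon written in the text format that OpenFst's fstcompile reads. The words, phone symbols and file names are made up for illustration and are not part of the Juicer setup; the OpenFst commands run only if the tools happen to be installed.

```shell
#!/bin/sh
# Toy lexicon transducer: phones in, words out. All names here are illustrative.
work=$(mktemp -d)

# Symbol tables: one "symbol id" pair per line; id 0 is reserved for epsilon.
cat > "$work/phones.syms" <<'EOF'
<eps> 0
g 1
ow 2
t 3
eh 4
n 5
EOF

cat > "$work/words.syms" <<'EOF'
<eps> 0
go 1
ten 2
EOF

# Arcs: source state, destination state, input label, output label.
# A line holding a single state number marks that state as final.
cat > "$work/lexicon.txt" <<'EOF'
0 1 g go
1 2 ow <eps>
0 3 t ten
3 4 eh <eps>
4 2 n <eps>
2
EOF

# Compile and optimize only when OpenFst is available on this machine.
if command -v fstcompile >/dev/null 2>&1; then
    fstcompile --isymbols="$work/phones.syms" --osymbols="$work/words.syms" \
        "$work/lexicon.txt" "$work/lexicon.fst"
    fstdeterminize "$work/lexicon.fst" | fstminimize > "$work/lexicon.min.fst"
fi

# Count the arc lines (the final-state line has no spaces).
echo "arcs: $(grep -c ' ' "$work/lexicon.txt")"
```

The grammar and the context-dependency model used later in this post are represented in the same way, which is what makes it possible to combine them into one search network.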

Juicer can do very efficient decoding with the standard set of ASR resources: an ARPA language model (a bigram, due to memory requirements), a dictionary, and cross-word triphone models that can be trained with HTK. The BSD license makes Juicer very attractive. Juicer is part of the AMI project, which targets meeting transcription; the other AMI deliverables are a subject for separate posts, though.

So here is a description of how to try it. Don't expect it to be straightforward though; it's not a trivial process. Well, one day we'll put everything on a live CD to make setting up an ASR development environment easier. For now you can follow this step-by-step howto, as many of our young friends call such a thing. I wonder where people get the idea that there is a detailed step-by-step howto for everything.

So, let's start. Download Juicer and its dependencies:


Unpack and build torch

tar xf Torch3src.tgz
cd Torch3
cp config/Linux.cfg .

Edit Linux.cfg to include packages: distributions gradients kernels
speech datasets decoder

# Packages you want to use
packages = distributions gradients kernels speech datasets decoder

Continue with the build

./xmake all
cd ..

Unpack kiss_fft:

tar xf kiss_fft-v1.2.8.tar.gz

There is no need to build kiss_fft separately; its build is included in the next step.

Unpack and build tracter

tar xf tracter-0.6.0.tar.bz2
cd tracter-0.6.0
aclocal && libtoolize && automake -a && autoconf
mkdir m4
./configure \
  --with-kiss-fft=/current_folder/kiss_fft_v1_2_8 \
  --with-htk-includes="-I/htk_folder/HTKLib"
make && make install
cd ..

Make sure you give the full path to the dependencies, since relative
paths will not work. Also note that for HTK you need to provide compiler
options, not folders. Alternatively, you can increase your pain by trying
to build tracter with cmake, as the readme describes.
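
Since relative paths are the easy mistake here, a small helper can resolve them before calling configure. This is only a sketch: the function name abs_dir and the two default locations are placeholders, so point KISS_DIR and HTK_DIR at wherever you actually unpacked things.

```shell
#!/bin/sh
# Resolve a directory to an absolute path, or fail if it does not exist.
abs_dir() {
    (cd "$1" 2>/dev/null && pwd) || {
        echo "missing directory: $1" >&2
        return 1
    }
}

# Placeholder locations; replace them with your own unpacked trees.
KISS_DIR=${KISS_DIR:-./kiss_fft_v1_2_8}
HTK_DIR=${HTK_DIR:-./htk/HTKLib}

if kiss=$(abs_dir "$KISS_DIR") && htk=$(abs_dir "$HTK_DIR"); then
    # HTK goes in as a compiler option (-I...), kiss_fft as a plain folder.
    echo ./configure --with-kiss-fft="$kiss" --with-htk-includes="-I$htk"
fi
```

The script only prints the configure line, so you can check it before running it for real.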

Unpack and build juicer

Make sure PKG_CONFIG_PATH makes tracter.pc reachable.

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
tar xf juicer-0.12.0.tar.bz2
cd juicer-0.12.0
aclocal && libtoolize && automake -a && autoconf
mkdir m4
./configure \
  --with-kiss-fft=/current_folder/kiss_fft_v1_2_8 \
  --with-htk-includes="-I/htk_folder/HTKLib"
make && make install
cd ..

Build openfst:

tar xf openfst-1.1.tgz
cd openfst
./configure && make && make install
cd ..

Setup environment variables:

export JUTOOLS=/current_folder/juicer-0.12.0/bin
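
Before going further it may be worth checking that the tools used in the rest of this post can actually be found. The sketch below assumes that the Juicer bin directory, HTK, sox and the sphinxbase tools have been added to PATH, which the steps above do not do for you; missing_tools is a hypothetical helper name.

```shell
#!/bin/sh
# Print every tool from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used later in this walkthrough.
missing=$(missing_tools gramgen lexgen cdgen build-wfst-openfst juicer \
    HHEd HCopy sox sphinx_lm_convert)
if [ -n "$missing" ]; then
    echo "not on PATH:" $missing
else
    echo "all tools found"
fi
```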

At this point Juicer and the required tools are built, so let's try it
with the HTK WSJ model from Keith Vertanen. Download the model
htk_wsj_si84_2750_8.zip and unpack it

unzip htk_wsj_si84_2750_8.zip

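Convert the model to ASCII (run this inside the unpacked wsj_si84_2750_8 folder, since the later steps expect to find wsj_si84_2750_8/ascii/hmmdefs)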

mkdir ascii
touch empty
HHEd -D -T 1 -H hmmdefs -H macros -M ascii empty tiedlist

Convert the turtle DMP model from pocketsphinx to the ARPA model turtle.lm

sphinx_lm_convert -i turtle.DMP -o turtle.lm -ifmt dmp -ofmt arpa

Remove the alternative pronunciation numbers from turtle.dic and build the phone set

sed 's/([0-9])//g' turtle.dic | tr '[:upper:]' '[:lower:]' > turtle.dic.lower
mv turtle.dic.lower turtle.dic
echo "<s>   sil" >> turtle.dic
echo "</s>   sil" >> turtle.dic
for w in `cat turtle.dic | cut -d" " -f 2-`; do echo $w; done | sort | uniq > turtle.phone
echo sp >> turtle.phone
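
To see what this clean-up does, here is the same kind of pipeline run on a made-up three-line sample instead of the real turtle.dic. The sample entries and file names are illustrative only.

```shell
#!/bin/sh
# A made-up sample in the CMU dictionary style; "(2)" marks an alternative pronunciation.
work=$(mktemp -d)
cat > "$work/sample.dic" <<'EOF'
GO G OW
THE DH AH
THE(2) DH IY
EOF

# Strip the "(N)" markers and lowercase everything.
sed 's/([0-9])//g' "$work/sample.dic" | tr '[:upper:]' '[:lower:]' > "$work/clean.dic"

# Collect the unique phones: drop the word column, put one phone per line.
cut -d' ' -f2- "$work/clean.dic" | tr ' ' '\n' | sort -u > "$work/sample.phone"

cat "$work/clean.dic"
cat "$work/sample.phone"
```

After the clean-up both pronunciations of "the" appear under the same word, and the phone list holds each phone exactly once.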

Due to some script limitations, different words cannot share the same pronunciation. So open turtle.dic and remove the line with the entry "two t uw", because it conflicts with "to t uw".
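
For a larger dictionary, hunting for such conflicts by hand gets tedious. The following sketch lists every pronunciation that is shared by more than one entry; it assumes one entry per line with the word first and space-separated phones after it, and find_homophones is just an illustrative name.

```shell
#!/bin/sh
# Print every pronunciation that is shared by more than one dictionary entry.
find_homophones() {
    awk '{
        word = $1
        $1 = ""                 # drop the word, keep the phones
        sub(/^ +/, "")          # trim the separator left behind
        if ($0 in seen) {
            print seen[$0], word, ":", $0
        } else {
            seen[$0] = word
        }
    }' "$1"
}

# Demo on a tiny made-up dictionary.
work=$(mktemp -d)
printf '%s\n' 'to t uw' 'ten t eh n' 'two t uw' > "$work/sample.dic"
find_homophones "$work/sample.dic"
```

Running find_homophones turtle.dic should point at the to/two pair mentioned above, along with any others.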

Now let's convert everything into WFST

gramgen -gramType ngram -lmFName turtle.lm -lexFName turtle.dic \
-fsmFName gram.fsm -inSymsFName gram.insyms -outSymsFName gram.outsyms \
-sentStartWord "<s>" -sentEndWord "</s>"

lexgen -lexFName turtle.dic  -monoListFName turtle.phone \
-fsmFName dic.fsm -inSymsFName dic.insyms -outSymsFName dic.outsyms \
-sentStartWord "<s>" -sentEndWord "</s>" -pauseMonophone sp -addPronunsWithEndPause

cdgen -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs -tiedListFName \
wsj_si84_2750_8/tiedlist  -monoListFName turtle.phone -fsmFName wsj.fsm \
-inSymsFName wsj.insyms -outSymsFName wsj.outsyms

To work around a Juicer bug, comment out the following lines in juicer-0.12.0/bin/aux2eps.pl:

#if ( ! %AUXSYMS )
#   print "no aux syms in symbol file - nothing to do\n" ;
#   exit 0 ;

Now let's compose everything into a single WFST

build-wfst-openfst gram.fsm dic.fsm wsj.fsm
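
The WFST steps produce quite a few files, and a failure in the middle is easy to miss. Here is a small sanity check, assuming the composition leaves its result in final.fsm, final.insyms and final.outsyms (the names the decoder command further down uses); check_files is a hypothetical helper.

```shell
#!/bin/sh
# Report which of the given files are missing or empty.
check_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}

check_files gram.fsm dic.fsm wsj.fsm final.fsm final.insyms final.outsyms \
    || echo "some WFST files are not in place yet"
```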

Everything is ready for decoding. Let's try it with goforward.raw from pocketsphinx

sox -r 16000 -2 -s goforward.raw goforward.wav

Create the HTK config:

TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
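
HTK expresses both values in units of 100 ns, so this config asks for a 10 ms frame shift and a 25 ms analysis window. The sketch below does the conversion and gives a rough frame count for a given number of samples; the helper names are made up, and the frame count assumes that a frame is produced only where a full window fits, so treat it as an estimate.

```shell
#!/bin/sh
# Convert an HTK time value (units of 100 ns) to milliseconds.
htk_to_ms() {
    awk -v t="$1" 'BEGIN { printf "%g\n", t / 10000 }'
}

# Rough frame count: samples, sample rate in Hz, TARGETRATE, WINDOWSIZE.
frame_count() {
    awk -v n="$1" -v rate="$2" -v hop="$3" -v win="$4" 'BEGIN {
        hop_s = hop * rate / 1e7     # frame shift in samples
        win_s = win * rate / 1e7     # window length in samples
        if (n < win_s) { print 0; exit }
        print int((n - win_s) / hop_s) + 1
    }'
}

htk_to_ms 100000.0                          # TARGETRATE
htk_to_ms 250000.0                          # WINDOWSIZE
frame_count 16000 16000 100000.0 250000.0   # one second of 16 kHz audio
```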

Convert to MFCC

HCopy -C config goforward.wav goforward.mfc

Create control file:

echo goforward.mfc > train.scp
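
With more than one recording, the lists are easier to generate than to type. HCopy can read "source target" pairs from a script file given with -S, so a sketch like the following writes both that file and the list of feature files for the decoder. The names make_lists, hcopy.scp and decode.scp are illustrative, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Write "source target" pairs for HCopy -S and the list of feature files.
# Usage: make_lists DIR   (DIR holds the .wav files)
make_lists() {
    dir=$1
    : > "$dir/hcopy.scp"
    : > "$dir/decode.scp"
    for wav in "$dir"/*.wav; do
        [ -e "$wav" ] || continue          # no .wav files at all
        mfc=${wav%.wav}.mfc
        echo "$wav $mfc" >> "$dir/hcopy.scp"
        echo "$mfc" >> "$dir/decode.scp"
    done
}

# Example, once the .wav files are in the current folder:
#   make_lists . && HCopy -C config -S hcopy.scp
```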


Run the decoder:

juicer -inputFormat htk -lexFName turtle.dic -inputFName train.scp -fsmFName final.fsm -inSymsFName final.insyms -outSymsFName final.outsyms  -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs

Get the result:

<s> go four are <s> ten meters </s>
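
If you want to compare the hypothesis with a reference transcript, the sentence markers get in the way. Here is a tiny filter that removes them and tidies up the spaces; strip_markers is just an illustrative name.

```shell
#!/bin/sh
# Remove <s> and </s> markers, then squeeze and trim the leftover spaces.
strip_markers() {
    sed -e 's/<\/*s>//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

echo "<s> go four are <s> ten meters </s>" | strip_markers
```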

It's not accurate for some reason. Probably the feature extraction is not the same as was used to train the acoustic model. I should probably also use a word insertion penalty.

Of course, not everything is perfect. The main issues with WFST decoders are very well described in the documentation. Basically, they are the memory requirements of first-pass decoding (that's why Juicer can't run trigram models on commodity hardware) and the lack of dynamic search optimization, which is more straightforward in other decoders. Anyway, the WFST framework has a lot of applications that go beyond just recognition: it's applied to speech indexing and open vocabulary decoding, and it simplifies confidence scoring.

That's it; you can consider it working and embed it into your software. Overall, it's an interesting package that demonstrates how simple things can be when you put everything into a flexible format. I'm sure CMUSphinx will follow this direction and implement WFST decoding soon. At the very least, we ultimately need to introduce FST tools into our framework.

