CMUSphinx accepted at Google Summer of Code 2011

So we are in. Great to know that.

For more information see

http://cmusphinx.sourceforge.net/2011/03/cmusphinx-at-gsoc-2011/

I think it's a big responsibility and a big opportunity as well. Of course, we don't consider this a way to improve CMUSphinx itself or something that will let us get features coded for free. Instead, we are looking for new people to join CMUSphinx and become part of it. Maybe it's a great opportunity for Nexiwave as well.

For now the task is to prepare a list of ideas for the projects. I know they need to be drafted carefully. If you want to help, please jump in. I definitely need some help.

Fillers in WFST

Another practical question is: how do you integrate fillers? There is a silence class introduced in

A Generalized Construction of Integrated Speech Recognition Transducers by Cyril Allauzen, Mehryar Mohri, Michael Riley and Brian Roark

and implemented in transducersaurus.

But as you know, each practical model has more than just silence. Fillers like noise, silence, breath, and laughter all map to specific senones in the model. I usually try to minimize them during training, for example by joining all the ums, hmms, and mhms into a single phone, but I still think they are needed. How do you integrate them when you build a WFST recognizer?

So I tried a few approaches. For example, instead of adding just a <sil> class in the T transducer, I tried to create a separate branch for each filler. As a result the final cascade expands into a huge monster: if the cascade was 50 MB, then after combination with one silence class it is 100 MB, and after 3-4 filler classes it's 300 MB. Not a nice thing to do.

So I ended up with dynamic expansion of silence transitions, like this:

# expand each silence arc dynamically into one arc per filler
if edge.label == "<sil>":
    for filler in fillers:
        from_node.add_edge(filler)

This seems to work well.
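For completeness, here is a slightly fuller sketch of the same idea over a toy graph. The Node class, the labels, and the FILLERS list are just illustration assumed for this post, not actual transducersaurus code:

class Node:
    """Toy graph node: keeps a list of outgoing (label, target) edges."""
    def __init__(self, name):
        self.name = name
        self.edges = []

    def add_edge(self, label, target):
        self.edges.append((label, target))

# Hypothetical filler inventory; each label maps to a filler phone in the model.
FILLERS = ["<sil>", "+noise+", "+breath+", "+laugh+"]

def expand_silence_edges(nodes):
    """Replace every <sil> edge with one parallel edge per filler."""
    for node in nodes:
        expanded = []
        for label, target in node.edges:
            if label == "<sil>":
                # dynamic expansion happens at load time, in memory
                expanded.extend((f, target) for f in FILLERS)
            else:
                expanded.append((label, target))
        node.edges = expanded

The per-filler branches exist only in memory after loading, so the composed cascade stored on disk keeps a single silence class and never multiplies in size.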

Word position context dependency in Sphinxtrain and WFST

An interesting thing about Sphinxtrain models is that they use word position as part of the context when looking up the senones for a particular triphone. That means that, in theory, the senones for word-initial phones can differ from the senones for word-internal and word-final phones. It's actually sometimes the case:

ZH  UW  ER b    n/a   48   4141   4143   4146 N
ZH  UW  ER e    n/a   48   4141   4143   4146 N
ZH  UW  ER i    n/a   48   4141   4143   4146 N

but

AA  AE   F b    n/a    9    156    184    221 N
AA  AE   F s    n/a    9    149    184    221 N

Here, in the WSJ model definition from sphinx4, the symbol in the fourth column means "beginning", "end", "internal" or "single", and the numbers after it are the transition matrix id and the senone ids.
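To make the columns concrete, here is a small Python sketch that splits such a triphone line into named fields. The field names are my own reading of the mdef layout, a sketch rather than official Sphinxtrain code:

def parse_mdef_triphone(line):
    """Split one triphone line of a Sphinx model definition (mdef) file."""
    fields = line.split()
    return {
        "base": fields[0],       # base phone
        "left": fields[1],       # left context
        "right": fields[2],      # right context
        "wpos": fields[3],       # b=begin, e=end, i=internal, s=single
        "attrib": fields[4],     # attribute column, e.g. n/a
        "tmat": int(fields[5]),  # transition matrix id
        "senones": [int(s) for s in fields[6:-1]],  # one senone id per state
    }

print(parse_mdef_triphone("AA  AE   F s    n/a    9    149    184    221 N"))
# {'base': 'AA', 'left': 'AE', 'right': 'F', 'wpos': 's', ... 'senones': [149, 184, 221]}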

However, if you want to build a WFST cascade from the model, it's an issue how to embed the word position into the context-dependent part of the cascade. My solution was to ignore the position. You can ignore the position in an already-built model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.

As of today you can do this easily: the mk_mdef_gen tool supports an -ignorewpos option which you can set in the scripts. Basically everything is counted as an internal triphone. My tests show that this model is no worse than the original one, at least for conversational speech. Enjoy.
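Conceptually the option just drops the word-position dimension from the triphone inventory. Here is a toy sketch of that collapse, assuming a dict keyed by (base, left, right, wpos) with the senone ids from the table above; the real tool of course works on the training counts, not on a finished map:

# Hypothetical senone table keyed by (base, left, right, word position).
senone_map = {
    ("AA", "AE", "F", "b"): (156, 184, 221),
    ("AA", "AE", "F", "s"): (149, 184, 221),
}

def collapse_word_position(senone_map):
    """Keep one entry per triphone regardless of word position.
    This mirrors what -ignorewpos aims at: positional variants of a
    triphone end up sharing the same senones."""
    collapsed = {}
    for (base, left, right, wpos), senones in senone_map.items():
        collapsed.setdefault((base, left, right), senones)
    return collapsed

print(collapse_word_position(senone_map))
# {('AA', 'AE', 'F'): (156, 184, 221)}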

P.S. Want to learn more about WFSTs? Read Paul Dixon's blog http://edobashira.com and Josef Novak's blog http://probablekettle.wordpress.com