Word position context dependency of Sphinxtrain and WFST

An interesting thing about Sphinxtrain models is that they use word position as part of the context when looking up the senone for a particular triphone. That means that, in theory, the senones for word-initial phones can be different from the senones for word-internal phones and word-final phones. This is actually sometimes the case:

ZH  UW  ER b    n/a   48   4141   4143   4146 N
ZH  UW  ER e    n/a   48   4141   4143   4146 N
ZH  UW  ER i    n/a   48   4141   4143   4146 N


AA  AE   F b    n/a    9    156    184    221 N
AA  AE   F s    n/a    9    149    184    221 N

Here, in the WSJ model definition from sphinx4, the letter in the fourth column marks the word position ("b" for beginning, "e" for end, "i" for internal, "s" for single), and the remaining numbers are the transition matrix id and the senone ids.
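To make the column layout concrete, here is a small sketch of parsing one such triphone line in Python. The field names are my own labels for the columns shown above; real mdef files also have a header and monophone section, which this ignores.

```python
# Hypothetical sketch: split one mdef triphone line into named fields.
# Column names are illustrative labels, not official mdef terminology.

def parse_triphone(line):
    """Parse a single triphone line from a Sphinx model definition."""
    fields = line.split()
    base, left, right, wpos, attrib, tmat = fields[:6]
    # The remaining numeric fields are the senone (tied state) ids;
    # the trailing "N" marks the non-emitting final state.
    senones = [int(s) for s in fields[6:-1]]
    return {
        "base": base, "left": left, "right": right,
        "wpos": wpos,        # b / e / i / s = begin / end / internal / single
        "attrib": attrib,    # "n/a" or a filler attribute
        "tmat": int(tmat),   # transition matrix id
        "senones": senones,
    }

tri = parse_triphone("ZH  UW  ER b    n/a   48   4141   4143   4146 N")
print(tri["wpos"], tri["senones"])  # → b [4141, 4143, 4146]
```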

However, if you want to build a WFST cascade from the model, embedding the word position into the context-dependent part of the cascade is a real issue. My solution was to ignore position. You can ignore position even in an already-trained model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.
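The "ignore position" idea can be sketched as a lookup that collapses the position marker: instead of keying senones on (base, left, right, wpos), try the position variants in a fixed order and take the first match. This is a toy illustration of the approach, not the actual cascade-building code; the table entries are the examples from above.

```python
# Toy senone table keyed like an mdef file: (base, left, right, wpos).
# Entries copied from the WSJ excerpts shown earlier in the post.
mdef = {
    ("ZH", "UW", "ER", "b"): (4141, 4143, 4146),
    ("AA", "AE", "F", "b"): (156, 184, 221),
    ("AA", "AE", "F", "s"): (149, 184, 221),
}

def lookup_wpos_independent(mdef, base, left, right):
    """Look up a triphone while ignoring word position: try each
    position variant and return the first one present in the table."""
    for wpos in ("i", "b", "e", "s"):
        senones = mdef.get((base, left, right, wpos))
        if senones is not None:
            return senones
    return None

# Both position variants of AA(AE,F) now resolve to one set of senones,
# so the context-dependency transducer needs no word-position labels.
print(lookup_wpos_independent(mdef, "AA", "AE", "F"))  # → (156, 184, 221)
```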

As of today you can do this easily: the mk_mdef_gen tool supports the -ignorewpos option, which you can set in the training scripts. With it, every triphone is counted as an internal one. My tests show that the resulting model is no worse than the original one, at least for conversational speech. Enjoy.

P.S. If you want to learn more about WFSTs, read Paul Dixon's blog http://edobashira.com and Josef Novak's blog http://probablekettle.wordpress.com
