Digging out the acoustic model bug

It's so useful to sleep a lot and care about the details. Today I found a bug in the model that I could have searched for for years. By accident, the topology of the model was wrong: I was using a 5-state HMM without the skip transition from state n to state n+2. The result was quite bad recognition accuracy. This issue was so well hidden I wonder if it's possible to discover such things at all.

The method that helped is comparative debugging. By checking the performance of pocketsphinx and sphinx4 I found that they differ, but that wasn't enough. The critical point was an unrelated change in pocketsphinx: previously it assigned very small non-zero transition probabilities even if the model had zero transitions. Happily, David changed that about a month ago, and the difference in recognition rate between the recent and older versions of pocketsphinx helped locate the problem. I was really lucky today.
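As a toy sketch of the check that would have caught this (this is not the actual Sphinx model format, and the function names are made up for illustration), here is a 5-state left-to-right transition matrix built with and without the skip transition, plus the effect of the old probability flooring:

```python
import numpy as np

def bakis_transitions(n_states=5, skip=True):
    """Build a row-stochastic left-to-right (Bakis) transition matrix.

    Each state may loop and move to the next state; when skip=True it
    may also jump from state n to state n+2.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        targets = [t for t in (i, i + 1, i + 2 if skip else -1)
                   if 0 <= t < n_states]
        A[i, targets] = 1.0 / len(targets)
    return A

def has_skip(A):
    """True if any state has a non-zero n -> n+2 transition."""
    n = A.shape[0]
    return any(A[i, i + 2] > 0 for i in range(n - 2))

good = bakis_transitions(skip=True)
bad = bakis_transitions(skip=False)     # the broken topology I was using
print(has_skip(good), has_skip(bad))    # True False

# The old pocketsphinx behavior floored zero transitions to a tiny
# probability, which masks exactly this kind of topology error:
floored = np.where(bad == 0, 1e-8, bad)
print(has_skip(floored))                # True -- the bug becomes invisible
```

The flooring example shows why the recent pocketsphinx change mattered: once zero transitions stayed zero, the broken topology actually behaved differently and could be noticed.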

I wonder how diagnostics could be extended here to help resolve issues like that. It seems extremely important to me to build a recognition system that allows verification. A similar problem came up today on another front, by the way: our router had an issue in its DNS relay, but it was almost impossible to discover the reason due to limited diagnostic output. We really need to rethink the current way of error reporting.

Release of msu_ru_nsh_clunits

I don't want to be boring and make this feed a clone of SourceForge, but I've recently released a new Russian voice for Festival.


The new release contains a lot of previous updates that were distributed unofficially. The most important feature is that the labels were updated automatically with SphinxTrain / MLLT models, which improved the quality a bit. I haven't debugged the current problems though, so it's not clear what the next steps to improve the quality are. It is rather clear, though, that better join algorithms and HMM-based cost functions will improve accuracy. I also wanted to look at the pending transcription algorithm for Russian used in academic Russian synthesizers.

I finally followed the industry mainstream and abandoned hand-made labels. It's much easier to keep them automatic, for sure, because it brings more flexibility; hand labeling has proven error-prone as well, and I don't believe in it anymore. I hope that better algorithms like Minimum Segmentation Error training will do their work in creating a near-perfect segmentation for the TTS database. I also wanted to think about processing algorithms that are robust to segmentation errors: they are more reasonable to apply in a situation where errors are present by design.
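To illustrate the kind of join cost I have in mind (a sketch, not Festival's actual clunits implementation; the feature vectors and weights are invented for illustration):

```python
import numpy as np

def join_cost(left_unit, right_unit, w_spectral=1.0, w_f0=0.5):
    """Toy join cost between two candidate units at a concatenation point.

    Each unit is a dict with a spectral vector at the boundary frame
    ('mfcc') and a boundary pitch value ('f0'). Lower cost means a
    smoother join; identical boundaries join for free.
    """
    spectral = np.linalg.norm(left_unit["mfcc"] - right_unit["mfcc"])
    f0 = abs(left_unit["f0"] - right_unit["f0"])
    return w_spectral * spectral + w_f0 * f0

a = {"mfcc": np.array([1.0, 2.0, 3.0]), "f0": 120.0}
b = {"mfcc": np.array([1.0, 2.0, 3.0]), "f0": 120.0}
c = {"mfcc": np.array([4.0, 0.0, 1.0]), "f0": 180.0}

print(join_cost(a, b))                     # 0.0 -- identical boundaries
print(join_cost(a, c) > join_cost(a, b))   # True -- mismatched boundaries cost more
```

A mislabeled boundary shifts the boundary frames, which is exactly why a cost function like this degrades with segmentation errors and why robustness to them matters.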

I shifted my day schedule to US time again, which is not ideal. It had returned to normal last week, when I had to stay awake a whole night and day. That turned out to be productive, but now the schedule has drifted back again. I hope I'll be able to restore it soon.

Sphinxtrain Release

It seems this will be a month full of releases: SphinxTrain was released today.

SphinxTrain is the acoustic model training system for the Sphinx family of continuous speech recognition systems. After years of not having an actual release of SphinxTrain, it was time to make one, in anticipation of potentially restructuring the training code. This trainer can produce acoustic models for all versions of Sphinx, and supports VTLN, speaker adaptation and dimensionality reduction. Future releases will support discriminative and speaker-adaptive training, and will be more closely integrated with the Sphinx decoders.
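For a feel of what VTLN does, here is a sketch of one common piecewise-linear warping scheme (implementations differ in details, so treat the cutoff ratio and the exact formula as illustrative, not as what SphinxTrain uses):

```python
def vtln_warp(freq, alpha, f_max=8000.0, f_cut_ratio=0.875):
    """Piecewise-linear VTLN frequency warp.

    Frequencies below the cutoff are scaled by the warp factor alpha;
    above it the warp follows a straight line chosen so that f_max
    maps back to f_max, keeping the overall bandwidth fixed.
    """
    f_cut = f_cut_ratio * f_max
    if freq <= f_cut:
        return alpha * freq
    # linear segment from (f_cut, alpha * f_cut) to (f_max, f_max)
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (freq - f_cut)

print(vtln_warp(1000.0, 1.0))   # identity warp: 1000.0
print(vtln_warp(1000.0, 0.9))   # compressed for a longer vocal tract: 900.0
print(vtln_warp(8000.0, 0.9))   # band edge is preserved: 8000.0
```

During training, alpha is typically picked per speaker from a small grid (e.g. 0.88 to 1.12) by maximizing the likelihood of that speaker's data.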

Asterisk and pocketsphinx

Today I'm happy to see long-awaited proper support for the Asterisk Speech API in pocketsphinx. Taking into account the FreeSWITCH support, we now provide competitive support for speech recognition applications on two popular PBX platforms.

Get the details here

New release of sphinx4

I'm really happy to announce the new release of sphinx4, the speech recognition engine written in Java. I wouldn't say it is the greatest release ever, but it's supposed to breathe life into this project, which has quite a long history. A huge number of developers and companies were involved in the development of this product and many products rely on it, but unfortunately the project itself has somewhat stalled. This new release mostly packages what was done over the last 5 (!) years since the previous release, and the major features are:
  • Rewritten XML-based configuration system
  • Frontend improvements like better VAD or MLLT transformation matrix support
  • Cleanup of the properties
  • Batch processor with accuracy estimation for testing
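For context on the MLLT item above: once trained, an MLLT transform is simply a square matrix applied to every feature frame before decoding. A minimal illustration (a random matrix stands in for a trained one, and the 13-dimensional frames are only typical, not what sphinx4 mandates):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 13))  # MFCC-like feature frames (illustrative)
mllt = rng.standard_normal((13, 13))     # a real MLLT matrix is estimated in training

# Apply the same linear transform to each frame at once.
transformed = frames @ mllt.T
print(transformed.shape)  # (100, 13)
```

The interesting part is estimating the matrix (it is trained to make diagonal-covariance Gaussians fit better); applying it at decode time is just this matrix multiply per frame.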
The project itself still clearly lacks well-organized documentation and better usability, but I think that's a subject for the next release, due in half a year or so. I really want to see it become a more reliable platform for speech recognition development, and I hope that new features, active community support, regular releases and a stable API will do their work.