Time to buy a new video card

Everybody is playing with training and recognition on a GPU now. A 200x speedup on NVIDIA CUDA is worth the time and money, and the sample code is already available.

Thanks to prym on the #cmusphinx IRC channel for the link, and of course huge thanks to Chuan Liu for the article and the new project. It would be great to see similar patches for SphinxTrain/sphinx4; that would be a killer feature.

How to improve accuracy

People very often ask "how do I improve accuracy?". Since I got three questions like this just today, I decided to write a more or less extensive description of the ways to approach this problem. It will probably be a bit sketchy, but I hope it will be helpful. Corrections are very much appreciated as well.

1) First of all, let me mention that the problem is complex. It requires an understanding of hidden Markov models, beam search, language modelling and every other technology involved. I really recommend reading a book on speech recognition first. This one is very good:

http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165

In theory it should be possible to build a speech recognition system without such extensive knowledge, but that's not the case today. We are working on an easy-to-use system, but we are still at the very beginning.

Maybe you don't have time to read and study all this. Then consider whether you have time to implement a speech recognition system at all. At the very least, learn the basic concepts.

Please also learn how to program before you start. It should be obvious that you need software development experience; we really can't teach you Java.

2) The next step is to set up a basic example of the system and estimate its accuracy. While the first part is quite obvious, the second is often ignored. Don't ignore it: it's critical to test accuracy under realistic conditions during the whole development process.

Decide what kind of system you are going to implement: large-vocabulary dictation, medium-vocabulary name recognition, small-vocabulary command and control, or some other task, for example an IVR system. For each of these there is already a demo; use it as a base. Please don't try to build dictation on top of a command-and-control demo, it's just not suitable.
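
For orientation, the recognition loop in the sphinx4 demos looks roughly like the sketch below, modeled on the HelloWorld demo. The class name and config file name are placeholders; the "recognizer" and "microphone" component names come from the demo configuration files:

    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class Hello {
        public static void main(String[] args) {
            // Load the task-specific configuration (models, grammar, beams)
            ConfigurationManager cm =
                    new ConfigurationManager(Hello.class.getResource("hello.config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            // Capture live audio through the frontend defined in the config
            Microphone microphone = (Microphone) cm.lookup("microphone");
            if (!microphone.startRecording()) {
                System.err.println("Cannot start the microphone");
                recognizer.deallocate();
                return;
            }

            while (true) {
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("You said: " + result.getBestFinalResultNoFiller());
                }
            }
        }
    }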

3) Now the important task: estimating the accuracy. Examples of how to do that can be found in the decoder sources, in the tutorial and in many other places. Collect a test database, recognize it and compute an exact accuracy estimate.
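
For reference, WER = (substitutions + deletions + insertions) / number of reference words, that is, the word-level edit distance between the reference transcription and the hypothesis, normalized by the reference length. The alignment tools that ship with the decoders are more convenient in practice; this minimal sketch just shows the arithmetic:

    public class Wer {
        // Word error rate: edit distance between the reference and hypothesis
        // word sequences divided by the reference length.
        public static double wer(String reference, String hypothesis) {
            String[] ref = reference.trim().split("\\s+");
            String[] hyp = hypothesis.trim().split("\\s+");
            int[][] d = new int[ref.length + 1][hyp.length + 1];
            for (int i = 0; i <= ref.length; i++) d[i][0] = i; // deletions
            for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // insertions
            for (int i = 1; i <= ref.length; i++) {
                for (int j = 1; j <= hyp.length; j++) {
                    int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                    d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            }
            return (double) d[ref.length][hyp.length] / ref.length;
        }

        public static void main(String[] args) {
            // One substitution and one deletion over 4 reference words: 50% WER
            System.out.println(wer("turn on the light", "turn of light"));
        }
    }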

Typical results should be roughly the following:

  • Command and control: 5% WER (word error rate)
  • Medium vocabulary: 15% WER
  • Large vocabulary: 30% WER
  • Large vocabulary, short utterances: 50% WER

If you have noisy audio or accented speech, multiply these numbers by 2.

4) Compare the actual accuracy with the expected value. If the accuracy is roughly as expected, proceed to the next step. If not, search for the bug; most likely you've made a mistake in the system setup. Check that the configuration is suitable for the task, inspect the speech quality in a sound editor, look at the spectrum range, and verify the sampling rate, accent, dictionary and other parameters. If your WER is 90%, you've made a mistake for sure.
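
One concrete check: a sampling-rate mismatch, say 8 kHz telephone audio decoded with a model trained on 16 kHz speech, reliably produces terrible WER. Plain Java Sound is enough to inspect a file; the file name here is a placeholder:

    import java.io.File;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;

    public class CheckFormat {
        public static void main(String[] args) throws Exception {
            // Print the parameters the decoder will actually see
            AudioFormat fmt =
                    AudioSystem.getAudioFileFormat(new File("test.wav")).getFormat();
            System.out.println("Sample rate: " + fmt.getSampleRate() + " Hz");
            System.out.println("Channels:    " + fmt.getChannels());
            System.out.println("Bits:        " + fmt.getSampleSizeInBits());
        }
    }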

As an example of task-dependent training, consider the acoustic database. If you train a small-vocabulary acoustic model, you need a word-based acoustic model with a word-based phoneset (in the SphinxTrain case). If you are training on a large-vocabulary database, make sure your phoneset is not too large and that you have selected the proper number of senones/mixtures.

5) Once you've reached the baseline, it will take a lot of effort to improve on it. Consider whether it's enough for you and whether you can build your application with such accuracy; it's unlikely you'll get significantly better results. But if you are brave enough:

  • Use MLLT/VTLN feature-space transforms
  • Use MLLR and other types of online speaker adaptation
  • Adapt the language model, use context-sensitive language models
  • Tune the beams: try different values and measure the effect (see the sketch after this list)
  • Implement rejection of OOV (out-of-vocabulary) words and other noise sounds
  • Implement noise cancellation
  • Adapt the acoustic model and the dictionary to your speakers and their accent
  • ....
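
To illustrate the beam-tuning item above: in sphinx4 the beams are usually global properties in the XML configuration, and you can override them before allocating the recognizer. The property names and values below are the ones commonly seen in the demo configs, so treat them as assumptions and check your own config (or simply edit the XML if your version lacks setGlobalProperty):

    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class BeamTuning {
        public static void main(String[] args) {
            ConfigurationManager cm =
                    new ConfigurationManager(BeamTuning.class.getResource("myapp.config.xml"));
            // Wider beams: slower decoding but usually fewer search errors.
            // Narrower beams: faster decoding but accuracy may drop.
            // Re-measure WER on your test database after every change.
            cm.setGlobalProperty("absoluteBeamWidth", "500");
            cm.setGlobalProperty("relativeBeamWidth", "1E-80");
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();
            // ... run the accuracy test from step 2 ...
        }
    }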
Out-of-vocabulary words are the most frequent issue here. Contrary to what users expect, most demos don't do any OOV filtering out of the box, while it's critical for applications. Although unlimited-vocabulary systems exist, they are quite complex (though also possible to implement); most systems have a limited vocabulary. That means you need to implement OOV detection and/or confidence scoring in order to filter out the garbage. This is doable and is described in the demos too, for example in the confidence demo.
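
A rough sketch of that approach: score the recognition result and reject hypotheses whose confidence falls below a threshold. The "confidenceScorer" component name follows the demo configs, while the config file name and the 0.5 threshold are illustrative assumptions to be tuned on your data:

    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.ConfidenceResult;
    import edu.cmu.sphinx.result.ConfidenceScorer;
    import edu.cmu.sphinx.result.Path;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.LogMath;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class RejectOov {
        public static void main(String[] args) {
            ConfigurationManager cm =
                    new ConfigurationManager(RejectOov.class.getResource("confidence.config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            ConfidenceScorer scorer = (ConfidenceScorer) cm.lookup("confidenceScorer");
            recognizer.allocate();

            Result result = recognizer.recognize();
            if (result != null) {
                ConfidenceResult cr = scorer.score(result);
                Path best = cr.getBestHypothesis();
                // Confidence is kept in the log domain; convert to (0, 1]
                LogMath logMath = best.getLogMath();
                double confidence = logMath.logToLinear((float) best.getConfidence());
                if (confidence < 0.5) {
                    System.out.println("Rejected as OOV/garbage");
                } else {
                    System.out.println("Accepted: " + best.getTranscription());
                }
            }
        }
    }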

If you are short of ideas here, join the mailing list; we have a lot of features to implement.

Sphinx4-1.0beta3 is released

The best speech recognition engine is on its way to world domination. We are happy to announce the new sphinx4 release. This is still a development version, so testing and bug reports are very much appreciated.

Packages

https://sourceforge.net/projects/cmusphinx/files/sphinx4/1.0-beta3/

New Features and Improvements:
  • BatchAGC frontend component
  • Complete transition to defaults in annotations
  • ConcatFeatureExtractor to cooperate with cepwin models
  • End-of-stream signals are passed to the decoder for end-of-stream handling
  • Timer API improvements
  • Threading policy changed to TAS
Bug fixes:
  • Fixed reading UTF-8 from the language model dump
  • Huge memory optimization of the lattice compression
  • More stable frontend operation with DataStart and DataEnd and optional SpeechStart/SpeechEnd
Thanks:

Yaniv Kunda, Michele Alessandrini, Holger Brandl, Timo Baumann, Evandro Gouvea

Release of the Polish voice for Festival

A very remarkable and long-awaited release happened recently: the Polish multisyn voice for the Festival TTS system was made available. This is the best multisyn voice available nowadays, both in terms of speech material (several hours, much more than any ARCTIC database, around 500 MB of audio) and label quality (it has manually corrected segment labels). It also uses some unique modifications of the synthesis method, like target f0 prediction for multisyn combined with a ToBI/APML-based intonation module. The Scheme code also has some important modifications. I really encourage you to try this voice even if you don't understand Polish. I'm also looking forward to the HTS voices.

Thanks a lot to Krzystof for his hard work.