Recent state of UniMRCP

A cool project in the CMUSphinx ecosystem is UniMRCP. It implements the MRCP protocol and backends that make speech functions available to the most common telephony frameworks such as Asterisk or FreeSWITCH.

Besides the Nuance, Lumenvox and SpeechPro backends, it supports Pocketsphinx, which is amazing and free. The downside is that it does not work out of the box. The decoder integration is just not done right:

  • The VAD does not properly strip silence before it passes audio frames to the decoder; because of that, recognition accuracy is essentially zero.
  • The decoder configuration is not optimal.
  • The result is not retrieved the way it should be.
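
To illustrate the first point, here is a minimal sketch of energy-based silence stripping in Python. The frame size, sample rate and energy threshold are made-up values for illustration only, not what UniMRCP or Pocketsphinx actually use:

```python
import math

FRAME = 160  # 10 ms at 16 kHz (assumed values)

def frame_energy(samples):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def strip_leading_silence(pcm, threshold=500.0):
    """Drop initial frames whose energy is below the threshold,
    so the decoder only sees frames that contain speech.
    The threshold here is an arbitrary illustrative value."""
    frames = [pcm[i:i + FRAME] for i in range(0, len(pcm) - FRAME + 1, FRAME)]
    for i, f in enumerate(frames):
        if frame_energy(f) >= threshold:
            return [s for f in frames[i:] for s in f]
    return []

# Synthetic signal: 5 silent frames followed by 5 frames of a 440 Hz tone.
silence = [0] * (FRAME * 5)
speech = [int(3000 * math.sin(2 * math.pi * 440 * n / 16000))
          for n in range(FRAME * 5)]
trimmed = strip_leading_silence(silence + speech)
print(len(trimmed))  # only the speech part remains: 800 samples
```

A real integration would of course use a proper VAD rather than a fixed energy threshold, but the point stands: the decoder should never be fed long stretches of leading silence.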
Also, UniMRCP does not work with recent Asterisk releases like 11; as far as I can see it works with 1.6. The new API is not supported.

So, a lot of work is needed to make it actually work and not confuse the users who build their models with CMUSphinx. However, the prospects are amazing, so one might want to spend some time finalizing the Pocketsphinx plugin in UniMRCP. I hope to see it done soon.

How To Choose Embedded Speech Recognizer

There are quite a few solutions around for building an open source speech recognition system for a low-resource device, and it's quite hard to choose. For example, you may need a speech recognition system for a platform like the Raspberry Pi and be deciding between HTK, CMUSphinx, Julius and many other implementations.

In order to make an informed decision you need to consider a set of features specifically required to run speech recognition in a low-resource environment. Without them your system will probably be accurate, but it will also consume too many resources to be useful. Some of them are:

Blizzard Challenge 2012

This year it is a little bit later than usual, but it's great that the Blizzard Challenge 2012 evaluation is now online.

This year is going to be very interesting. The data used to create the voices is taken from audiobooks, and one part of the test includes synthesis of whole paragraphs. That means you can actually estimate how a TTS built from public data can perform.

The links to register are:

For speech experts:

For other volunteers:

The challenge was created in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

Please distribute the second URL as widely as you can - to your colleagues, students, friends, other appropriate mailing lists, social networks, and so on.


ICASSP 2012

I recently attended the ICASSP 2012 conference in Kyoto, Japan. As expected, it was an amazing experience. Many thanks to the organizers, the technical program committee and the reviewers for their hard work.

The conference gathered more than a thousand experts in signal processing and speech recognition. The total number of submitted papers was more than 2000, and more than 1300 of them were presented. That is an enormous amount of information to process, and it was really helpful to be there and see everything yourself. Most importantly, of course, it's an opportunity to meet the people you work with remotely and talk about speech recognition in person. We talked quite a lot about the Google Summer of Code project we will run soon. You can expect very interesting features to be implemented there. It's so helpful to map virtual characters to real people.

And Kyoto, the ancient capital of Japan, was just beautiful. It's an amazing place to visit.

Given the number of papers and the amount of data, I think it's critically important to summarize the material, or at least to provide some overview of the results presented. I hope future organizers will fill that gap. For now, here is a not very long list of papers and topics I found interesting this year.

Google knows better

Well, it might be that my personal search is just trained that way.

Dealing with pruning issues

I spent a holiday looking into issues with pocketsphinx decoding in fwdflat mode. Initially I thought it was a bug, but it turned out to be just a pruning issue. The result looked like this:

INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT
INFO: <s> 0 5 1.000 -94208 0 1
INFO: parágrafo 6 63 1.000 -472064 -467 2
INFO: terceiro 64 153 1.000 -1245184 -115 3
INFO: as 154 176 0.934 -307200 -172 3
INFO: emendas 177 218 1.000 -452608 -292 3
INFO: ao 219 226 1.000 -208896 -181 3
INFO: projeto 227 273 1.000 -342016 -152 3
INFO: de 274 283 1.000 -115712 -75 3
INFO: lei 284 3059 1.000 -115712 -79 3
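
The start and end columns in this log are frame indices, which can be converted to time assuming the usual 100 frames per second (10 ms frame shift):

```python
FRAMES_PER_SECOND = 100  # typical 10 ms frame shift (assumed default)

def span_seconds(start_frame, end_frame):
    """Duration of a word hypothesis given its start and end frames."""
    return (end_frame - start_frame + 1) / FRAMES_PER_SECOND

# the suspicious last word "lei" from the log above
print(span_seconds(284, 3059))  # 27.76 seconds for a single short word
```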

Speech recognition is essentially a search for the globally best path in a graph. Beam pruning is used to drop nodes during the search whose score falls behind the best node's score by more than the beam width, like in this picture

If the beam is too narrow, the result might not be the globally best one even though it is locally best. In practice this can lead to complex issues like the one described above. Note that the word "lei" spans almost 2800 frames, which means almost 28 seconds. Another sign of overpruning is the number of words scored per frame
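
The effect can be sketched with a toy two-state trellis, where the globally best path goes through a state that scores poorly in the first frame. All scores, transition costs and beam widths below are invented purely for illustration:

```python
import math

def beam_search(trans, emit, beam):
    """Viterbi over a small trellis with beam pruning: states whose
    score falls more than `beam` below the frame's best score are
    dropped before the next frame is expanded."""
    n_states = len(emit[0])
    scores = [emit[0][s] for s in range(n_states)]
    for t in range(1, len(emit)):
        best = max(s for s in scores if s > -math.inf)
        # prune states outside the beam
        active = [s if s >= best - beam else -math.inf for s in scores]
        scores = [
            max(active[p] + trans[p][q] for p in range(n_states)) + emit[t][q]
            for q in range(n_states)
        ]
    return max(scores)

emit = [[-1.0, -6.0],    # frame 0: state 1 looks worse locally
        [-10.0, -1.0]]   # frame 1: state 1 is much better
trans = [[0.0, -100.0],  # moving from state 0 to state 1 is heavily penalized
         [0.0, 0.0]]

print(beam_search(trans, emit, beam=10.0))  # -7.0  (globally best path 1->1)
print(beam_search(trans, emit, beam=3.0))   # -11.0 (path 1->1 was pruned)
```

With a wide beam the search finds the globally best path through state 1 (score -7); with a narrow beam that path is pruned after the first frame and the search settles on a worse path (score -11). This is exactly the kind of failure visible in the fwdflat output above.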