nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies - recognition, synthesis, identification. Mostly it's about the scientific part of it: the core design of the engines, the new methods, machine learning. It's also about the technical part, like the architecture of the recognizer and the design decisions behind it.

Do you trust speech transcription in the cloud

Building a Generic Language Model

I recently spent some time building a language model from the open Gutenberg texts. It was released today:

http://cmusphinx.sourceforge.net/2013/01/a-new-english-language-model-release/

Unfortunately, it turned out to be very hard to build a model that is relatively "generic". Language models are very domain-dependent, and it's almost impossible to build one good language model for every possible text. Books are almost useless for conversational transcription, no matter how much book text you have. And you need terabytes of data to improve accuracy by just 1%.
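To make the domain dependence concrete, here is a minimal sketch of how a mismatch shows up as perplexity. It is a toy add-one smoothed bigram model on made-up sentences, not the released model or the CMUSphinx training pipeline. All names (`train_bigram`, `perplexity`, the sample texts) are assumptions for illustration only.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect counts for a toy add-one smoothed bigram model."""
    history_counts = Counter(tokens[:-1])            # how often each word is a history
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = set(tokens)
    return history_counts, bigram_counts, vocab

def perplexity(model, tokens):
    """Perplexity of the token sequence under the bigram model."""
    history_counts, bigram_counts, vocab = model
    v = len(vocab) + 1                               # one extra slot for unseen words
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Made-up "book" training text and two test sentences (hypothetical data).
book_text = "the old house stood upon the hill and the wind blew upon the old house".split()
book_test = "the wind blew upon the hill".split()
chat_test = "yeah so um i was like you know whatever".split()

model = train_bigram(book_text)
ppl_book = perplexity(model, book_test)
ppl_chat = perplexity(model, chat_test)
print("book-like test perplexity:", ppl_book)
print("conversational test perplexity:", ppl_chat)
```

On the conversational sentence every bigram is unseen, so the model falls back to the uniform smoothing mass and the perplexity is higher than on the book-like sentence. Real models and real data behave the same way in kind, only on a much larger scale.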

Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more or less widely available, so it will be possible to extend and improve the model in the future.

I found this problem of domain dependence quite interesting to solve. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:


TWO DECADES OF STATISTICAL LANGUAGE MODELING: WHERE DO WE GO FROM HERE? by Ronald Rosenfeld


Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...

Recent advances in language models for speech recognition include discriminative language models, which are impractical to build unless you have unlimited processing power.


And the recurrent neural network language model implemented by the RNNLM toolkit:

Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, JP

RNNLMs are becoming very popular and they could provide some significant gains in perplexity and decoding WER, but I don't believe they can solve the domain-dependency issue and enable us to build a truly generic model.

So, research on the subject is still needed. Maybe, once the domain-dependency variance can be properly captured, a much more accurate language model could be built.




Recent state of UniMRCP

A cool project in the CMUSphinx ecosystem is UniMRCP. It implements the MRCP protocol and backends that let you use speech functions from the most common telephony frameworks, like Asterisk or FreeSWITCH.

Besides the Nuance, Lumenvox and SpeechPro backends, it supports Pocketsphinx, which is amazing and free. The bad side is that it's not going to work out of the box. The decoder integration is just not done right:

  • VAD does not properly strip silence before it passes audio frames to the decoder; because of that, recognition accuracy is essentially zero.
  • The decoder configuration is not optimal.
  • The result is not retrieved as it should be.

Also, UniMRCP is not going to work with recent Asterisk releases like 11; as far as I can see, it works with 1.6. The new API is not supported.
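
To illustrate the first point in the list above, here is a minimal sketch of what "stripping silence before the decoder sees the audio" means. This is a plain energy-threshold trimmer written for illustration. It is not the UniMRCP or Pocketsphinx VAD, and the frame length, threshold and padding values are assumptions.

```python
import math

def frame_energy_db(frame):
    """Mean energy of one frame of 16-bit samples, in dB (offset avoids log of zero)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1.0)

def strip_silence(samples, frame_len=160, threshold_db=30.0, padding=2):
    """Drop leading and trailing silent frames, keeping a little padding around speech."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    speech = [i for i, f in enumerate(frames) if frame_energy_db(f) > threshold_db]
    if not speech:
        return []                                    # nothing worth decoding
    first = max(speech[0] - padding, 0)
    last = min(speech[-1] + padding, len(frames) - 1)
    return frames[first:last + 1]

# Synthetic 16 kHz signal: 50 silent frames, 20 frames of a 440 Hz tone, 50 silent frames.
silence = [0] * (160 * 50)
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(160 * 20)]
kept = strip_silence(silence + tone + silence)
print("frames in:", 120, "frames passed to decoder:", len(kept))
```

Only the frames around the tone would be handed to the decoder. A real integration would work on a stream rather than a complete buffer and would use a more robust detector, but the idea is the same.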

So, a lot of work is needed to make it actually work and not confuse the users who build their models with CMUSphinx. However, the prospects are amazing, so one might want to spend some time on finishing the Pocketsphinx plugin in UniMRCP. I hope to see it soon.


How To Choose Embedded Speech Recognizer

There are quite a few solutions for building an open source speech recognition system for a low-resource device, and it's quite hard to choose. For example, you need a speech recognition system for a platform like Raspberry Pi and you are choosing between HTK, CMUSphinx, Julius and many other implementations.

In order to make an informed decision you need to consider a set of features specifically required to run speech recognition in a low-resource environment. Without them your system will probably be accurate, but it will also consume too many resources to be useful. Some of them are:


Blizzard Challenge 2012

It's a little later this year, but it's great that the Blizzard Challenge 2012 evaluation is now online.

This year it's going to be very interesting. The data to create the voices is taken from audiobooks, and one part of the test includes synthesis of paragraphs. That means that you can actually estimate how a TTS built from public data can perform.

The links to register are:

For speech experts:
http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerexperts.html

For other volunteers:
http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerweb.html

The challenge was created in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

Please distribute the second URL as widely as you can - to your colleagues, students, friends, other appropriate mailing lists, social networks, and so on.


ICASSP 2012


I recently attended the ICASSP 2012 conference in Kyoto, Japan. As expected, it was an amazing experience. Many thanks to the organizers, the technical program committee and the reviewers for their hard work.

The conference brought together more than a thousand experts in signal processing and speech recognition. More than 2000 papers were submitted and more than 1300 of them were presented. It's an enormous amount of information to process, and it was really helpful to be there and see everything yourself. Of course, most importantly, it's an opportunity to meet the people you work with remotely and talk about speech recognition in person. We talked quite a lot about the Google Summer of Code project we will run soon. You can expect very interesting features implemented there. It's so helpful to map virtual characters to real people.

And Kyoto, the ancient capital of Japan, was just beautiful. It's an amazing place to visit.

Given the number of papers and the amount of data, I think it's critically important to summarize the material or at least to provide some overview of the results presented. I hope that future organizers will fill that gap. For now, here is a fairly short list of papers and topics I found interesting this year.


Google knows better


Well, it might be my personal search trained that way

Dealing with pruning issues

I spent a holiday looking into issues with pocketsphinx decoding in fwdflat mode. Initially I thought it was a bug, but it turned out to be just a pruning issue. The result looked like this:

INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT
INFO: <s> 0 5 1.000 -94208 0 1
INFO: par..grafo 6 63 1.000 -472064 -467 2
INFO: terceiro 64 153 1.000 -1245184 -115 3
INFO: as 154 176 0.934 -307200 -172 3
INFO: emendas 177 218 1.000 -452608 -292 3
INFO: ao 219 226 1.000 -208896 -181 3
INFO: projeto 227 273 1.000 -342016 -152 3
INFO: de 274 283 1.000 -115712 -75 3
INFO: lei 284 3059 1.000 -115712 -79 3


Speech recognition is essentially a search for the globally best path in a graph. Beam pruning is used to drop nodes during the search if a node's score is worse than the best node's score by more than the beam width, as in this picture:
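
The effect of the beam can be sketched on a toy graph. The code below is only an illustration of time-synchronous search with beam pruning, not the pocketsphinx implementation. The function name, the graph and the scores are all made up for the example.

```python
def beam_search(frame_scores, transitions, start, beam):
    """Toy time-synchronous best-path search with beam pruning.

    frame_scores: list of dicts, state -> log score of that state in the frame
    transitions:  dict, state -> list of states reachable in the next frame
    beam:         hypotheses more than `beam` below the frame best are dropped
    """
    active = {start: (0.0, [start])}
    for scores in frame_scores:
        expanded = {}
        for state, (score, path) in active.items():
            for nxt in transitions[state]:
                new_score = score + scores[nxt]
                # Keep only the best-scoring way of reaching each state.
                if nxt not in expanded or new_score > expanded[nxt][0]:
                    expanded[nxt] = (new_score, path + [nxt])
        best = max(s for s, _ in expanded.values())
        # Beam pruning: drop everything too far below the current best.
        active = {st: sp for st, sp in expanded.items() if sp[0] >= best - beam}
    return max(active.values(), key=lambda sp: sp[0])

# Path "a" looks better in the first frame, path "b" is better overall.
transitions = {"start": ["a", "b"], "a": ["a"], "b": ["b"]}
frame_scores = [
    {"a": -1.0, "b": -5.0},
    {"a": -4.0, "b": -1.0},
    {"a": -4.0, "b": -1.0},
]

narrow = beam_search(frame_scores, transitions, "start", beam=2.0)
wide = beam_search(frame_scores, transitions, "start", beam=10.0)
print("narrow beam:", narrow)
print("wide beam:  ", wide)
```

With the narrow beam, path "b" is dropped after the first frame because it is temporarily worse, so the search can only return path "a". With the wide beam, "b" survives and wins with the better total score.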


If the beam is too narrow, the result might not be the globally best one even though it's locally the best. In practice this can lead to complex issues like the one described above. Note that the word "lei" spans about 2.8k frames (284 to 3059), which means about 28 seconds at the default 100 frames per second. Another sign of overpruning is number of words scored per frame