How to create a speech recognition application for your needs

Sometimes people ask: why are there no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-caption alignment, language acquisition tools and so on)? The answer, obviously, is that nobody has written them and made them public. It's often noted, for example by Voxforge, that we lack a database for the acoustic model. I admit Voxforge has its reasons to say we need a database, but that's only a small part of the problem, not the problem as a whole.

And, as always happens, the way the question is stated doesn't allow a constructive answer. To get a constructive answer you need to ask a different question: how do I create a speech recognition application?

To answer this, let me provide an example. Suppose we want to develop a Flash-based dictation website. The dictation application consists of the following parts, which need to be created:

  • A website with user accounts and user-dependent information storage
  • Initial acoustic and language models, trained on Voxforge audio and other free sources passed through the Flash codecs
  • A recognizer setup to convert incoming streams into text, and a distributed computation framework for the recognizer
  • A recognizer frontend with noise cancellation and VAD
  • An acoustic model adaptation framework to let users adapt the generic acoustic model to their pronunciation
  • A language model adaptation framework
  • A transcription control package that processes commands issued during dictation, such as error correction or punctuation commands
  • A post-processing package to restore punctuation and capitalization and to handle dates and acronyms
  • A test framework with dictation recordings and the ability to measure dictation accuracy
Everything above can be done with open source tools; the parts have approximately equal complexity and require minimal specialized knowledge. Performance-wise, such a system should be usable for large-vocabulary dictation by a wide range of users. The core components are:
  • Red5 streaming server
  • Adobe Flex SDK
  • Sphinx4
  • Sphinxtrain
  • Language model toolkit
  • Voxforge acoustic database
So you see, mostly it's just an implementation of existing algorithms and technologies. No rocket science. That makes me think such an application is just a matter of time.
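
To give a taste of the "no rocket science" part, here is roughly how the initial trigram language model could be built with the SRILM toolkit from whatever free text you collect (corpus.txt and your.lm are just placeholder names, and with a very small corpus you may have to fall back to a simpler discounting method):

ngram-count -text corpus.txt -order 3 -kndiscount -interpolate -lm your.lm

The acoustic side is similar in spirit: SphinxTrain drives the training from configuration files, so most of the effort goes into collecting and cleaning the Voxforge data.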

Multiview Representations at Interspeech

From my experience, it's important to have a multilevel view of any activity; interestingly, this is both a part of Getting Things Done and simply good practice in software development. Multiple models of the process, or just different views, help you understand what's going on. The only problem is keeping those views consistent. That reminds me of the Russian model of the world.

So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here are some topics from the plenary talks at Interspeech. Surprisingly, they are rather consistent with each other, and I hope they really represent trends, not just selected topics.

Speech To Information
by Mari Ostendorf

Multilevel representation gets more and more important, in particular in speech recognition. The most complicated task, the transcription of spontaneous meeting recordings, requires unification of the recognition efforts on all levels, from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information": as a result of speech recognition, not just the words are recovered but also the syntactic and semantic structure of the talk. One of the interesting tasks, for example, is restoration of punctuation and capitalization, something that SRILM does.
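
As an aside, and if I remember the SRILM options correctly, that kind of punctuation restoration can be sketched with the hidden-ngram tool: train a language model on text that keeps punctuation as tokens, list those tokens as hidden events, and let the tool insert them into plain text (the file names below are placeholders):

hidden-ngram -text plain.txt -lm punct.lm -hidden-vocab punct.vocab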

The good thing is that a test database for such material is already available for free download. It's a very uncommon situation to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.

Single Method
by Sadaoki Furui

The WFST-based T3 decoder looks quite impressive. A single method of data representation used everywhere, one which, more importantly, allows models to be combined, opens up wonderful opportunities. Consider, for example, building a high-quality Icelandic ASR system by combining the WFSTs of an English system with those of a very basic Icelandic one. I imagine the decoder is really simple, since basically all structures, including G2P rules, the language model and the acoustic model, can be weighted finite-state automata.
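
Just for reference (this is the textbook construction of Mohri, Pereira and Riley, not something from the talk): the whole decoding graph is a single composed transducer

H ∘ C ∘ L ∘ G

where H encodes the HMM topology, C the context dependency, L the pronunciation lexicon (where the G2P output ends up) and G the grammar or language model. Adapting to a new language then largely means swapping L and G while keeping the rest.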

Bayesian Learning
by Tom Griffiths

Hierarchical Bayesian learning and things like compressed sensing seem to be hot topics in machine learning. Google does that. There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed audio to the recognizer and have it understand you.

Though the probabilistic point of view was always questionable compared to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. Although a huge amount of audio is required (I remember estimates of about 100,000 hours), I think it's feasible nowadays. For example, such a system already recognizes handwritten digits, so success looks really close. And again, it's also multilevel!

A few open source speech projects

It's interesting how much activity around speech software has happened recently. I'm probably too impatient, trying to track everything interesting. Even though ISCA-students added a Twitter feed recently, their website still needs a lot of care. Hopefully Voxforge will become such a resource one day. There is a growing number of packages, tools, projects and events.

For example, I recently got in touch with the SEMAINE project led by DFKI, an effort to build a multimodal dialogue system which can interact with humans through a virtual character, sustain an interaction with a user for some time and react appropriately to the user's non-verbal behaviour. The sources are available and, as far as I understood, a new release is expected in December, so I'm definitely looking forward to it. The interesting thing is that SEMAINE incorporates an emotion recognition framework with libSVM as a classifier; such a framework would be useful in sphinx4, for example. Actually, a lot of news comes from the European research institutes these days; projects from RWTH or TALP promise a lot.

Another example: I was pleased to find out that there was a rich transcription evaluation in 2009. It's interesting why the results still aren't available and what progress has been made on the meeting transcription task since 2007.

Probably I would sleep better if I didn't know all of the above :)

Using SRILM server in sphinx4

Recently I've added support for the SRILM language model server to sphinx4, so it's possible to use much bigger models while keeping the same memory requirements, both during the search and, more importantly, during lattice rescoring. Lattice rescoring is still in progress, so here is how to use a network language model during search.

SRILM has a number of advantages: it implements a few interesting algorithms, and even for simple tasks like trigram language model creation it's way better than cmuclmtk. At least model pruning is supported.
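
For example, pruning an existing model down to a manageable size is a one-liner (the threshold and file names here are just for illustration):

ngram -lm your.lm -prune 1e-8 -write-lm pruned.lm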

To start, first dump the language model vocabulary, since it's required by the linguist:

ngram -lm your.lm -write-vocab your.vocab

Then start the server with:

ngram -server-port 5000 -lm your.lm

Configure the recognizer:

<component name="rescoringModel"
   type="edu.cmu.sphinx.linguist.language.ngram.NetworkLanguageModel">
   <property name="port" value="5000"/>
   <property name="location" value="your.vocab"/>
   <property name="logMath" value="logMath"/>
</component>

And start the lattice demo. You'll see the result soon.

Adjust the cache according to the size of your model. It shouldn't be large for a simple search; typically a cache size of no more than 100,000 entries is enough.

Still, using a large n-gram model is not reasonable for a typical search because of the large number of word trigrams that have to be tracked. It's more efficient to use a trigram or even a bigram model first and make a second recognizer pass, rescoring the lattice with the bigger language model. More details on rescoring in the next posts.

Rhythm of British English in Festival

It's interesting how ideas surface from time to time in seemingly unrelated places. Recently I read a nice post in John Wells's blog about proper RP English rhythm, and now the issue has come up again on the gnuspeech mailing list, where Dr. Hill cited his work:

JASSEM, W., HILL, D.R. & WITTEN, I.H. (1984) Isochrony in English speech: its statistical validity and linguistic relevance. In Pattern, Process and Function in Discourse Phonology (ed. Davydd Gibbon), Berlin: de Gruyter, 203-225.

I spent some time thinking about how this rhythm is handled in Festival and came to the conclusion that there is no such entity there. Probably it's somehow handled by the CART models for duration and intonation prediction, but not as a separate entity. Though many voices are supposed to be US English, I still think they could benefit from proper rhythm prediction. Try the example from the movie, "This is the house that Jack built", with the arctic voices and check whether Jack gets enough stress.
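
A quick way to listen for yourself, assuming Festival and one of the arctic voices are installed (select the voice first with its usual voice_NAME function, whatever name your installation uses), is to type at the Festival prompt:

festival
(SayText "This is the house that Jack built")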