nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies - recognition, synthesis, identification. Mostly it's about the scientific part of it, the core design of the engines, the new methods, machine learning, and about the technical part like the architecture of the recognizer and the design decisions behind it.

Tuesday, January 5, 2010

Post-biological ASR

Recently I had a discussion with ionel on the #cmusphinx chat on freenode about what a perfect speech recognition engine should do. Though I didn't quite understand the purpose of the question, the answer I could give was to make it as good as a native speaker at converting audio into text.

Today I watched an interview with Ray Kurzweil. There was nothing really interesting there except the idea of a post-biological future where computers replace humans. I understood that my definition of ASR is not a very good one, simply because it's a well-established idea that computers will soon become way better than humans at most tasks, just as they are already better at playing chess. I tend to forget this over and over, but it's perfectly reasonable to try to be better than humans and not merely mimic human functions. Automatic recognizers could be better in terms of speed, energy consumption and accuracy. Let's hope this year will bring us closer to such a future. What will speech be then, that's the question.

Monday, January 4, 2010

Greetings and Random Thoughts

So 2010 is here, Happy New Year everyone. I wish you all success and happiness and of course increased decoder accuracy! Now we have a long 10-day vacation in Russia, time to travel, eat, drink, sort out bookmarks, read the books on the shelf and watch pending Google tech talks. Santa also promised me to make some great changes in sphinx4, so I'm waiting for that as well.

Though Ohloh doesn't confirm that, I have a strong feeling that last year the activity around CMUSphinx definitely increased and its usage is going to grow.

I was thinking a little about what the direction of sphinx4 development should be, and I think we should consider several factors here. I would be happy to see it as a widely used enterprise-level speech recognition engine with a great list of features, but I completely understand that due to the lack of resources it's naive to think we'll be able to do it all. We definitely need to find a market sector for the sphinx project and grow using it. There are already well-established projects like HTK that are widely used, with their own sets of strong and weak features. Julius is widely used as a large vocabulary speech recognition engine with HTK models. It's hard for us to compete with HTK simply because it would take years to add that flexibility, which we probably don't even need. Consider a variable or adjustable number of states per phone, something that has only been proven useful for small vocabulary tasks, which we aren't really interested in and, I hope, will not be interested in in the near future. What could be different is our practical orientation.

Many projects in the speech domain and related areas have grown out of research projects and, though sometimes flexible, are often really unusable in applications since they weren't designed for that. Usually a research project isn't well documented and has a lot of ways to implement the same thing, some of which are obsolete. Bugs are rarely fixed and documentation is almost missing. Releases are not stable. It's definitely a large field for a commercial support company.

There is a different side: many projects are created to solve user needs, are more or less well documented and have stable interfaces and a large open community, but they do things so wrong internally that I always wonder how they are used at all. Espeak, with its amazingly bad speech synthesis quality and even more amazing popularity, is one example. Its out-of-date synthesis method doesn't let it be good with any possible modifications. Another example of this, strikingly, is Lucene. The lucidimagination blog states that the Lucene community is thriving, but that's definitely not true. Research articles like "Lucene and Juru at Trec 2007: 1-Million Queries Track" definitely show there is something wrong with Lucene. Basically, it lists several trivial changes, well known in the research community, that make Lucene perform two times better on a standard test. I can't understand why this hasn't been integrated into stock Lucene three years after the article was published.

Let's hope CMUSphinx will find its place somewhere in the middle. Also, let's hope this year will bring more useful posts, decreasing the information overload that is certainly going to be a problem in the near future.

Wednesday, December 16, 2009

Pocketsphinx Success Story

I was pleased to discover the renovated website of Keith Vertanen and an amazing real-life example of building a pocketsphinx application: Parakeet, a dictation app with correction for the Nokia N800:

http://www.keithv.com/software/parakeet/n800/

Keith's website and models are an invaluable resource for Sphinx developers; in particular, his lm_giga models are still the models I would recommend for adaptation. But seeing this application in action and reading about its development should really give good insight into the process of building a speech recognition application, with all the good practices described.


Saturday, December 12, 2009

Core Ideas Behind Speech Recognition

While tuning the acoustic model I got 40% WER again, and the following in the log:

THEY'RE ONLY ALLOWED TEN  TO    a   CLASS    (a100)
***     HAD  BARELY  LEAD CANDY a   CLASSIC  (a100)
Words: 7 Correct: 1 Errors: 6 Percent correct = 14.29% Error = 85.71% Accuracy = 14.29
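
For reference, the error figure in the summary line above can be reproduced with a word-level edit distance between the reference and the hypothesis. The following is only a minimal sketch; the function name `word_errors` is made up for illustration and this is not the scoring tool that produced the log.

```python
def word_errors(reference, hypothesis):
    """Return (minimum number of word errors, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # r was deleted
                           cur[j - 1] + 1,            # h was inserted
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1], len(ref)


errors, words = word_errors("THEY'RE ONLY ALLOWED TEN TO a CLASS",
                            "HAD BARELY LEAD CANDY a CLASSIC")
print(f"Words: {words} Errors: {errors} Error = {100.0 * errors / words:.2f}%")
```

Only one word ("a") lines up, so six of the seven reference words count as errors, which matches the 85.71% in the log.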

If you check this recognition error, you'll find that it's almost impossible to find the reason for it and fix it. Perhaps some senone was trained incorrectly, perhaps CMN gave an error or clipping made the MFCCs wrong. Perhaps some noise in the middle broke the search. There is nothing you can do about it. That made me think about the foundations of ASR.

Considering a speech recognition engine like sphinx4, one could extract the set of core ideas that lie behind it. The same ideas are usually described in speech recognition textbooks. Basically they are:
  • MFCC feature extraction from periodic frames (or PLP, doesn't matter)
  • HMM classifier for acoustic scoring (with state tying)
  • Trigram word-based language model (higher-order n-grams aren't effective, lower-order ones are not so precise)
  • Dynamic search with pruning
Surely commercial systems have a lot of improvements over this baseline, but the core is still the same. Such foundations are certainly reasonable and have been checked over the years in practice. It's hard to argue against them. Newbies often say that something is wrong here, but basically it's because they don't really understand how it works. Criticism also comes from old-school linguists, who do everything with rules and are mostly interested in cases like the pronunciation of "schedule" rather than in theory.

The only issue is that a growing number of unsolvable, unexplainable problems, like the accuracy problem above, breaks this theory. This is quite an unusual fact for me as a mathematician, since in mathematics theories rarely become invalid. They transform and grow, but usually they are stated once and forever. In natural sciences like physics it's usual. The aether theory and the mechanical explanation of gravitation are good examples that come to my mind. So there is nothing wrong with this ideology of speech recognition being reviewed and modified according to recent findings.
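
To make the "dynamic search with pruning" item from the list above a bit more concrete, here is a minimal sketch of Viterbi decoding with a beam over a toy two-state HMM. Everything in it is an assumption made for illustration: the states, the probabilities and the function names are invented, and this is not how sphinx4 implements its search.

```python
import math


def prune(hyps, beam):
    """Keep only hypotheses whose log score is within `beam` of the best one."""
    best = max(score for score, _ in hyps.values())
    return {s: (score, path) for s, (score, path) in hyps.items()
            if score >= best - beam}


def viterbi_beam(observations, states, log_start, log_trans, log_emit, beam):
    """Return (log score, state path) of the best surviving hypothesis."""
    # active maps each surviving state to (log score, best path ending there)
    active = {s: (log_start[s] + log_emit[s][observations[0]], [s])
              for s in states}
    active = prune(active, beam)
    for obs in observations[1:]:
        expanded = {}
        for prev, (score, path) in active.items():
            for s in states:
                cand = score + log_trans[prev][s] + log_emit[s][obs]
                if s not in expanded or cand > expanded[s][0]:
                    expanded[s] = (cand, path + [s])
        active = prune(expanded, beam)
    return max(active.values(), key=lambda item: item[0])


def logs(table):
    """Convert a nested dict of probabilities into log probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Toy model: made-up numbers, silence versus speech from frame energy.
STATES = ("sil", "speech")
LOG_START = logs({"sil": 0.8, "speech": 0.2})
LOG_TRANS = logs({"sil": {"sil": 0.7, "speech": 0.3},
                  "speech": {"sil": 0.2, "speech": 0.8}})
LOG_EMIT = logs({"sil": {"low": 0.9, "high": 0.1},
                 "speech": {"low": 0.2, "high": 0.8}})

score, path = viterbi_beam(["low", "high", "high", "low"], STATES,
                           LOG_START, LOG_TRANS, LOG_EMIT, beam=5.0)
print(path, math.exp(score))
```

A wide beam keeps every hypothesis and gives the exact Viterbi answer; a narrow beam drops states early and saves work. On this toy input both happen to agree, but in general a narrow beam may prune the path that would have won later.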

What would I put into such a modified theory:
  • Multiresolution feature extraction. Starting from RASTA to fMPE and spikes. The idea is that signals are sparse and nonperiodic, that they range from 10 ms to more than 10 seconds, and that they all need to be passed into the classifier.
  • Some acoustic classifier without selected states. The idea of a phone is probably natural in slow speech or in teaching, but I have heard so many complaints about it. Dropping it seems promising indeed, since speech is a process, not a sequence of states. Unfortunately I haven't found any article on this yet. Another promising idea here is margins, which could help with out-of-model sounds.
  • Subword stage. I increasingly think that languages with developed morphology like Turkish are more the rule than the exception. Being able to recognize a large set of words in the language is a core capability of a usable recognizer, and that forces it to operate on subword units. Even an English recognizer could benefit from this.
  • Language model without backoff. I recently had a discussion with David about that and would like to thank him for this idea. Indeed, the counts of the model seem to be a reasonable statistic one could keep and use. But the further calculation of the language weight should be modified completely. Again, there must be a margin to strip some combinations that will never appear in the language. This idea of using prohibitive rules has stayed in my mind for a long time. It would also be nice to find any recent articles on this. But there must be a component that will invalidate output like "barely lead candy".
  • Machine learning for backoff calculations. In continuation of the previous point, the backoff weight should have a much more complex structure. Not only trigrams containing the words need to be taken into account; a semantic class should be counted, and trigrams with a similar class of words ought to be considered. Today I even had the idea of applying machine learning to calculate the backoffs. I'm sure someone did this before; I also need to look at articles about using machine learning methods to restrict search.
As for tree search, luckily it will stay as is; there is nothing to argue against it right now. I'm not sure that such modifications break the initial theory; one could say they aren't really different. I still think they could explain speech better and help to build a better speech recognizer.
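
One very rough way to picture the "counts plus a prohibitive margin" idea from the list above is a check that keeps raw n-gram counts and rejects a hypothesis when too many of its word pairs were never seen. This is purely a hypothetical sketch: the tiny training corpus, the function names and the 0.5 margin are all invented here, and it is not a method taken from any article or from the discussion mentioned above.

```python
from collections import Counter


def ngrams(words, n):
    """All consecutive n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def train_counts(sentences, max_order=3):
    """Collect raw counts of all n-grams up to max_order."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_order + 1):
            counts.update(ngrams(words, n))
    return counts


def unseen_fraction(counts, sentence, n=2):
    """Fraction of the sentence's n-grams that never occurred in training."""
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if counts[g] == 0) / len(grams)


def is_rejected(counts, sentence, margin=0.5):
    """Hypothetical prohibitive rule: too many unseen word pairs."""
    return unseen_fraction(counts, sentence) > margin


# Made-up training text in which every word of the bad hypothesis is known,
# but most of its word combinations are not.
counts = train_counts([
    "they're only allowed ten to a class",
    "he had barely enough candy",
    "a classic lead role",
])

for hypothesis in ("THEY'RE ONLY ALLOWED TEN TO a CLASS",
                   "HAD BARELY LEAD CANDY a CLASSIC"):
    print(hypothesis, unseen_fraction(counts, hypothesis),
          is_rejected(counts, hypothesis))
```

With real data the margin would have to be learned rather than fixed by hand, which is where the machine learning point from the list would come in.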

Friday, December 4, 2009

New CMUSphinx Website Alpha

Most CMU Sphinx websites are outdated. The problems with the one at sourceforge are:
  • Not so modern style
  • No interactivity
  • Loosely organized outdated information
  • Hard to manage/update
  • No CMS/search
Also, there is a general problem with the quality of the available documentation. A lot of it is quite outdated and just confusing.

So I have wanted to build a new website for a long time. This site is supposed to be the central point for all sphinx tools, including pocketsphinx, sphinx4, cmuclmtk and sphinxtrain.

The new website is supposed to be interesting. This site is going to bring more interactivity (sharing, blog posts, voting, comments). It looks a little bit bloggish, but I think that's even better. It will be harder to write more interesting posts, so I invite everyone to participate. I'm sure you have something to say.

So here is the proposed demo version:
http://cmusphinx.sourceforge.net/wordpress

We are in the process of transferring the information to the new website, so I really hope to see it running very soon.

Monday, November 30, 2009

How to create a speech recognition application for your needs

Sometimes people ask why there are no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-caption alignment, language acquisition and so on). The answer obviously is that nobody has written them and made them public. It's often noted, for example by Voxforge, that we lack the database for the acoustic model. I admit Voxforge has its reasons to state that we need a database. But that's only a small part of the problem, not the problem as a whole.

And as always happens, the way the question is stated doesn't allow a constructive answer. To get a constructive answer you need the following question: how do I create a speech recognition application?

To answer this, let me provide an example. Suppose we want to develop a Flash-based dictation website. The dictation application consists of the following parts, which should be created:

  • Website, user accounting, user-dependent information storage 
  • Initial acoustic and language models trained with Voxforge audio and other free sources transmitted through Flash codecs
  • Recognizer setup to convert incoming streams into text. Distributed computation framework for the recognizer
  • Recognizer frontend with noise cancellation and VAD
  • Acoustic model adaptation framework to let user adapt the generic acoustic model to their pronunciation 
  • Language model adaptation framework
  • Transcription control package that will process commands during dictation like error correction ones or punctuation ones.
  • Post-processing package to put punctuation and capitalization, date and acronym post-processing
  • Test framework for dictation with dictation recordings and ability to check dictation effectiveness
Everything above could be done with open source tools; the parts have approximately equal complexity and require minimal specialized knowledge. Performance-wise, this system should be usable for large vocabulary dictation for a wide range of users. The core components are:
  • Red5 streaming server
  • Adobe Flex SDK
  • Sphinx4
  • Sphinxtrain
  • Language model toolkit
  • Voxforge acoustic database
So you see, mostly it's just an implementation of existing algorithms and technologies. No rocket science. This makes me think that such an application is just a matter of time.
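
As a small illustration of how simple some of the listed parts can be, here is a sketch of the punctuation and capitalization step working on recognizer output. It assumes a hypothetical command vocabulary where the user says "comma", "period" or "colon"; the names `SPOKEN_PUNCTUATION` and `format_dictation` are invented for this example and are not part of Sphinx4 or any other tool listed above.

```python
# Hypothetical spoken commands and the marks they stand for.
SPOKEN_PUNCTUATION = {"comma": ",", "period": ".", "colon": ":"}
SENTENCE_END = {"."}


def format_dictation(words):
    """Turn a list of recognized words into punctuated, capitalized text."""
    pieces = []
    capitalize_next = True  # the first word of the text starts a sentence
    for word in words:
        mark = SPOKEN_PUNCTUATION.get(word)
        if mark is not None:
            if pieces:
                pieces[-1] += mark  # attach the mark to the previous word
            capitalize_next = capitalize_next or mark in SENTENCE_END
            continue
        if capitalize_next:
            word = word[0].upper() + word[1:]
        pieces.append(word)
        capitalize_next = False
    return " ".join(pieces)


recognized = "hello comma this is a test period it works period".split()
print(format_dictation(recognized))
```

A real transcription control package would also have to tell a spoken command from the ordinary word "period", which is exactly the kind of work that needs the test framework from the list.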

Friday, November 27, 2009

Multiview Representations On Interspeech

From my experience, it's important to have a multilevel view of any activity; interestingly, it's both part of Getting Things Done and just a good practice in software development. Multiple models of the process or just different views help to understand what's going on. The only problem is to make those views consistent. That reminds me of the Russian model of the world.

So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here are some topics from the plenary talks at Interspeech. Surprisingly, they are rather consistent with each other, and I hope they really present trends, not just selected topics.

Speech To Information
by Mari Ostendorf

Multilevel representation gets more and more important, in particular in speech recognition. The most complicated task, recognition of spontaneous meeting recordings, requires unification of the recognition efforts on all levels from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information", since as a result of speech recognition not just the words are recovered but even the syntactic and semantic structure of the talk. One of the interesting tasks is, for example, restoration of punctuation and capitalization, something that SRILM does.

The good thing is that a test database for such material is already available for free download. It's very uncommon to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.

Single Method
by Sadaoki Furui

The WFST-based T3 decoder looks quite impressive. A single method of data representation is used everywhere, which, more importantly, allows combination of the models and gives wonderful opportunities. For example, consider building a high-quality Icelandic ASR system by combining the WFST of an English one with a very basic Icelandic one. I imagine the decoder is really simple, since basically all structures including G2P rules, language and acoustic models could be weighted finite-state automata.
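
To show why a single representation makes combining models easy, here is a toy sketch of WFST composition followed by a shortest-path search. It is only an illustration under strong assumptions: the transducers are epsilon-free, weights are costs that simply add, the dictionary layout and all numbers are made up, and none of this reflects how the T3 decoder is actually implemented.

```python
import heapq

FINAL = object()  # sentinel used by shortest_path to mark acceptance


def compose(a, b):
    """Compose two epsilon-free transducers; arc costs are added."""
    start = (a["start"], b["start"])
    arcs, finals = {}, {}
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for in_a, out_a, w_a, next_a in a["arcs"].get(qa, []):
            for in_b, out_b, w_b, next_b in b["arcs"].get(qb, []):
                if out_a != in_b:
                    continue  # output of a must match input of b
                target = (next_a, next_b)
                arcs.setdefault(state, []).append(
                    (in_a, out_b, w_a + w_b, target))
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
    return {"start": start, "finals": finals, "arcs": arcs}


def shortest_path(t):
    """Return (cost, output labels) of the cheapest accepting path, or None."""
    counter = 0  # unique tie-breaker so the heap never compares states
    heap = [(0.0, counter, t["start"], ())]
    done = set()
    while heap:
        cost, _, state, output = heapq.heappop(heap)
        if state is FINAL:
            return cost, list(output)
        if state in done:
            continue
        done.add(state)
        if state in t["finals"]:
            counter += 1
            heapq.heappush(
                heap, (cost + t["finals"][state], counter, FINAL, output))
        for _, out, w, nxt in t["arcs"].get(state, []):
            counter += 1
            heapq.heappush(heap, (cost + w, counter, nxt, output + (out,)))
    return None


def acceptor(start, finals, arcs):
    """Build a transducer whose input and output labels are identical."""
    return {"start": start, "finals": finals,
            "arcs": {q: [(label, label, w, nxt) for label, w, nxt in out]
                     for q, out in arcs.items()}}


# Made-up acoustic costs: "lead candy" is the cheapest path on its own.
lattice = acceptor(0, {2: 0.0}, {0: [("ten", 1.0, 1), ("lead", 0.8, 1)],
                                 1: [("to", 1.1, 2), ("candy", 0.9, 2)]})
# Made-up language model costs that favour "ten to".
grammar = acceptor(0, {3: 0.0}, {0: [("ten", 0.5, 1), ("lead", 1.0, 2)],
                                 1: [("to", 0.3, 3), ("candy", 3.0, 3)],
                                 2: [("to", 2.0, 3), ("candy", 3.0, 3)]})

best_cost, best_words = shortest_path(compose(lattice, grammar))
print(best_words, round(best_cost, 2))
```

The same `compose` call would work for any pair of models stored in this form, which is the attraction of the approach; a real toolkit additionally handles epsilon arcs, determinization and minimization.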

Bayesian Learning
by Tom Griffiths

Hierarchical Bayesian learning and things like compressed sensing seem to be hot topics in machine learning. Google does that. There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed the audio to the recognizer and make it understand you.

Though the probabilistic point of view was always questionable as opposed to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. Although a huge amount of audio is required (I remember there were estimates of about 100000 hours), I think it's feasible nowadays. For example, it already recognizes written digits, so success looks really close. And again, it's also multilevel!

About Me

nshmyrev
Moscow, Russia
Nowadays I mostly work on open source projects in speech recognition and synthesis like Festival, CMU Sphinx and Voxforge. I also support the Russian parts of those projects, providing the leading product in ASR and TTS in Russian. In the past I used to participate in GNOME, work on embedded Linux devices and on software development technologies related to automatic software verification and modelling. If you have any questions, feel free to contact me by mail at nshmyrev at nexiwave dot com or find me on jabber/irc.