All ideas are already generated

After seeing Flash websites take an enormous amount of my CPU, I got a cool idea today about using Flash for distributed computing. Basically everything is already in place. You set up a web server, serve content with Flash, it runs on the client's computer and does calculations, uploading the results from time to time. Certainly I wasn't the first to invent that, see for example


Though such ideas are rather recent, and the question is how to make this framework widely used. Looking at the current load of my computer on SourceForge, it's most likely already used by some websites :)

Training process

What I really like about SphinxTrain is that it provides a straightforward way to train an acoustic model. It remains unclear to me why everyone bothers with the HTKBook while there is a clean and easy way to train a model. One just has to define the dictionary and the transcription and put the files in the proper folders. Anyway, I'm continuously thinking about ways the SphinxTrain process could be improved. It currently lacks a lot of critical information on training, and that makes it look incomplete.

Basically, here is what I would like to put into the next versions of SphinxTrain and the SphinxTrain tutorial:

  1. Description of how to prepare the data
  2. Building the database transcription. By the way, what has bothered me for the last month is the requirement to have fileids. I really think the fileids file could be silently dropped. What's the problem with getting the id of each file from the transcription labels?
  3. Automatic splitting into training data, testing data and development data. I see the presence of development data as a hard requirement for the training process. Unfortunately, the current documentation lacks it. There could be code to do that, though for most databases it's automatic of course.
  4. Bootstrapping from hand-labelled data. I consider this an important part of training; HTK results confirm that. In general it mirrors human language learning, so I think it's natural as well.
  5. Training
  6. Optimizing the number of senones and mixtures on a development set
  7. Optimizing the most important parameters, like the language weight, on the development set. This part is complicated as I see it. First of all, the reasoning behind proper language weight scaling is still unclear to me; I could one day write a separate post on it. Basically it depends on everything, even on the decoder.
  8. Testing on the test set
 If it's possible to keep this as straightforward as it is now, that would be just perfect. Probably if I start to write the chapter in a week, it could be ready by summer.
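To illustrate points 2 and 3 above, here is a minimal sketch of how fileids could be derived from the transcription alone and how a corpus could be split into train/dev/test sets. It assumes the usual SphinxTrain transcription format, where each line ends with the utterance id in parentheses; the function names are mine, not part of SphinxTrain:

```python
import random
import re

def fileids_from_transcription(transcription_path):
    """Extract utterance ids from a SphinxTrain-style transcription,
    where each line looks like: <s> hello world </s> (spk1/utt001)"""
    ids = []
    with open(transcription_path) as f:
        for line in f:
            m = re.search(r'\(([^()]+)\)\s*$', line)
            if m:
                ids.append(m.group(1))
    return ids

def split_ids(ids, dev_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split utterance ids into train/dev/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_frac)
    n_test = int(len(ids) * test_frac)
    return (ids[n_dev + n_test:],      # train
            ids[:n_dev],               # dev
            ids[n_dev:n_dev + n_test]) # test
```

With something like this in the trainer, the separate fileids file really could be dropped, and the dev set would always exist by construction.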

Moving Beyond the `Beads-On-A-String'

Recently I've got interested in quite a large domain of speech recognition research where old-school linguistics meets modern speech recognition. Basically, the idea is that in spontaneous speech the variability is so huge that the phonetic transcription from the dictionary doesn't apply well. In a plain CMUSphinx setup, linguistic information about phones is almost lost; we don't care whether a phone is labial or dental. It is used in decision tree building, but it's not clear whether that usage helps. It's definitely not good to drop such a huge amount of information that could help with classification. So this idea is actively developed, and you can find there probably everything you miss: distinctive phone features, landmarks, spectrogram recognition.

I went through the following articles; the number of methods, approaches and implementations described there is really huge, and in other articles it's even bigger:

S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007. PDF
Moving Beyond the `Beads-On-A-String' Model of Speech by M. Ostendorf PDF

Speaking In Shorthand - A Syllable-Centric Perspective For Understanding Pronunciation Variation by Steven Greenberg PDF

To be honest, the only idea from the articles that has grown in my mind is that reductions in fast speech are the root of the problem. I also noticed it in the early days and was experimenting with skip states. Skips didn't give any improvement, just reduced speed. It will probably help to automatically increase lexicon variability and use forced alignment to get proper pronunciations, at least at the training stage. As I understand it, I just need to take a dictionary with syllabification and create a dictionary with a lot of reduced variants, where onsets are kept as is and codas are reduced in some form. Then we force-align, then train. The acoustic model will probably be better then.
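The variant-generation step could be sketched like this. This is only a toy illustration of the "keep onsets, reduce codas" idea, assuming each pronunciation is already syllabified into (onset, nucleus, coda) phone lists; real reduction rules would of course come from phonology or be learned from data:

```python
from itertools import product

def reduced_variants(syllables):
    """Given a pronunciation as a list of (onset, nucleus, coda)
    phone lists, generate variants that keep onsets and nuclei
    intact but optionally drop each coda."""
    options = []
    for onset, nucleus, coda in syllables:
        full = onset + nucleus + coda
        # two options per syllable: full form, or coda dropped
        options.append([full] if not coda else [full, onset + nucleus])
    variants = set()
    for choice in product(*options):
        variants.add(tuple(p for syl in choice for p in syl))
    return sorted(variants)
```

For a word like "basket", syllabified as B-AE-S . K-AH-T, this yields four variants, from the full `B AE S K AH T` down to `B AE K AH`; forced alignment would then pick whichever variant was actually spoken in each training utterance.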

Another striking point was that I haven't found any significant accuracy improvement in the articles I read. An improvement like the 20% from discriminative training could make any method widely adopted, but nothing like that is mentioned. Probably this research is in a very early state.

Three Generations of IVR Systems

Recently I invented a nice new concept for marketing people. Basically, there are three generations of IVR systems right now:
  • Generation 1.0 - Static systems based on VoiceXML. It was surprising to me that they are in wide use now, and a lot of products are dedicated to their optimization/development. There are IDEs, a lot of testing tools, recommendations on how to build proper VoiceXML. Come on, it's impossible to do that. It's something like the static HTML websites that were popular in 1995. I don't believe changes like JavaScript inside VXML 3.0 will stop its slow death.
  • Generation 2.0 - Dynamic systems like Tropo from Voxeo. Much easier, much better. More control over content, more integration with the business logic. I really believe it's the next generation because it gives the developer much more control over the dialog. At least with the power of a real scripting language like Python you'll be able to implement something non-trivial in just several lines of code. It's the AJAX or RoR of the speech world.
  • Generation 3.0 - Semantic-based IVR. This consists of three components: a large-vocabulary recognizer, a semantic recognizer on top of it, and event-based actions on top of that. Probably also emotion recognition and more intelligent dialog tracking. As I see it, the developer has to define the structure of the dialog and provide handlers. Such a system was described and developed at CMU a long time ago, and it's described in all ASR textbooks. But I'm not aware of any widely known platform allowing this kind of IVR. Once again it shows how big the gap is between academia and software developers.
If you are planning to create an IVR application with CMUSphinx, please consider IVR generation 3 as your base technology ;) And don't forget to share the code.
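A generation-3 system, as described above, could be wired together roughly like this. Everything here is made up for illustration; there is no real IVR platform with this API, and the keyword-spotting "semantic recognizer" is just a stand-in for a proper one sitting on top of a large-vocabulary decoder:

```python
from typing import Callable, Dict

class SemanticIVR:
    """Toy sketch: recognizer output goes to a semantic classifier,
    which dispatches to developer-defined intent handlers."""

    def __init__(self):
        self.handlers: Dict[str, Callable[[str], str]] = {}

    def on(self, intent: str):
        """Register a handler for a semantic intent."""
        def decorator(fn):
            self.handlers[intent] = fn
            return fn
        return decorator

    def classify(self, utterance: str) -> str:
        # Stand-in for a real semantic recognizer: keyword spotting.
        if "balance" in utterance:
            return "check_balance"
        if "operator" in utterance or "human" in utterance:
            return "transfer"
        return "unknown"

    def handle(self, utterance: str) -> str:
        intent = self.classify(utterance)
        handler = self.handlers.get(intent, lambda u: "Sorry, could you rephrase?")
        return handler(utterance)

ivr = SemanticIVR()

@ivr.on("check_balance")
def check_balance(utterance):
    return "Your balance is ten dollars."

@ivr.on("transfer")
def transfer(utterance):
    return "Connecting you to an operator."
```

The point is the division of labor: the platform owns recognition and intent classification, while the developer only defines the dialog structure and the handlers.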


Very much on the same topic, from the wonderful Nu Echo blog:

PLP is going to be the default soon

It looks like MFCC features are going to become history. Everyone is using 9 stacked PLP frames with a later LDA projection down to 40-50 values. A few examples: Google in its audio indexing system, IBM and BBN (see the system descriptions in their results), OGI/ICSI and many others.
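The frame stacking and projection are simple to sketch. Here is a minimal NumPy version, assuming 13-dimensional PLP frames; note the projection matrix below is a random placeholder, whereas in a real system it comes from LDA (or LDA+MLLT) trained on state-labelled data:

```python
import numpy as np

def stack_frames(features, context=4):
    """Stack each frame with `context` neighbors on each side
    (9 frames total for context=4). Edges are padded by repetition."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    n, _ = features.shape
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

plp = np.random.randn(100, 13)       # 100 frames of 13-dim PLP
stacked = stack_frames(plp)          # shape (100, 117)
lda = np.random.randn(117, 40)       # placeholder for a trained LDA matrix
projected = stacked @ lda            # shape (100, 40)
```

The 117-dimensional stacked vectors capture local temporal context, and the supervised projection then keeps the 40 or so most discriminative directions.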

The issue right now is that the sphinx4 PLP implementation seems to be broken; it produces garbage features which don't give enough accuracy after training. Luckily there is HTK. Once this issue gets fixed, I think I'll retrain a PLP + MLLT model for Voxforge. Unfortunately I don't have any definite plan for implementing PLP in sphinxbase.

Greetings and Random Thoughts

So 2010 is here, Happy New Year everyone. I wish you all success and happiness and, of course, increased decoder accuracy! Now we have a long 10-day vacation in Russia: time to travel, eat, drink, sort out bookmarks, read the books on the shelf and watch pending Google tech talks. Santa also promised me some great changes in sphinx4; I'm waiting for that as well.

Though Ohloh doesn't confirm it, I have a strong feeling that activity around CMUSphinx definitely increased last year, and its usage is going to grow.

I was thinking a little about what the direction of sphinx4 development should be, and I think we should consider several factors here. I would be happy to see it as a widely used, enterprise-level speech recognition engine with a great list of features, but I completely understand that, due to the lack of resources, it's naive to think we'll be able to do it all. We definitely need to find a market sector for the Sphinx project and grow in it. There are already well-established projects like HTK that are widely used, with their own sets of strengths and weaknesses. Julius is widely used as a large-vocabulary speech recognition engine with HTK models. It's hard for us to compete with HTK just because it would take years to add flexibility we probably don't even need. Consider a variable or adjustable number of states per phone: something that has only proven useful for small-vocabulary tasks, which we aren't really interested in, and I hope will not be interested in in the near future. What could be different is our practical orientation.

Many projects in the speech domain and related areas have grown from research projects and, though sometimes flexible, are often really unusable in applications since they aren't designed for that. Usually a research project isn't well documented, has a lot of ways to implement the same thing, and some of them are obsolete. Bugs are rarely fixed and documentation is almost missing. Releases are not stable. It's definitely a large field for a commercial support company.

There is the opposite side: many projects are created to solve user needs, are more or less well documented, and have stable interfaces and a large open community, but are doing things so wrong internally that I always wonder how they are used at all. Espeak, with its amazingly bad speech synthesis quality and even more amazing popularity; its out-of-date synthesis method can't be made good by any possible modification. Another, strikingly, is Lucene. Whatever the lucidimagination blog states about the Lucene community thriving, it's definitely not true. Research articles like "Lucene and Juru at TREC 2007: 1-Million Queries Track" definitely show there is something wrong with Lucene. Basically, that one lists several trivial changes, well known in the research community, that make Lucene perform two times better on a standard test. I can't understand why this wasn't integrated into stock Lucene in the three years since the article was published.

Let's hope CMUSphinx will find its place somewhere in the middle. Also, let's hope this year will bring more useful posts, decreasing the information overload that is certainly going to be a problem in the near future.