Finally Sorted Out Workshop Materials

Since they are more like official CMUSphinx documents, I posted the notes about the workshop and the meetings after it on the website:

Sphinx Users And Developers Workshop 2010 Results

Development Meeting Notes

I'm pleased that so many important things got planned and a few very important issues were cleared up. For example, until then I didn't completely understand why lattices in sphinx4 are so bad. I hope the other participants had productive results too.

It would be nice to have some pictures from the workshop, but unfortunately none are available right now. Probably sometime later. So, just one of Dallas.


So, I'm back from ICASSP in Dallas, TX. It was a very impressive conference with lots of interesting and inspiring presentations, meetings and discussions. Amazingly, everyone was there, and I finally met all the speech people who have guided me for such a long time. I met the ASR people Bhiksha Raj, David Huggins-Daines, Rita Singh and Richard Stern, and the TTS people Alan W. Black, Keiichi Tokuda, Simon King and Heiga Zen. I was pleased to meet wonderful guys like Evandro and Peter for the second time. It's worth mentioning that I was able to listen to talks by famous people like Hynek Hermansky.

We had the Sphinx Users and Developers Workshop there and also two CMU Sphinx development planning meetings, but they are a subject for another post. This one is just about interesting ideas presented at the conference by other people. I didn't have time to attend every presentation; I think that was impossible. You have to find time for sightseeing, and there were often two or three parallel lecture sessions plus the poster presentations, which I liked most. I think a poster presentation is the best way to reach the author, ask questions and get feedback. Many posters were so popular that it was almost impossible to get to the stand.

Anyway, the number of talks I've collected already exceeds what can be digested in a week. It would be nice to one day have all the information about current research collected into a structured, wiki-like resource. That's a huge job for the future, though.

So here are some presentations and ideas I came across there and found worth attention:

Robust Speaking Rate Estimation Using Broad Phonetic Class Recognition

Authors: Jiahong Yuan, Mark Liberman (University of Pennsylvania)

The work presented is about using a simple classifier to extract specific information about speech, for example to estimate deletions in syllables and thus speech quality. This is actually a very promising approach, which for some reason is ignored in most places where it seems practical. For example, it's not clear to me why speaker identification frameworks don't try to find phonetic classes first and build GMMs only after that. It seems to be a natural way to improve SID performance.
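
To make the idea concrete, here is a minimal sketch of how a speaking rate could be derived from the output of a broad phonetic class recognizer. It assumes the recognizer returns labelled time segments and treats every vowel segment as one syllable nucleus. The labels ("V", "SIL" and so on), the segment format and the function name are hypothetical illustrations, not the setup from the paper.

```python
from typing import List, Tuple

# Hypothetical recognizer output: (broad class label, start time, end time) in seconds.
Segment = Tuple[str, float, float]


def speaking_rate(segments: List[Segment],
                  vowel_label: str = "V",
                  silence_label: str = "SIL") -> float:
    """Approximate syllables per second of speech.

    Assumption: each vowel segment is one syllable nucleus, and the rate
    is measured over non-silent time only.
    """
    speech_time = sum(end - start
                      for label, start, end in segments
                      if label != silence_label)
    if speech_time <= 0:
        return 0.0
    vowels = sum(1 for label, _, _ in segments if label == vowel_label)
    return vowels / speech_time


# Toy example: two vowels in 0.7 seconds of speech.
segments = [
    ("SIL", 0.0, 0.5), ("S", 0.5, 0.6), ("V", 0.6, 0.8),
    ("N", 0.8, 0.9), ("V", 0.9, 1.1), ("S", 1.1, 1.2), ("SIL", 1.2, 1.5),
]
rate = speaking_rate(segments)
```

A recognizer with only a handful of classes is cheap and fairly robust, which is what makes this kind of estimate attractive as a first stage.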

Broad phonetic classes remind me of the idea from the famous face detection algorithm by Viola and Jones of applying a cascade for fast classification. I think this idea could be applied to speech in some form, as the authors suggest.

Clap Your Hands! Calibrating Spectral Subtraction for Dereverberation

Authors: Uwe Zaeh, Korbinian Riedhammer, Tobias Bocklet, Elmar Noeth

Reverberation was a very popular topic at this conference, and it's especially important for meetings. Different speech systems require different kinds of noise cancellation. Far-field microphones need to correct for room reverberation, close-talking microphones need to fix clicks, and so on. Far-field setups sometimes run a calibration step to estimate reverberation. This defines the set of components sphinx4 could have to deal with various environmental conditions. Right now they are simply missing.

Detecting Local Semantic Concepts in Environmental Sounds using Markov Model

Authors: Keansub Lee, Daniel Ellis, Alexander Loui

Interestingly, a classification database for this task is available; it could serve as a base for research on non-speech recognition.

Learning Task-Dependent Speech Variability In Discriminative Acoustic Model Adaptation

Authors: Shoei Sato, Takahiro Oku, Shinichi Homma, Akio Kobayashi, Toru Imai

Discriminative approaches are popular nowadays. Direct optimization of the cost function can be used at various stages of the training process. In this work, for example, the set of subword units is selected to minimize the decoding error rate.

An Improved Consensus-Like Method for Minimum Bayes Risk Decoding and Lattice Combination

Authors: Haihua Xu, Daniel Povey, Lidia Mangu, Jie Zhu

This deals with specific criteria for lattice decoding. The result doesn't have to be just the best path; other criteria like consensus can also be applied. Personally, I would be very interested to formalize and apply a criterion that ensures grammatical correctness of the result. I haven't found anything on this yet.
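
As an illustration of decoding with a criterion other than the best path, here is a minimal sketch of minimum Bayes risk selection. To keep it short it works on an N-best list instead of a lattice, which is a simplification and not the method from the paper. The hypotheses and posteriors are made up, and word-level edit distance is assumed as the loss.

```python
from typing import List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (word_a != word_b))) # substitution
        prev = cur
    return prev[-1]


def mbr_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Return the hypothesis with the lowest expected edit distance
    to all hypotheses, weighted by their normalized posteriors."""
    total = sum(p for _, p in nbest)

    def risk(hyp: List[str]) -> float:
        return sum(p / total * edit_distance(hyp, other) for other, p in nbest)

    return min((hyp for hyp, _ in nbest), key=risk)


# Made-up N-best list: the most probable hypothesis is "a b c".
nbest = [
    ("a b c".split(), 0.4),
    ("a x c".split(), 0.3),
    ("a x d".split(), 0.3),
]
print(mbr_decode(nbest))  # prints ['a', 'x', 'c']
```

In this toy list the single most probable hypothesis is "a b c", but "a x c" has the lowest expected number of word errors (0.7 against 0.9), so the two criteria disagree.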

Discriminative training based on an integrated view of MPE and MMI in margin and error space

Authors: Erik McDermott, Shinji Watanabe, Atsushi Nakamura

It's interesting to find out that real math is going into ASR. Basically it was a long-awaited thing, and it seems it was started by Georg Heigold with his work on MMI and other methods. It would be nice to review this area to get some idea of what the outcome is. Heigold was cited in almost every presentation, so it's really getting popular.

Balancing False Alarms and Hits in Spoken Term Detection

Authors: Carolina Parada, Abhinav Sethy, Bhuvana Ramabhadran

It's interesting to see what tools are used. WFSTs are very convenient and are used by everyone: IBM, Google, AT&T. This is also a topic for a separate post.

Bayesian Analysis of Finite Gaussian Mixtures

Authors: Mark Morelande, Branko Ristic

A rather old idea (there are similar papers from 1998): use Bayesian learning to estimate the number of mixtures in the model. I'm in favor of an approach that estimates all model parameters at once, including the number of mixtures, the language weight and the number of senones.

Improving Speech Recognition by Explicit Modeling of Phone Deletions

Authors: Tom Ko

Modeling pronunciation variation by phone deletion looks very promising, since traditional linguists mostly complain that the sequential HMM model doesn't handle deletions correctly. Unfortunately, the effect seems to be small. The improvement cited is only from 91.5% to 92%.
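
To put those figures in perspective, here is a small calculation of the relative error reduction. It assumes the two numbers are accuracies in percent, so that the error rate is simply 100 minus the accuracy; the function name is just for illustration.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original errors removed, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# 91.5% -> 92% accuracy means the error rate goes from 8.5% to 8%.
reduction = relative_error_reduction(91.5, 92.0)
```

Under that assumption the gain is half a point absolute, or roughly 6% of the original errors.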

An Efficient Beam Pruning With A Reward Considering The Potential To Reach Various Words

Authors: Tsuneo Kato, Kengo Fujita, Nobuyuki Nishizawa

Beam pruning is done according to the number of reachable words or to some other risk function. It's a good idea to implement in sphinx4 to speed up recognition. The speedup factor cited is 1.2 for a large vocabulary.

That's it. Unfortunately I missed Friday and most early mornings, so there could have been something interesting there too. I'm sure you could pick your own set. It would be interesting to look at it.
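
Here is a minimal sketch of what such pruning might look like. It assumes each active token knows how many words are still reachable from its state, and it adds a reward proportional to the logarithm of that count before applying the usual beam. The reward formula, the Token fields and the parameter names are my own assumptions for illustration, not the formula from the paper or anything in sphinx4.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    state: str
    log_score: float       # accumulated log-likelihood of the path
    reachable_words: int   # hypothetical: words still reachable from this state


def prune(tokens: List[Token], beam: float, reward_weight: float = 0.0) -> List[Token]:
    """Keep tokens whose adjusted score is within `beam` of the best one.

    The adjusted score adds reward_weight * log(reachable words), so paths
    that can still lead to many words survive a tighter beam.
    """
    if not tokens:
        return []

    def adjusted(token: Token) -> float:
        return token.log_score + reward_weight * math.log(max(token.reachable_words, 1))

    best = max(adjusted(token) for token in tokens)
    return [token for token in tokens if adjusted(token) >= best - beam]


tokens = [
    Token("a", -10.0, 1),      # good score, leads to a single word
    Token("b", -14.0, 1000),   # worse score, but many words still reachable
    Token("c", -14.0, 2),      # worse score, few words reachable
]
plain = prune(tokens, beam=3.0)                         # keeps only "a"
rewarded = prune(tokens, beam=3.0, reward_weight=0.5)   # keeps "a" and "b"
```

With the reward, the beam can be narrower without losing the token that keeps many words open, which is where a speedup would come from.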

Sphinx4 1.0 beta4 Is Released. What's next?

So, almost according to schedule, sphinx4 was released yesterday. Check the notes at

The most notable improvements were already discussed here, so let me try to plan what the next release will be. Trying to be realistic in my plans, I don't want to promise everything at once. Here is an attempt to forecast the next release notes.

The biggest issue with sphinx4 is actually documentation. The current poll on the CMUSphinx website clearly shows that. Personally, I sometimes think that perfect documentation will not help if the system doesn't work, but at least it will make the product attractive and easy to use. My idea is that we need more developer-level documentation: tutorials, examples, task-oriented howtos. It's unlikely we'll be able to write something good enough to serve as a textbook on speech technologies. But we need to prove the point that it's possible to build an ASR system without knowing who Welch is.

On the code side, we face the biggest challenge since sphinx4 was designed. We need to move to a multipass system. It's not just about rescoring; it's about plugging in the diarization framework from LIUM, and it's also about making sphinx4 suitable for both batch and live applications. That's a serious issue.

The reason is that the sphinx4 architecture is currently flow-oriented. It's built like a single pipe of components, each passing audio to the next. This is good for live applications, but not so good for batch ones. You get into trouble when you need to split the pipe or merge it later. A batch application could benefit hugely from looking at the recording as a whole and returning to it multiple times. For example, you could estimate the noise level properly on the first pass and just clean up the audio on the second. Such multipass decoding doesn't fit well into the pipe paradigm. On the other hand, changing it to purely batch would create issues for live applications.
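
The noise example can be sketched in a few lines. This is a toy in Python and not sphinx4 code: the first pass looks at the whole recording to estimate a noise floor from frame energies, and the second pass goes back over the same samples and silences frames close to that floor. The simple gate stands in for real audio cleanup, and the frame length, percentile and margin are arbitrary assumptions.

```python
from typing import List


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean energy of each full, non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def two_pass_gate(samples: List[float], frame_len: int = 4,
                  percentile: float = 0.2, margin: float = 4.0) -> List[float]:
    # Pass 1: needs the whole recording to pick a low-percentile energy
    # as the noise floor. A purely streaming pipe cannot do this.
    energies = frame_energies(samples, frame_len)
    if not energies:
        return list(samples)
    floor = sorted(energies)[int(percentile * (len(energies) - 1))]

    # Pass 2: revisit the same audio and zero frames near the noise floor.
    cleaned = list(samples)
    for k, energy in enumerate(energies):
        if energy <= margin * floor:
            start = k * frame_len
            cleaned[start:start + frame_len] = [0.0] * frame_len
    return cleaned


quiet = [0.01, -0.01, 0.01, -0.01]
loud = [0.5, -0.5, 0.5, -0.5]
cleaned = two_pass_gate(quiet + quiet + loud + quiet)
```

The point is only that the second pass depends on a statistic of the complete recording, which is exactly what a one-way pipe cannot provide.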

So we are in trouble. We probably have to invent some combined scheme, a hybrid of the pipe and batch approaches. I was thinking about a knowledge base scheme where information about the stream is stored in some database as processing goes on. Database cleanup policies could emulate both the pipe approach (when the database is cleaned immediately) and the batch approach (when the database is kept even across sessions). Festival utterances remind me of such a data processing scheme. Anyway, this idea is not finalized yet.
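
Since the idea is not finalized, the following is only a rough sketch of how a cleanup policy could switch one store between the two behaviours. The StreamStore class and its keep_last parameter are hypothetical names made up for this illustration; nothing like this exists in sphinx4.

```python
from collections import OrderedDict
from typing import Any, Optional


class StreamStore:
    """Hypothetical store for per-frame results, keyed by frame index.

    keep_last=1 drops everything but the newest item (pipe-like behaviour);
    keep_last=None keeps the whole stream for later passes (batch-like).
    """

    def __init__(self, keep_last: Optional[int] = None) -> None:
        self.keep_last = keep_last
        self._items: "OrderedDict[int, Any]" = OrderedDict()

    def put(self, index: int, data: Any) -> None:
        self._items[index] = data
        if self.keep_last is not None:
            # Cleanup policy: evict the oldest entries beyond the limit.
            while len(self._items) > self.keep_last:
                self._items.popitem(last=False)

    def get(self, index: int) -> Optional[Any]:
        return self._items.get(index)

    def __len__(self) -> int:
        return len(self._items)


pipe_store = StreamStore(keep_last=1)   # live application: nothing is retained
batch_store = StreamStore()             # batch application: everything is retained
for i, frame in enumerate(["f0", "f1", "f2"]):
    pipe_store.put(i, frame)
    batch_store.put(i, frame)
```

Components would read from the store instead of from their predecessor in the pipe, so a second pass is just another reader over whatever the policy has kept.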

We also expect to see a lot of movement from the CMUSphinx Workshop in Dallas and from Google Summer of Code participation. I hope the issues described above and some more interesting issues will be resolved by the next release in August. Let's discuss the rest then!