Speech Like WWW

Thinking about custom speech application development, I had a thought. There are already quite a few speech companies. Speech application development is actually quite similar to UI design or web design, in the sense that you need to be a specialized expert to create a speech interface. What if speech developers were like web designers - thousands of them building customized websites all over the world every day? What if the market were so huge that it could support hundreds of shops, each working on customer needs?

To be honest, I don't much like web development. It seems very strange to me that you MUST pay at least $1000 to build something that looks pleasant, and for big websites it's way more. Whoever created this market didn't think about business; they designed HTML to drain money from small and big companies alike. I tried to create a few websites myself, for example the CMUSphinx website. Even with all the modern tools, CMS platforms, themes and so on, the reality is that you need to be an expert; otherwise the result will not be satisfactory. Menus will overlap, regions will not be aligned, pictures will be blurry and colors will not match. Can it be different? Certainly it can, but not in this world. I can understand that creativity can't be automated; I can't understand why creativity is required for every company.

There are similar things in software development: for example, if you want to develop a telco app, you probably want to hire an Asterisk developer, and there are thousands of Asterisk developers out there. Or if you want JBoss, you can find JBoss experts. But I think that if you know how to develop applications, you can configure Asterisk properly and you can create a bean to access the database.

Are we interested in the creation of a huge and diverse speech industry? Can CMUSphinx be the foundation for it? No definite answer for now.

Recent issues

Heh, this month I discovered a few critical issues in CMUSphinx.
  1. Pocketsphinx doesn't properly decode short silences in FSG/JSGF mode
  2. Sphinx4 doesn't really work with an OOV loop in a grammar
  3. Pocketsphinx n-best lists are useless because of too many repeated entries
  4. Pocketsphinx accuracy is way lower than sphinx3's
  5. The supposedly-working sphinxbase LM code doesn't work with 32-bit DMP files, thus no MMIE training for very large vocabularies
  6. MMIE itself doesn't improve accuracy (tested on Voxforge and Fisher)
  7. It's impossible to extract mixture_weights from recent sendump files in pocketsphinx models; the Python scripts in SphinxTrain are outdated
  8. PTM model adaptation doesn't work
  9. The TextAligner demo from sphinx4 requires way more work to align properly

That's getting crazy; I wonder if I'll be able to find the time to fix all of that.

Do You Want To Talk To Your Computer?

Thanks to everyone who voted in the poll.

To be honest, I was surprised, because my opinion is just the reverse of the poll result. I strongly disagree that command and control will ever be usable. Dictation probably will be, but definitely not command and control. I have a hobby: collecting complaints about voice control. Here are a few.

As an article in PCWorld says:

If so, you'll love what Microsoft is offering: voice recognition over the air, in which your commands are processed by a server in the clouds and converted into action on your smartphone. Boy, let's burn up those minutes and data plans! And waaait for the slow, usually incorrect response. Android has a similar capability for search, and it's amazingly frustrating to use, not to mention inaccurate.

The one good thing about Microsoft's fantasy about voice-command interfaces: You'll be able to identify a Windows Phone 7 user easily. Just listen for the person pleading with the phone to do what he asked. While the rest of us are quietly computing and communicating, he'll be hard to miss.

Another post, from CNET:

One of the major reasons why speech recognition has not caught on or been seriously looked at in terms of major finances is because people, if given the option of accurate speech recognition too, would still not wanna go for the voice commands, but would rather just use touchscreen. This is because voicing out takes more energy off a person than smoothly running yer fingers on the screen in your hand. Intuitive touchscreens and cleaner interfaces are far better a tool to invest in than making people accustomed to say out words that computers need to understand, process and then implement. Its way easier for the user (in terms of energy used to say it out) to just press the button on touchscreen. There will be certain exceptions, but I'm talking on a mass consumer adoption assumption.

I strongly believe that when we want to communicate with a computer, there are better ways than giving it voice commands. Yes, speech is a natural way of communication, but it's communication between people. When you communicate with a machine you don't necessarily need to speak to it; there are more efficient ways. Even if you are driving.

On the other hand, I think that analytics, speech mining and similar technologies have a very bright future. According to DMG Consulting, this market will grow by 42 percent in 2011. That's real potential. Speech recognition should plug seamlessly into communication between people and extract value from it. Being non-intrusive, it doesn't break patterns but helps to create information. That's why we invest so much in mining and not in command and control. That's also the reason I don't want to invest too much time in gnome-voice-control.

No More Word Error Rate

Reading http://delong.typepad.com/sdj/2010/09/when-speech-recognition-software-attacks.html

What he said:

Hi Brad, it's Mike. I had a lunchtime appointment go long and I am bolting back to Evans. I'll be there shortly. See you soon. Thanks.

What Google Voice heard:

That it's mike. I had a list of women go a long and I am old thing. Back evidence. I'll be there for me to you soon. Thanks.

The interesting thing is that it got 17 out of the 26 words right--but those 17 words convey almost none of the information in the message...
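
To make this concrete, here is a tiny sketch in Python. It is my own illustration, not part of any CMUSphinx tool: a plain Levenshtein-based word error rate over this exact pair of transcripts, plus a naive check of how many message-carrying words survive. The set of "content words" is hand-picked by me purely for illustration.

    def word_error_rate(ref, hyp):
        """Plain WER: (substitutions + deletions + insertions) / reference length."""
        # Levenshtein distance over words, computed with dynamic programming
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return float(d[len(ref)][len(hyp)]) / len(ref)

    reference = ("hi brad it's mike i had a lunchtime appointment go long and i am "
                 "bolting back to evans i'll be there shortly see you soon thanks").split()
    hypothesis = ("that it's mike i had a list of women go a long and i am old thing "
                  "back evidence i'll be there for me to you soon thanks").split()

    # Hand-picked words that actually carry the message (my own arbitrary choice)
    content_words = {"brad", "mike", "lunchtime", "appointment", "evans", "shortly"}

    print("WER: %.0f%%" % (100 * word_error_rate(reference, hypothesis)))
    print("Content words recovered: %d of %d"
          % (len(content_words & set(hypothesis)), len(content_words)))

The hypothesis looks tolerable by raw word accuracy, yet almost none of the words that actually carry the message survive. That mismatch is exactly what word error rate hides.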

I found this paper

Is Word Error Rate a Good Indicator for Spoken Language Understanding Accuracy
Ye-Yi Wang and Alex Acero
2003

It is a conventional wisdom in the speech community that better speech recognition accuracy is a good indicator for better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than word error rate reduction, the language model for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. The model was obtained with an example-based learning algorithm that optimized the understanding accuracy. Although the speech recognition word error rate is 46% higher than the trigram model, the overall slot understanding error can be reduced by as much as 17%.

We definitely need to address this in sphinx4.

Reading Interspeech 2010 Program

Luckily, speech people don't have that many conferences. In the machine learning world it seems to be getting crazy: you can attend a conference every month, and researchers travel more than sales managers. In speech there are ASRU and ICASSP, but they don't really matter; it's enough to track Interspeech. Since Tokyo is too far, I'm just reading the abstract list from the program. First impressions:

  • Keynotes are all boring
  • An interesting rise of the subject "automatic error detection in unit selection". At least three (!) papers are presented on it, while I hadn't seen any before. It looks like the idea appeared in less than a year! Are they spying on each other?
  • RWTH Aachen presented an enormous number of papers; LIUM is also quite prolific
  • Well, IBM T. J. Watson Research Center is active as well, but that's more of a tradition
  • I came across this in one paper: "yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%". Was it worth writing an article?
  • Cognitive status assessment from speech is important in dialogs. SRI is doing that
  • It's strange that reverberation issues are treated as a separate class of problems and covered so extensively. The problem as a whole looks rather generic - create noise- and corruption-stable features. Not sure how reverberation is special here
  • WFST is loudly mentioned
  • Andreas Stolcke noted that SRILM pruning doesn't work with KN-smoothed models! Damn, I was using exactly that (see the sketch after this list)
  • Only 2 Russian papers in the whole conference. Well, that's 50% growth over the previous year. And one of them is on speech recognition; that's definitely progress
  • Surprisingly, not that much research on confidence measures! Confidence is a REALLY IMPORTANT THING
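
For reference, here is roughly the recipe I had been using, sketched in Python on the assumption that SRILM's ngram-count and ngram binaries are on the PATH; the corpus and model file names are just placeholders. It builds a Kneser-Ney smoothed trigram model and then entropy-prunes it, which is exactly the combination the talk warns about.

    import subprocess

    # Train a Kneser-Ney smoothed trigram LM from a text corpus (placeholder names)
    subprocess.check_call([
        "ngram-count", "-order", "3", "-kndiscount", "-interpolate",
        "-text", "corpus.txt", "-lm", "model.lm"])

    # Entropy-based pruning of that model, the step that reportedly hurts KN models
    subprocess.check_call([
        "ngram", "-order", "3", "-lm", "model.lm",
        "-prune", "1e-8", "-write-lm", "model-pruned.lm"])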

Reading the abstracts, I also selected some papers which could be interesting for Nexiwave. You'll probably find that list easier to read than the 200 papers in the original program. Let's hope it will be useful for me as well; to be honest, I didn't manage to read the papers I selected last year from Interspeech 2009.