Great Move To Nexiwave

We decided to move all our blogs, such as Ben's blog and news about SearchMyMeetings, to Nexiwave. This consolidation will help us manage our resources and improve our presence on the web. Being more officially placed, we will also be more responsible for the content, so I hope you will soon find more useful material here about speech recognition, CMUSphinx and other related things.

Sorry for the inconvenience.

Intelligent Testing In ASR

To continue the previous topic about testing, I want to share a nice paper I read some time ago, whose ideas I want to bring into our daily practice.

The issue is that the current way we test our systems is far from optimal; at least, there is no real theory behind it. In practice I usually apply the 1/10th rule: split the data into a 9/10 training set and a 1/10 test set. This was done in Voxforge as well. That is not so good, since with 70 hours of Voxforge data the test set grows to 7 hours, and it takes ages to decode. I took this rule from festival's traintest script, and it is more or less common practice in ASR, while things like 10-fold cross-validation aren't popular, mostly for computational reasons. Surprisingly, problems like that could be easily solved if only we focused on the real goal of testing: estimating the recognition performance. All our test sets are oversized; one can easily see that by watching the decoder results during testing. They tend to stabilize very quickly unless there is some data inconsistency.
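The 1/10th rule above can be sketched in a few lines of Python. This is just an illustration of the split, not the actual festival or Voxforge script; the function name and utterance IDs are made up.

```python
import random

def train_test_split(utterances, test_fraction=0.1, seed=0):
    """Shuffle the utterance list and hold out roughly
    test_fraction of it for testing (the 1/10th rule)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical utterance IDs, for illustration only.
train, test = train_test_split([f"utt{i:03d}" for i in range(100)])
print(len(train), len(test))  # 90 10
```

The fixed seed matters in practice: without it the test set changes on every run and accuracy numbers stop being comparable.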

Speech recognition practice unfortunately doesn't cover this, even in scientific papers. Help comes from character recognition. The nice paper I found is:

Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir Vapnik.
What Size Test Set Gives Good Error Rate Estimates?

The authors address the problem of determining what test set size guarantees statistically significant results in a character recognition task, as a function of the expected error rate. The paper is well written and actually rather easy to understand. There is no complex model behind the testing, nothing speech-specific. There are two valuable points there:
  1. The approach that puts reasoning behind test process
  2. The formulae themselves
To put it simply:

The test set for a medium-vocabulary task can be small. If the word error rate is expected to be around 10%, the table on page 9 shows that to compare two configurations differing by 0.5% absolute you need only about 13k words of test data.

That's four times smaller than the current Voxforge test set. I think this estimate can be improved even further if we specialize it for speech. I really hope this result will be useful for us and will help us speed up application testing and optimization.

Testing ASR with Voxforge Database

In development and research the critical issue is proper testing. There was some buzz about that recently, for example on the MLoss blog, where the benefits of using open data are discussed. One interesting resource that started some time ago combines both open data and open algorithms, automatically selecting the best method for a common data set. I think this idea is not so easy to implement, because "best" often means different things: sometimes you need speed, sometimes generalization.

In our case, using open data easily solves the following problems:

  1. Test the changes you've made in the speech decoder and trainer on a practical large-vocabulary database
  2. Estimate how the recognition engine performs. It's not just about estimating accuracy, but also about other critical parameters like confidence score quality, decoding speed, lattice variability, noise robustness and so on.
  3. Share the bugs you've found. We can definitely fix minor problems that are easy to reproduce, but any serious problem ultimately requires a reproducible test example.

I actually want to describe how this works in practice. The solution we propose for CMUSphinx developers is the Voxforge database. It's not the only open data source out there, but I think it's the most permissive one. The old an4 database is good for quick tests, but it definitely doesn't satisfy our needs, because anything short of a large-vocabulary recognizer makes little sense nowadays.