Core Ideas Behind Speech Recognition

While tuning the acoustic model I got 40% WER again, with the following in the log:

***     HAD  BARELY  LEAD CANDY a   CLASSIC  (a100)
Words: 7 Correct: 1 Errors: 6 Percent correct = 14.29% Error = 85.71% Accuracy = 14.29

If you check this recognition error, you'll find that it's almost impossible to track down its cause and fix it. Perhaps some senone was trained incorrectly, perhaps CMN failed, or clipping corrupted the MFCCs. Perhaps some noise in the middle broke the search. There is nothing you can do about it. That made me think about the foundations of ASR.
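For context, the accuracy numbers in such logs come from aligning the hypothesis against the reference transcript with word-level edit distance. Here is a minimal sketch in Python; the example sentences are made up for illustration, and this is not sphinx4's actual scoring code.

```python
def word_errors(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    turning the reference word list into the hypothesis (Levenshtein)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)]

# Hypothetical reference/hypothesis pair:
ref = "the cat sat on the mat".split()
hyp = "the cat sat mat".split()
errors = word_errors(ref, hyp)  # 2 deletions
print("Error = %.2f%%" % (100.0 * errors / len(ref)))  # 33.33%
```

The word error rate is simply this edit distance divided by the reference length, which is how a 7-word utterance with 6 errors yields the 85.71% figure above.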

Looking at a speech recognition engine like sphinx4, one can extract the set of core ideas behind it. The same ideas are usually described in speech recognition textbooks. Basically they are:
  • MFCC feature extraction from periodic frames (or PLP, doesn't matter)
  • HMM classifier for acoustic scoring (with state tying)
  • Trigram word-based language model (higher-order n-grams aren't effective, lower-order ones aren't precise enough)
  • Dynamic search with pruning
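To make the first of these points concrete, here is a minimal numpy sketch of the textbook MFCC recipe: framing, windowing, power spectrum, mel filterbank, log, DCT. It illustrates the standard pipeline rather than sphinx4's actual front end, and the frame sizes are just common defaults.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512, n_filt=26, n_ceps=13):
    """Textbook MFCC: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    # 1. Slice into overlapping frames (25 ms window, 10 ms hop at 16 kHz defaults)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 3. Triangular filters spaced evenly on the mel scale
    hz2mel = lambda hz: 2595 * np.log10(1 + hz / 700.0)
    mel2hz = lambda mel: 700 * (10 ** (mel / 2595.0) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filt + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT-II decorrelates the log filterbank energies into cepstral coefficients
    n = np.arange(n_filt)
    dct_mat = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filt)
    return logmel @ dct_mat.T
```

Note how rigidly periodic this is: every 10 ms frame is treated identically regardless of what the signal is doing, which is exactly what the multiresolution proposal below questions.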
Surely commercial systems include many improvements over this baseline, but the core is still the same. These foundations are certainly reasonable and have been validated in practice over the years. It's hard to argue against them. Newbies often claim something is wrong here, but usually that's because they don't really understand how it works. Criticism comes from old-school linguists, who do everything with rules and are more interested in particular cases like the pronunciation of "schedule" than in theory.

The only issue is that a growing number of unsolvable, unexplainable problems like the accuracy problem above breaks this theory. That is quite unusual for me as a mathematician, since in mathematics theories rarely become invalid: they transform and grow, but once stated they usually stand forever. In natural sciences like physics it's common; the aether theory and mechanical explanations of gravitation are good examples that come to mind. So there is nothing wrong with reviewing and modifying this ideology of speech recognition in light of recent findings.

Here is what I would put into such a modified theory:
  • Multiresolution feature extraction, from RASTA to fMPE and spikes. The idea is that signals are sparse and nonperiodic; they range from 10 ms to more than 10 seconds, and all of them need to be passed to the classifier.
  • An acoustic classifier without selected states. The idea of a phone is probably natural in slow speech or in teaching, but I have heard so many complaints about it. Dropping it seems promising indeed, since speech is a process, not a sequence of states. Unfortunately I haven't found any articles on this yet. Another promising idea here is margins, which could help with out-of-model sounds.
  • A subword stage. I increasingly think that languages with rich morphology like Turkish are the rule rather than the exception. Being able to recognize a large vocabulary is a core capability of a usable recognizer, and that forces it to operate on subword units. Even an English recognizer could benefit from this.
  • A language model without backoff. I recently had a discussion with David about this and would like to thank him for the idea. Indeed, the model's counts seem to be reasonable statistics to keep and use, but the further calculation of the language weight should be modified completely. Again, there must be a margin to strip combinations that will never appear in the language. This idea of using prohibitive rules has stayed in my mind for a long time. It would also be nice to find recent articles on this. There must be a component that invalidates output like "barely lead candy".
  • Machine learning for backoff calculations. Continuing the previous point, the backoff weight should have a much more complex structure. Not only should trigrams containing the words be taken into account; a semantic class should be counted, and trigrams with a similar class of words ought to be considered. Today I even had the idea of applying machine learning to calculate the backoffs. I'm sure someone has done this before; I also need to look at articles on using machine learning to restrict the search.
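As a toy sketch of the count-plus-prohibitive-margin idea above: keep raw trigram counts, and instead of backing off for unseen combinations, declare them impossible. The class and parameter names here are made up for illustration, and a real model would of course need a less brutal treatment of rare events.

```python
from collections import defaultdict
import math

class CountLM:
    """Toy trigram model that keeps raw counts and applies a hard margin:
    trigrams seen fewer than `margin` times score -inf (prohibited)
    instead of receiving a smoothed backoff weight."""
    def __init__(self, margin=1):
        self.tri = defaultdict(int)
        self.bi = defaultdict(int)
        self.margin = margin

    def train(self, sentences):
        for s in sentences:
            words = ["<s>", "<s>"] + s.split() + ["</s>"]
            for a, b, c in zip(words, words[1:], words[2:]):
                self.tri[(a, b, c)] += 1
                self.bi[(a, b)] += 1

    def score(self, a, b, c):
        n = self.tri[(a, b, c)]
        if n < self.margin:
            return float("-inf")  # prohibitive rule: never allowed in search
        return math.log(n / self.bi[(a, b)])

lm = CountLM(margin=1)
lm.train(["the cat sat", "the cat ran"])
lm.score("the", "cat", "sat")        # log(1/2), seen once out of two contexts
lm.score("barely", "lead", "candy")  # -inf: the nonsense hypothesis is invalidated
```

The -inf score is exactly the invalidating component discussed above: a hypothesis like "barely lead candy" is pruned outright rather than merely penalized by a backoff weight.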
As for the tree search, it will luckily stay as is; there is nothing to argue against it right now. I'm not sure these modifications break the initial theory, and one could say they aren't really that different. I still think they could explain speech better and help build a better speech recognizer.
