Campaign For Decoding Performance

We have spent some time making the speech recognition backend faster. Ben reports in his blog the results of moving scoring to the GPU with CUDA/jCUDA, which reduced scoring time dramatically. That's an improvement we are happy to apply in our production environment.

We consider the GPU not just a computational speedup but a paradigm shift. Historically, search has been optimized to keep the number of scored tokens small, since scoring cost affected accuracy. Now scoring is nearly immediate, which means other parts of the decoder have to change, and there are a few issues to tackle along the way.

We aim to make it even faster; in particular, we would really like to solve the grow-step problem.
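To illustrate the shift, here is a minimal, hypothetical sketch (not Sphinx4 code; the function names and Gaussian parameters are invented) of the batch pattern that maps well onto a GPU: instead of scoring only the tokens the pruner kept alive, every senone is scored for every frame, and pruning decisions can move elsewhere.

```python
import math

def senone_loglik(frame, mean, var):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian senone."""
    ll = 0.0
    for x, m, v in zip(frame, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll

def score_all_senones(frame, senones):
    """Score every senone for the frame in one batch -- on a GPU this loop
    becomes one kernel launch with one thread (block) per senone."""
    return [senone_loglik(frame, mean, var) for mean, var in senones]
```

On a CPU this exhaustive scoring is what beam pruning tries to avoid; on a GPU it is cheap enough that the search layer, not the scorer, becomes the bottleneck.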

Want To Learn How Sphinx4 Works? Help With The Wiki!

Long ago, when sphinx4 development was active, the team used a TWiki hosted at CMU. Unlike in many open source projects, this wiki was not just a collection of random stuff but a complete project support system. It contained meeting notes, design decisions, and prototyping results. There you could find diagrams and explanations of what grow skipping or skew pruning is.

This wiki died some time ago due to administrative issues, but that turned out for the best, since its content was merged into our current main wiki at SourceForge under the sphinx4 namespace. Unfortunately, the formatting was lost during the transition, since DokuWiki markup isn't always the same as TWiki's. That isn't so bad either, because the content needs to be renewed to fit the current state of sphinx4.

So right now there are about 170 pages on sphinx4. Some of them are useful, some aren't, but they definitely contain deep knowledge of sphinx4 internals, something that will probably help you the next time you optimize the performance of large vocabulary recognition with sphinx4. I'm in the process of slowly sorting them out, but that will take a lot of time. It's your chance to join, help, and learn!

Not All Speaker Adaptations Are Equally Useful

Some time ago I was rather encouraged by VTLN, vocal tract length normalization. Through so-called frequency warping it tries to normalize the vocal tract length across speakers and thus build a better model. This is done by shifting and adjusting the mel filter frequencies. VTLN is implemented in SphinxTrain/PocketSphinx/Sphinx4. Basically, all you need is to enable it in sphinx_train.cfg:

$CFG_VTLN = 'yes';

and run the training. It will extract features for a range of frequency warp parameters with some step (watch your disk space) and will find the best one by forced alignment of each utterance. Then it will create new fileids and transcription files referencing the features with the proper warp parameter.
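For illustration, here is a rough Python sketch of the piecewise-linear style of warp typically applied to mel filter frequencies, together with a candidate sweep of the kind training performs. The breakpoint, sampling range, and step are assumptions for the example, not the exact constants SphinxTrain uses.

```python
def warp_freq(freq, alpha, f_max=8000.0, f_break=0.85):
    """Piecewise-linear VTLN warp: scale frequencies by alpha below a
    breakpoint, then interpolate linearly so that f_max maps to itself."""
    fb = f_break * f_max
    if freq <= fb:
        return alpha * freq
    # linear segment from (fb, alpha * fb) up to (f_max, f_max)
    slope = (f_max - alpha * fb) / (f_max - fb)
    return alpha * fb + slope * (freq - fb)

# sweep candidate warp factors with some step, as training does
candidates = [0.80 + 0.02 * i for i in range(21)]  # 0.80 .. 1.20
```

Each candidate factor produces one full feature extraction pass, which is why the disk-space warning above matters.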

To decode with a VTLN model you need to estimate the warp parameter. Several algorithms have been suggested for that: one analyzes pitch, others employ a GMM for classification. Then you re-extract features with the predicted warp parameter. This gives a visible improvement in accuracy.
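Whatever the estimator, the selection step itself is just an argmax over candidates. A sketch, assuming a hypothetical `loglik(utterance, alpha)` scorer standing in for the pitch- or GMM-based methods mentioned above:

```python
def pick_warp(utterance, candidates, loglik):
    """Pick the warp factor under which the re-extracted features
    score best; loglik is any per-utterance scoring function."""
    return max(candidates, key=lambda alpha: loglik(utterance, alpha))
```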

But recently a set of articles like


came to my attention thanks to antonsrv8. The simple idea is that any transform we apply to features, especially a smooth one, can mostly be replaced by a linear transform of the MFCC coefficients, basically an MLLR transformation. This rather obvious fact makes me wonder whether we really need other transformations if MLLR is generic enough. It's no harder to estimate an MLLR transform than a warp factor, especially if the data is large enough, which is usually the case. Any additional transformation applied will just conflict with MLLR. On large data sets this is confirmed by the experimental results in the articles above.
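To make the comparison concrete: an MLLR mean transform is just an affine map mu' = A*mu + b applied to the Gaussian means, estimated per speaker. A minimal sketch with plain Python lists, purely for illustration:

```python
def mllr_transform_mean(mean, A, b):
    """Apply an MLLR mean transform mu' = A * mu + b to one Gaussian mean."""
    n = len(mean)
    return [sum(A[i][j] * mean[j] for j in range(n)) + b[i] for i in range(n)]
```

A frequency warp bends the filterbank before features are computed; MLLR instead moves the model toward the speaker after the fact, and with enough adaptation data the affine map can absorb much of what the warp would have done.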

Of course, a non-linear transform like VTLN could be better than a linear one, but it seems the winning transform is certainly not VTLN. I hope the latest state of the art in voice conversion can suggest something better.

Update: this point has of course been largely covered in research papers. Good coverage, with math and results, is provided in Luís's thesis:

Speaker Normalisation and Adaptation in Large Vocabulary Speech Recognition by Luís Felipe Uebel

Recognizer Language Voting Is Over

So, the language voting is over. It seems that despite the performance issues we currently face, Java gets enough attention. Thanks for sharing your opinions; they are very important to us.

Speech Decoding Engines Part 1. Juicer, the WFST recognizer

ASR today is quite diverse. While in 1998 there was only the HTK package and some in-house toolkits like CMUSphinx, released in 2000, now there are a dozen very interesting recognizers released to the public under open source licenses. Today we are starting a series of reviews about them.

So, the first one is Juicer, the WFST recognizer from IDIAP, the University of Edinburgh, and the University of Sheffield.

Weighted finite state transducers (WFSTs) are a very popular trend in modern ASR, with famous adopters like Google, the IBM Watson center, and many others. Basically, the idea is that you convert everything into the same format, which allows not just a unified representation but more advanced operations, like merging to build a shared search space or reducing to make that search space smaller (with operations like determinization and minimization). The format also gives you nice properties for free; for example, you don't need to care about g2p anymore, as it's handled automatically by a transducer. For WFSTs themselves, I found a good tutorial by Mohri, Pereira, and Riley, "Speech Recognition with Weighted Finite-State Transducers".
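For a flavor of the formalism: ASR WFSTs typically carry -log probability weights in the tropical semiring, where "plus" picks the better of two paths and "times" accumulates weight along one path. A toy, self-contained sketch of shortest-distance computation over such arcs (my own illustration, not Juicer's implementation):

```python
# Tropical semiring over -log probabilities:
# "plus" = min (choose the better path), "times" = + (extend a path).
INF = float("inf")

def t_plus(a, b):
    return min(a, b)

def t_times(a, b):
    return a + b

def shortest_distance(arcs, n_states, start=0):
    """Bellman-Ford-style shortest distance from the start state.
    arcs: list of (src_state, dst_state, weight) triples."""
    d = [INF] * n_states
    d[start] = 0.0
    for _ in range(n_states - 1):          # enough passes to converge
        for src, dst, w in arcs:
            d[dst] = t_plus(d[dst], t_times(d[src], w))
    return d
```

Viterbi decoding over a composed search graph is essentially this computation with input symbols constraining which arcs may fire at each frame.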

Juicer can do very efficient decoding with the standard set of ASR resources: an ARPA language model (bigram only, due to memory requirements), a dictionary, and cross-word triphone models that can be trained with HTK. The BSD license makes Juicer very attractive. Juicer is part of the AMI project, which targets meeting transcription; other AMI deliverables are a subject for separate posts, though.

Face Recognizers, Bloom filters and Application to Speech Recognition

In the waterfall of scientific papers we have today, I continually face the problem of selecting the important high-level approaches to a problem. Many ideas are definitely valuable and lead to accuracy improvements, but they certainly don't count as core ones, like yet another feature extraction algorithm that could bring you a 2% performance improvement. I definitely miss high-level, up-to-date reviews that could guide one through the world of possible approaches with their advantages and disadvantages. I was counting on books for that, but unfortunately they aren't as accessible as papers.

Some time ago I got into reading the seminal face detection paper by Viola and Jones on Haar cascades for object detection. It struck me that their method, which proved very fruitful in face and object detection, never made it into common practice in speech recognition.

Basically, the idea of their method is that it's possible to reduce the search space significantly with a set of very weak classifiers. For example, you can easily determine that there is no face in a patch of green grass and thus skip that region. This is a rather fruitful idea: you can classify negatives much more accurately than positives. Arranging the classifiers into a cascade makes the search space tiny and recognition fast and efficient. It's certainly not the only algorithm of this type; another one I met recently is the Bloom filter, which uses almost the same approach for efficient hash lookup.
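The cascade idea fits in a few lines of illustrative Python: each stage is a cheap test tuned to be very accurate on negatives, and only inputs surviving every stage reach the expensive full scorer. The stages here are toy predicates, not real acoustic classifiers.

```python
def cascade_accept(x, stages):
    """Reject x as soon as any cheap stage rules it out; later stages
    may be more expensive because few inputs ever reach them."""
    for stage in stages:
        if not stage(x):
            return False
    return True

# toy stages: a coarse cheap check first, a stricter one after it
stages = [lambda x: x > 0,        # reject obvious negatives immediately
          lambda x: x % 2 == 0]   # a slightly more selective check
```

The efficiency comes from ordering: the first stage sees everything, so it must be nearly free, while each later stage sees only the survivors.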

Transferring this into ASR is rather straightforward. We need to train weak classifiers that reject phone hypotheses for a given set of frames. That's actually quite easy with an SVM or something built on top of an existing HMM segmentation. Next, we could also apply this to the language model and reject hypotheses that aren't possible in the language.

I haven't seen any papers on that; probably I need to search more. This idea is certainly worth trying, and it could join common ASR practices like discriminative training, adaptation with linear regression, or multipass search.