nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies - recognition, synthesis, identification. Mostly it's about the scientific part of it - the core design of the engines, new methods, machine learning - and about the technical part, like the architecture of the recognizer and the design decisions behind it.

Do you trust speech transcription in the cloud

IWSLT 2015

IWSLT 2015 proceedings recently appeared. This is an important evaluation campaign in ASR focused on the translation (and, more interestingly for us, transcription) of TED talks.

The best system, from MITLL-AFRL, achieved a nice WER of 6.6%.

It is interesting that most of the winning systems (the same was true of the Cambridge system in the MGB challenge) used combinations of customized HTK + Torch and Kaldi. Kaldi alone does not give the best performance (11.4%); plain customized HTK is usually better, with a WER of 10.0% (see Table 8). And the combination usually gives a ground-breaking result.

There is something interesting here.


Harmonic Noise Model in Speech Recognition


Recently I came across a nice demo of generating natural sounds from physical models. This is a really exciting topic, because while Hollywood can now draw almost anything, Star Wars included, sound generation remains a pretty limited and unexplored area. For example, really high-quality speech still cannot be created by computers, no matter how powerful they are. This leads to the question of speech signal representation.

Accurate speech signal representations have made a big difference in several areas of speech processing, such as TTS, voice conversion and voice coding. The core idea is very simple and straightforward but also powerful: acoustic signals are produced either by harmonic oscillation, in which case the signal has structure, or by turbulence and cavitation, in which case we see something like white noise. In speech these classes are represented by vowels and sibilant consonants; everything else is a mixture of the two, with some degree of turbulence and some degree of structure. However, this is not really speech-specific; all other real-world signals except artificial ones can be analyzed from this point of view.
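To make the idea concrete, here is a minimal sketch of a harmonics-plus-noise split for a single voiced frame: sinusoids at multiples of an assumed fundamental frequency are fitted by least squares and the residual is treated as noise. The function and the fixed f0 are just illustrative assumptions, not code from any of the systems mentioned below.

import numpy as np

def harmonic_noise_split(frame, f0, sr, n_harmonics=20):
    # Fit sinusoids at multiples of f0 to the frame by least squares; the fit
    # is the harmonic part, the residual is treated as the noise part.
    n = np.arange(len(frame))
    basis = []
    for k in range(1, n_harmonics + 1):
        w = 2 * np.pi * k * f0 / sr
        basis.append(np.cos(w * n))
        basis.append(np.sin(w * n))
    A = np.stack(basis, axis=1)
    coef, *_ = np.linalg.lstsq(A, frame, rcond=None)
    harmonic = A @ coef
    return harmonic, frame - harmonic

# Toy check: a 120 Hz "vowel" buried in white noise.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(len(t))
harmonic, noise = harmonic_noise_split(frame, f0=120.0, sr=sr)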

Such a representation greatly improved voice compression in the class of MELP codecs (mixed excitation linear prediction). Basically, we represent the speech as noise plus harmonics and compress them separately. That made it possible to compress the speech signal to an unbelievable 600 bit/s. Mixed excitation was also very important in text-to-speech synthesis, where it really made a big difference, as was shown quite some time ago in Mixed excitation for HMM-based speech synthesis by Takayoshi Yoshimura et al., 2001.

Unfortunately there is very little published research on mixed excitation models for speech recognition. I only found one paper, A harmonic-model-based front end for robust speech recognition by Michael L. Seltzer, which does consider a harmonic-plus-noise model but focuses on robust speech recognition rather than the advantages of the model itself. However, I believe such a model can be quite important for speech analysis because it allows us to classify speech events with a very high degree of certainty. For example, if you consider the task of creating a TTS system from a voice recording, you will notice that even the best algorithms still confuse sounds a lot, assign incorrect boundaries and select wrong annotations. A more accurate signal representation could help here.

It would be great if readers could share more links on this, thank you!

On SANE 2015 Videos on Signal Separation

Recently a great collection of videos from the Speech and Audio in the Northeast (SANE) 2015 workshop was shared. The main topic of the workshop was sound signal separation, which I consider a very important direction of research for the near future, something that will be critical to solve in order to get human-like performance from speech recognition systems.

We did some experiments with NMF and other methods to robustly recognize overlapped speech before, but my conclusion is that unless training and test conditions are carefully matched the whole system does not really work; anything unknown in the background destroys the recognition result. For that reason I was very interested to check recent progress in the field. The research is at a pretty early stage, but there are some very interesting results for sure.

The talk by Dr. Paris Smaragdis is quite useful for understanding the connection between non-negative matrix factorization and the more recent approach with neural networks; it also demonstrates how a neural network works by selecting principal components from the data.
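For readers who have not played with it, the NMF step itself is tiny; the sketch below is a generic illustration with the classic multiplicative updates, not code from the talk.

import numpy as np

def nmf(V, n_components=8, n_iter=200, eps=1e-9):
    # Multiplicative-update NMF (Lee & Seung) on a magnitude spectrogram V
    # of shape (frequencies, frames); V is approximated by W @ H.
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis spectra
    return W, H

# Columns of W act like learned spectral components; masking the spectrogram
# with the contribution of a subset of them is the usual way to pull out a source.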


One interesting bit from the talk above is the announcement of bitwise neural networks, which are a very fast and effective way to classify inputs. I believe this could be another big advancement in the performance of speech recognition algorithms. The details can be found in the following publication: Bitwise Neural Networks by Minje Kim and Paris Smaragdis. Overall, the idea of bit-compressed computation to reduce memory bandwidth seems very important (the LOUDS language model in Google's mobile recognizer is also from this area). I think NVIDIA should be really concerned about it, since a GPU is certainly not the device this type of algorithm needs. No more need for expensive Teslas.
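As a rough illustration of why such networks are cheap, here is a toy emulation of a single bitwise layer: weights and activations are constrained to +1/-1, so every dot product could be implemented with XNOR and popcount on packed bits. This is only a sketch of the general idea, not the training procedure from the paper.

import numpy as np

def binarize(x):
    # Map real values to the signs {+1, -1}; a real implementation would pack
    # these into machine words and use XNOR + popcount for the dot products.
    return np.where(x >= 0, 1, -1)

def bitwise_layer(x, W):
    # One layer with binary weights and activations: with +/-1 values each dot
    # product is just (matches - mismatches), so it maps directly to bit ops.
    return binarize(binarize(W) @ binarize(x))

# Toy forward pass: 256 binary inputs -> 64 binary outputs.
rng = np.random.default_rng(0)
out = bitwise_layer(rng.standard_normal(256), rng.standard_normal((64, 256)))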

Another interesting talk was by Dr. Tuomas Virtanen, in which a very interesting database and an approach to using neural networks to separate different event types are presented. The results are pretty entertaining.

This video also had quite important bits, one of them being the announcement of the Detection and Classification of Acoustic Scenes and Events Challenge 2016 (DCASE 2016), in which acoustic scene classification will be evaluated. The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "street" or "office". Discussion of the challenge, which starts soon, is already going on in the challenge group; it would be very interesting to participate.
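To give a feeling for the task, a crude baseline-style sketch could average a log-mel spectrogram per clip and feed it to any off-the-shelf classifier, as below; the random clips and labels are placeholders, and this is of course nowhere near a serious DCASE system.

import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def clip_embedding(y, sr=22050, n_mels=40):
    # Average log-mel spectrogram as a single fixed-size feature per recording.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6).mean(axis=1)

# Random clips stand in for real "park" / "street" / "office" recordings.
sr = 22050
clips = [np.random.randn(sr * 3) for _ in range(4)]
labels = ["park", "street", "office", "park"]

X = np.stack([clip_embedding(c, sr) for c in clips])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict(X[:1]))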

Should we listen to our models

I've recently come across an interesting paper worth consideration:

Rethinking Algorithm Design and Development in Speech Processing
by Thilo Stadelmann et al.

This is not mainstream research, but that is exactly what makes it interesting. The main idea of the paper is that to understand and develop speech algorithms we need better tools to assist our intuition. This idea is quite fundamental and definitely has interesting extensions.

Modern tools are limited: most developers only check spectrograms and never visualize distributions, lattices or context dependency trees. N-grams are also rarely visualized. For speech, the paper suggests building tools not just to view our models but also to listen to them. I think this is quite a productive idea.
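As a toy example of listening to a model, one could fit a single diagonal Gaussian to log-mel frames of some speech, sample new frames from it and invert them back to a waveform. The file name below is a placeholder, and a real tool would of course do this per state or per phone.

import numpy as np
import librosa
import soundfile as sf

# Placeholder file name; substitute any speech recording.
y, sr = librosa.load("speech.wav", sr=16000)
logmel = np.log(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80) + 1e-6)

# A single-Gaussian "model" of the frames, then 200 new frames sampled from it.
mean, std = logmel.mean(axis=1), logmel.std(axis=1)
frames = np.random.default_rng(0).normal(mean[:, None], std[:, None], (80, 200))

# Invert the sampled mel spectrogram back to a waveform and listen.
audio = librosa.feature.inverse.mel_to_audio(np.exp(frames), sr=sr)
sf.write("model_sample.wav", audio, sr)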

In modern machine learning, visualization definitely helps to extend our understanding of complex structures; here the terrific Colah's blog comes to mind. It would be interesting to extend this beyond pictures.

A very simple but very important thing for properly modeling the language

If I were a scientific advisor, I would give my student the following problem:

Take a text, take an LM, compute perplexity:

file test.txt: 107247 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.06198e+06 ppl= 158.32 ppl1= 216.345

Join every two lines in the text:
awk 'NR%2{printf "%s ",$0;next}{print;}' test.txt > testjoin.txt

Test again:
file testjoin.txt: 53624 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.05859e+06 ppl= 183.409 ppl1= 215.376

This is a really serious issue for decoding conversational speech: the perplexity rose from 158 to 183, and in real-life cases it gets even worse, with WER degrading accordingly. Utterances very often contain several sentences, and it's really crazy that our models can't handle that properly.
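For reference, the summary lines above match SRILM's ngram -ppl format, so the reported perplexities can be recomputed from the counts; a small check, assuming the SRILM convention of counting one end-of-sentence token per line:

def srilm_ppl(logprob, words, oovs, sentences, zeroprobs=0):
    # ppl counts the end-of-sentence tokens in the denominator, ppl1 does not.
    denom = words - oovs - zeroprobs
    return 10 ** (-logprob / (denom + sentences)), 10 ** (-logprob / denom)

print(srilm_ppl(-4.06198e6, 1.7608e6, 21302, 107247))  # approx (158.3, 216.3)
print(srilm_ppl(-4.05859e6, 1.7608e6, 21302, 53624))   # approx (183.4, 215.4)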

System Combination WER

There is one thing I usually wonder about while reading yet another conference paper on speech recognition. The usual paper limit is 4 pages, and authors usually want to write exactly 4 pages. What should you do if you don't have enough material? Right, you can build exactly the same systems with PLP features and MFCC features (and probably with some other features too), add one more table about system combination WER and probably one more graph, or you can mix two types of LM and report another nice graph.

This practice started a long, long time ago, I think during the NIST evaluations, when participants reported system combination WER. NIST even invented the ROVER algorithm for better combination.
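For readers who have not seen it, the voting half of ROVER is easy to picture. The toy sketch below assumes the hypotheses are already aligned slot by slot, whereas real ROVER builds that alignment as a word transition network and can also weight votes by word confidences.

from collections import Counter

def rover_vote(aligned_hyps):
    # Majority vote over hypotheses that are already aligned slot by slot;
    # an empty string marks a deletion in that system's output.
    result = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:
            result.append(word)
    return result

hyps = [["the", "cat", "sat", ""],
        ["the", "cat", "",    "down"],
        ["a",   "cat", "sat", "down"]]
print(rover_vote(hyps))   # ['the', 'cat', 'sat', 'down']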

For me personally, such content reduces the quality of a paper significantly. System combination WER was never a meaningful addition. Yes, it's well known that if you combine MFCC with PLP you can reduce WER by 0.1%, and probably you will be able to win the competition. From a scientific point of view this result adds zero new information; it is just filler for the rest of your paper. Also, to get a combination result from 5 systems you usually spend 5 times more compute than on an individual result. That is not worth a 0.1% improvement; you can usually get the same with slightly wider beams.

So instead, consider doing something else: try to cover the algorithms you used and explain why they work, try to describe the troubles you've solved, try to add new questions you consider interesting. At least try to collect more references and write a good overview of the previous research. That will save your time, the reader's time and the computing power you would have spent building yet another model.


Mixer 6 database release by LDC & Librivox

LDC has recently announced the availability of a very large speech database for acoustic model training. The database, named Mixer 6, contains an incredible 15000 hours of transcribed speech data from a few hundred speakers. While commercial companies have access to significantly bigger sets, Mixer 6 is the biggest dataset ever used in research. The previously available Fisher database has only around 2000 hours.

It would be really interesting to see the results obtained with this database; the data size alone should improve the performance of existing systems. However, I see that this dataset will pose some critical challenges to the research and development community. Essentially, such a data size means that it will be very hard to train a model using conventional software and accessible hardware. For example, it takes about a week and a decent cluster to train a model on 1000 hours; with 15000 hours you would have to wait several months unless more efficient algorithms are introduced. So it is not easy.

On the other hand, we already have access to a similar amount of data: the Librivox archive contains even more high-quality recordings with text available. Training a model on Librivox data certainly must be a focus of development. Such training is not going to be straightforward either; new algorithms and software must be created. A critical issue is to design an algorithm which improves the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.

By the way, Librivox accepts donations, and they are definitely worth them.