nsh - Speech Recognition With CMU Sphinx
Blog about speech technologies - recognition, synthesis, identification. Mostly it's about the scientific part: the core design of the engines, new methods, machine learning; and about the technical part: the architecture of the recognizer and the design decisions behind it.
We did some experiments with NMF and other methods to robustly recognize overlapped speech before, but my conclusion is that unless training and test conditions are carefully matched, the whole system does not really work; anything unknown in the background destroys the recognition result. For that reason I was very interested to check recent progress in the field. The research is at a pretty early stage, but there are certainly very interesting results.
The talk by Dr. Paris Smaragdis is quite useful for understanding the connection between non-negative matrix factorization and the more recent approach with neural networks; it also demonstrates how a neural network works by selecting principal components from the data.
One interesting bit from the talk above is the announcement of bitwise neural networks, which are a very fast and effective way to classify inputs. I believe this could be another big advancement in the performance of speech recognition algorithms. The details can be found in the following publication: Bitwise Neural Networks by Minje Kim and Paris Smaragdis. Overall, the idea of bit-compressed computation to reduce memory bandwidth seems very important (the LOUDS language model in Google's mobile recognizer is also from this area). I think NVIDIA should be really concerned, since the GPU is certainly not the device this type of algorithm needs. No more need for expensive Teslas.
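The core trick behind such networks can be sketched in a few lines: once weights and activations are constrained to {-1, +1} and packed into machine words, a dot product collapses into XNOR plus a popcount. This is a toy illustration of the idea, not code from the Kim and Smaragdis paper; all names here are mine.

```python
def pack_bits(signs):
    """Pack a list of +/-1 values into an integer, one bit per value (+1 -> 1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(x_bits, w_bits, n):
    """Dot product of two packed {-1,+1} vectors via XNOR and popcount."""
    matches = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    pop = bin(matches).count("1")                  # positions where signs agree
    return 2 * pop - n                             # agree -> +1, disagree -> -1

x = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(a * b for a, b in zip(x, w))
fast = binary_dot(pack_bits(x), pack_bits(w), len(x))
assert fast == reference
print(fast)  # 2
```

On real hardware the popcount is a single instruction over 64 weights at a time, which is where the memory-bandwidth and speed advantage over float multiply-accumulate comes from.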
Another interesting talk was by Dr. Tuomas Virtanen, in which a very interesting database and an approach to using neural networks for separation of different event types are presented. The results are pretty entertaining.
This video also had quite important bits, one of them being the announcement of the Detection and Classification of Acoustic Scenes and Events Challenge 2016 (DCASE 2016), in which acoustic scene classification will be evaluated. The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterizes the environment in which it was recorded — for example "park", "street", "office". The discussion of the challenge, which starts soon, is already going on in the challenge group; it would be very interesting to participate.
by Thilo Stadelmann et al
This is not mainstream research, but it is exactly what makes it interesting. The main idea of the paper is that to understand and develop speech algorithms we need to advance our tools to assist our intuition. This idea is quite fundamental and definitely has interesting extensions.
Modern tools are limited: most developers only check spectrograms and never visualize distributions, lattices or context dependency trees. N-grams are also rarely visualized. For speech, the paper suggests building tools not just to view our models, but also to listen to them. I think this is quite a productive idea.
In modern machine learning, visualization definitely helps to extend our understanding of complex structures. Here Colah's terrific blog comes to mind. It would be interesting to extend this beyond pictures.
Take a text, take an LM, compute perplexity:
Join every two lines in text:
This is a really serious issue for decoding of conversational speech: the perplexity rose from 158 to 183, and in real-life cases it gets even worse. WER degrades accordingly. Utterances so often contain several sentences, and it's really crazy that our models can't handle that properly.
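The mechanics of the experiment can be sketched with a toy model. This is not the LM or corpus behind the 158-to-183 numbers above; it is a minimal unigram model with add-one smoothing, just to show that joining every two lines removes sentence boundaries the model expects and pushes perplexity up.

```python
import math
from collections import Counter

def perplexity(lines, lm, vocab_size):
    """Per-token perplexity under a unigram LM with add-one smoothing.
    Each line is an utterance, terminated by an end-of-sentence token."""
    total = sum(lm.values())
    logp, n = 0.0, 0
    for line in lines:
        for tok in line.split() + ["</s>"]:
            p = (lm[tok] + 1) / (total + vocab_size)
            logp += math.log(p)
            n += 1
    return math.exp(-logp / n)

# Train a tiny unigram model, counting sentence ends.
train = ["the cat sat", "the dog ran", "a cat ran"]
lm = Counter()
for line in train:
    lm.update(line.split() + ["</s>"])
V = len(lm) + 1  # +1 slot for unseen words

test = ["the cat ran", "a dog sat"]
joined = [" ".join(test)]  # join every two lines into one utterance

print(perplexity(test, lm, V))    # lower: boundaries match training
print(perplexity(joined, lm, V))  # higher: a frequent </s> token is gone
```

With a real trigram model the effect is stronger, since joining lines also creates cross-sentence n-grams the model has never seen.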
This practice started a long time ago, during NIST evaluations I think, when participants reported system combination WER. NIST even invented the ROVER algorithm for better combination.
For me personally, such content reduces the quality of a paper significantly. The system combination WER was never a meaningful addition. Yes, it's well known that if you combine MFCC with PLP you can reduce WER by 0.1%, and probably you will be able to win the competition. From a scientific point of view this result adds zero new information; it is just filler for the rest of your paper. Also, to get a combination result from 5 systems you usually spend 5 times more compute on the individual results. Not worth it for a 0.1% improvement; you can usually get the same with slightly wider beams.
So instead consider doing something else: try to cover the algorithms you used and explain why they work, try to describe the troubles you've solved, try to add new questions you consider interesting. At least try to collect more references and write a good overview of the previous research. That will save your time, the reader's time and the computing power you used to build another model.
It would be really interesting to see the results obtained with this database; the data size should improve existing system performance. However, I see that this dataset will pose some critical challenges to the research and development community. Essentially, such a data size means that it will be very hard to train a model using conventional software and accessible hardware. For example, it takes about a week and a decent cluster to train a model on 1000 hours; with 15000 hours you have to wait several months unless more efficient algorithms are introduced. So, it is not easy.
On the other hand, we have access to a similar amount of data - the Librivox archive contains far more high-quality recordings with text available. It certainly must be a focus of development to train a model on Librivox data. Such training is not going to be straightforward either - new algorithms and software must be created. A critical issue is to design an algorithm which will improve the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.
By the way, Librivox accepts donations, and they are definitely worth it.
The fundamental paper about PNCC is C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients for Robust Speech Recognition", IEEE Trans. Audio, Speech, Lang. Process., but for a detailed explanation of the process and experiments one can look at C. Kim, "Signal Processing for Robust Speech Recognition Motivated by Auditory Processing", Ph.D. thesis. The code for Octave is available too. The C implementation is also available in the bug tracking system, thanks to Vyacheslav Klimkov, and will be committed soon after some cleanup. I hope a Sphinx4 implementation will follow.
However, quite a lot of important information is not contained in the papers. The main PNCC pipeline is similar to the conventional MFCC one, except for a few modifications. First, a gammatone filterbank is used instead of a triangular filterbank. Second, filterbank energies are filtered to remove noise and reverberation effects. And third, a power-law nonlinearity together with power normalization is applied. Most of the pipeline design is inspired by research on the human auditory system.
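The three modifications above can be sketched in a simplified form. This is not the published PNCC algorithm: the real pipeline uses a gammatone filterbank, asymmetric noise suppression and temporal masking, which are reduced here to illustrative stand-ins (a precomputed energy matrix and a simple running noise floor). Only the power-law exponent 1/15 and the MFCC-like DCT stage follow the papers.

```python
import numpy as np

def dct_ii(x, num_ceps):
    """Type-II DCT over the last axis, keeping num_ceps coefficients."""
    n = x.shape[-1]
    k = np.arange(num_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def pncc_like(filterbank_energies, power=1.0 / 15.0, floor_weight=0.9):
    """Simplified PNCC-style features from (frames, bands) energies."""
    e = np.asarray(filterbank_energies, dtype=float)
    # 1. Track a slowly varying per-band noise floor (stand-in for PNCC's
    #    asymmetric filtering) and subtract it from each frame.
    floor = np.zeros(e.shape[1])
    cleaned = np.empty_like(e)
    for t in range(e.shape[0]):
        floor = floor_weight * floor + (1 - floor_weight) * e[t]
        cleaned[t] = np.maximum(e[t] - floor, 1e-8)
    # 2. Power-law nonlinearity instead of the MFCC log.
    compressed = cleaned ** power
    # 3. DCT to decorrelate, as in MFCC.
    return dct_ii(compressed, num_ceps=13)

# Fake 40-band filterbank energies for 100 frames.
frames = np.abs(np.random.default_rng(0).normal(size=(100, 40))) + 0.1
ceps = pncc_like(frames)
print(ceps.shape)  # (100, 13)
```

The point of the sketch is the structure: everything before the nonlinearity operates on non-negative energies, which is what makes the noise-floor subtraction and the power law composable.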
There is a lot of research on using auditory ideas, including power-law nonlinearities, gammatone filterbanks and so on, in speech recognition, and the PNCC papers do not cover it fully. Important works are the fundamental paper about RASTA and some recent research on auditory-inspired features like "Gammatone Features and Feature Combination" by Schlüter et al.
The PNCC design using auditory features raises quite fundamental questions not discussed in the papers above, though; one very important paper here is "Spectral Signal Processing for ASR" (1999) by Melvyn Hunt from Dragon. The idea from the paper is:
The philosophical case for taking what we know about the human auditory system as an inspiration for the representation used in our automatic recognition systems was set out in the Introduction, and it seems quite strong. Unfortunately, there does not seem to be much solid empirical evidence to support this case. Sophisticated auditory models have not generally been found to be better than conventional representations outside the laboratories in which they were developed, and none has found its way into a major mainstream system. Certainly, there are successful approaches and features that are generally felt to have an auditory motivation—the use of the mel-scale, the cube-root representation, and PLP. However, this paper has sought to show that they have no need of the auditory motivation, and their properties can be better understood in purely signal processing terms, or in some cases in terms of the acoustic properties of the production process. Other successful approaches, such as LDA, made no pretense of having an auditory basis.

This idea is very important because the PNCC paper is a very experimental one and doesn't really cover the theory behind the design of the filterbank. There are good things in the PNCC design and not-so-clear things too. Here are some observations I had:
1. PNCC is a really simple and elegant feature extraction; all the steps can be clearly understood, and that makes PNCC very attractive. The noise robustness properties are really great too.
2. Noise filtering does reduce accuracy in clean conditions; usually this reduction is visible (about 5% relative) but can be justified since we get quite a good improvement in noise. Although there is a claim that PNCC is better than MFCC on clean data, my experiments do not confirm that. The PNCC papers never provide exact numbers, only graphs, which makes it very hard to verify their findings.
3. Band bias subtraction and temporal masking are indeed very reasonable stages to apply in a feature extraction pipeline. Given that the noise is mostly additive with a slowly changing spectrum, it's easy to remove it using long-term integration and an analog of Wiener filtering.
4. The gammatone filterbank doesn't improve significantly over the triangular filterbank, so essentially its complexity is not justified. Moreover, the default PNCC filterbank is suboptimal compared to a well-tuned MFCC filterbank. The filterbank starts only from 200 Hz, so for most broadcast recordings it has to be changed to 100 Hz.
5. The power-law nonlinearity is mathematically not reasonable, since it doesn't transform a channel modification into a simple addition that can later be removed with CMN. The tests were done on a normalized database like WSJ, while every real database will show a reduction in performance due to the complex power-law effects. Overall power normalization with a moving average makes things even worse and reduces the ability to normalize scaled audio at training and decoding stages; for example, for very short utterances it's really hard to estimate the power properly. The power nonlinearity could be compensated with variance normalization, but there is no sign of that in the PNCC papers. So my personal choice is a shifted log nonlinearity, which is log for high energies and has a shift at the low end to deal with noise. Log is probably a bit less accurate in noise, but it is stable and has good scaling properties.
6. For raw MFCC, a lifter has to be applied to the coefficients for best performance, or LDA/MLLT has to be applied to make the features more Gaussian-like. Unfortunately, the PNCC papers don't say anything about liftering or LDA/MLLT. With LDA the results could be quite different from the ones reported.
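The CMN argument in the power-law observation above is easy to verify numerically. A constant channel gain g multiplies spectral energies, so log(g·e) = log g + log e is an additive offset that mean subtraction removes exactly, while (g·e)^p = g^p·e^p is a multiplicative factor that survives CMN and needs variance normalization too. A small numpy check (illustrative data, not a speech corpus):

```python
import numpy as np

rng = np.random.default_rng(1)
energies = rng.uniform(0.1, 10.0, size=1000)  # stand-in spectral energies
gain = 3.7                                    # unknown channel gain

def cmn(x):
    """Cepstral mean normalization: subtract the mean."""
    return x - x.mean()

def cmvn(x):
    """Mean and variance normalization."""
    return (x - x.mean()) / x.std()

# Log nonlinearity: the channel becomes an additive offset, CMN removes it.
print(np.allclose(cmn(np.log(energies)), cmn(np.log(gain * energies))))  # True

# Power law (exponent 1/15 as in PNCC): the channel becomes a scale
# factor g**p, which CMN alone cannot remove.
p = 1.0 / 15.0
print(np.allclose(cmn(energies ** p), cmn((gain * energies) ** p)))  # False

# Variance normalization cancels the multiplicative g**p as well.
print(np.allclose(cmvn(energies ** p), cmvn((gain * energies) ** p)))  # True
```

This is the whole case in three lines: log needs only CMN to be channel-invariant; a power law needs CMVN, which the PNCC papers do not discuss.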
Still, PNCC seems to provide quite good robustness in noise, and I think it will improve performance of the default models. The current plan is to import PNCC into PocketSphinx and Sphinx4 as the default features and train models for them.
Unfortunately, it turned out that it's very hard to build a model which is relatively "generic". Language models are very domain-dependent; it's almost impossible to build a good language model for every possible text. Books are almost useless for conversational transcription, no matter what amount of book text you have. And you need terabytes of data just to reduce the error rate by 1%.
Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more-or-less widely available, so it will be possible to extend and improve the model in the future.
I found it quite interesting to try to solve this problem of domain dependence. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:
Two Decades of Statistical Language Modeling: Where Do We Go From Here? by Ronald Rosenfeld
Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...