nsh - Speech Recognition With CMU Sphinx
Blog about speech technologies: recognition, synthesis, identification. It is mostly about the scientific side, such as the core design of the engines, new methods and machine learning, and about the technical side, such as the architecture of the recognizer and the design decisions behind it.
This practice started a long time ago, during the NIST evaluations I think, when participants reported system combination WER. NIST even invented the ROVER algorithm for better combination.
For me personally, such content reduces the quality of a paper significantly. The system combination WER was never a meaningful addition. Yes, it's well known that if you combine MFCC with PLP you can reduce WER by 0.1%, and you will probably be able to win the competition. From a scientific point of view this result adds zero new information; it is just filler for the rest of your paper. Also, to get a combination result of 5 systems you usually spend 5 times more compute than for an individual result. That is not worth a 0.1% improvement, which you can usually get with slightly wider beams.
So consider doing something else instead: try to cover the algorithms you used and explain why they work, try to describe the troubles you've solved, try to add new questions you consider interesting. At the very least, try to collect more references and write a good overview of the previous research. That will save your time, the reader's time and the computing power you would have used to build another model.
It would be really interesting to see the results obtained with this database, since the data size should improve the performance of existing systems. However, I see that this dataset will pose a critical challenge to the research and development community. Essentially, such a data size means that it will be very hard to train a model using conventional software and accessible hardware. For example, it takes about a week and a decent cluster to train a model on 1000 hours; with 15000 hours you will have to wait several months unless more efficient algorithms are introduced. So, it is not easy.
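The "several months" figure follows from simple arithmetic if you assume, as a rough approximation, that training time grows linearly with the amount of audio. The function name and the linear scaling are my assumptions for illustration; real scaling depends on the toolkit and the cluster.

```python
def estimated_training_weeks(hours, ref_hours=1000.0, ref_weeks=1.0):
    """Scale the reference point (1000 hours in about a week) linearly.

    Linear scaling is an assumption, not a measured property of any trainer.
    """
    return ref_weeks * hours / ref_hours


weeks = estimated_training_weeks(15000)
months = weeks * 7 / 30.0  # rough conversion, 30-day months
print(f"{weeks:.0f} weeks, about {months:.1f} months")
```

Under that assumption 15000 hours comes out at 15 weeks, which is where "several months" comes from.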
On the other hand, we have access to a similar amount of data: the Librivox archive contains far more high-quality recordings with text available. Training a model on Librivox data certainly must be a focus of development. Such training is not going to be straightforward either, since new algorithms and software must be created. A critical issue is to design an algorithm which will improve the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.
By the way, Librivox accepts donations and they definitely deserve them.
The fundamental paper about PNCC is C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients for Robust Speech Recognition", IEEE Trans. Audio, Speech, Lang. Process., but for a detailed explanation of the process and experiments one can look in C. Kim, "Signal Processing for Robust Speech Recognition Motivated by Auditory Processing", Ph.D. thesis. The code for Octave is available too. A C implementation is also available in the bug tracking system, thanks to Vyacheslav Klimkov, and will be committed soon after some cleanup. I hope a Sphinx4 implementation will follow.
However, quite a lot of important information is not contained in the papers. The main pipeline of PNCC is similar to that of conventional MFCC except for a few modifications. First, a gammatone filterbank is used instead of a triangular filterbank. Second, filterbank energies are filtered to remove noise and reverberation effects. And third, a power-law nonlinearity together with power normalization is applied. Most of the pipeline design is inspired by research on the human auditory system.
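To make the differences from MFCC concrete, here is a minimal sketch of the second and third modifications applied to filterbank power that has already been computed. It is not the reference PNCC code: the gammatone filterbank is left out, the noise filtering is reduced to plain medium-time averaging, and the function names and parameter values (five-frame window, 0.999 forgetting factor, 1/15 exponent) are assumptions chosen for illustration.

```python
import math


def medium_time_power(frames, m=2):
    """Average each channel over a window of 2*m+1 frames (edges truncated)."""
    n = len(frames)
    channels = len(frames[0])
    out = []
    for t in range(n):
        window = frames[max(0, t - m):min(n, t + m + 1)]
        out.append([sum(f[c] for f in window) / len(window)
                    for c in range(channels)])
    return out


def normalize_power(frames, forgetting=0.999, floor=1e-12):
    """Divide each frame by a running estimate of the mean channel power."""
    mean_power = sum(frames[0]) / len(frames[0])
    out = []
    for frame in frames:
        current = sum(frame) / len(frame)
        mean_power = forgetting * mean_power + (1.0 - forgetting) * current
        scale = max(mean_power, floor)  # avoid division by zero on silence
        out.append([p / scale for p in frame])
    return out


def power_law(frames, exponent=1.0 / 15.0):
    """Power-law compression used in place of the logarithm of MFCC."""
    return [[p ** exponent for p in frame] for frame in frames]


def dct(frame, n_ceps):
    """Unnormalized DCT-II, keeping the first n_ceps coefficients."""
    n = len(frame)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(frame))
            for k in range(n_ceps)]


def pncc_like(filterbank_power, n_ceps=13):
    """Smoothing, power normalization, power law, then DCT."""
    smoothed = medium_time_power(filterbank_power)
    normalized = normalize_power(smoothed)
    compressed = power_law(normalized)
    return [dct(frame, n_ceps) for frame in compressed]


# Hypothetical input: 6 frames of 4-channel filterbank power.
features = pncc_like([[1.0, 2.0, 3.0, 4.0]] * 6, n_ceps=3)
print(len(features), len(features[0]))
```

An MFCC pipeline would have the same overall shape, with a triangular filterbank in front, no smoothing or power normalization, and a logarithm where `power_law` sits.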
There is a lot of research on using auditory ideas in speech recognition, including power-law nonlinearities, gammatone filterbanks and so on, and the PNCC papers do not cover it fully. Important ones are the fundamental paper about RASTA and some recent research on auditory-inspired features, like Gammatone Features and Feature Combination by Schlüter et al.
Building PNCC on auditory ideas raises quite fundamental questions that are not discussed in the papers above, though. One very important paper here is Spectral Signal Processing for ASR (1999) by Melvyn Hunt from Dragon. The idea from the paper is:
The philosophical case for taking what we know about the human auditory system as an inspiration for the representation used in our automatic recognition systems was set out in the Introduction, and it seems quite strong. Unfortunately, there does not seem to be much solid empirical evidence to support this case. Sophisticated auditory models have not generally been found to be better than conventional representations outside the laboratories in which they were developed, and none has found its way into a major mainstream system. Certainly, there are successful approaches and features that are generally felt to have an auditory motivation: the use of the mel-scale, the cube-root representation, and PLP. However, this paper has sought to show that they have no need of the auditory motivation, and their properties can be better understood in purely signal processing terms, or in some cases in terms of the acoustic properties of the production process. Other successful approaches, such as LDA made no pretense of having an auditory basis.

This idea is very important because the PNCC paper is a very experimental one and doesn't really cover the theory behind the design of the filterbank. There are good things in the PNCC design and some not so clear things too. Here are some observations I had:
1. PNCC is a really simple and elegant feature extraction method. All the steps can be clearly understood, and that makes PNCC very attractive. The noise robustness properties are really great too.
2. Noise filtering does reduce the accuracy in clean conditions. This reduction is usually visible (about 5% relative) but can be justified since we get quite a good improvement in noise. Although there is a claim that PNCC is better than MFCC on clean data, my experiments do not confirm that. The PNCC paper never provides exact numbers, only graphs, which makes it very hard to verify its findings.
3. Band bias subtraction and temporal masking are indeed very reasonable stages to apply in a feature extraction pipeline. Given that the noise is mostly additive with a slowly changing spectrum, it's easy to remove it using long-term integration and an analog of Wiener filtering.
4. The gammatone filterbank doesn't improve significantly over the triangular filterbank, so essentially its complexity is not justified. Moreover, the default PNCC filterbank is suboptimal compared to a well-tuned MFCC filterbank. The filterbank starts only from 200Hz, so for most broadcast recordings it has to be changed to 100Hz.
5. The power-law nonlinearity is not mathematically reasonable, since it doesn't help to transform a channel modification into a simple addition that can be removed with CMN later. The tests were done on a normalized database like WSJ, while every real database will show a reduction in performance due to the complex power-law effects. Overall, power normalization with a moving average makes things even worse and reduces the ability to normalize scaled audio at the training and decoding stages; for example, for very short utterances it's really hard to estimate the power properly. The power nonlinearity could be compensated with variance normalization, but there is no sign of that in the PNCC papers. So my personal choice is a shifted log nonlinearity, which is a log for high energies and has a shift at the low end to deal with noise. The log is probably a bit less accurate in noise, but it is stable and has good scaling properties.
6. For raw MFCC, either a lifter has to be applied to the coefficients for best performance, or LDA/MLLT has to be applied to make the features more Gaussian-like. Unfortunately, the PNCC paper doesn't say anything about liftering or LDA/MLLT. With LDA the results could be quite different from the ones reported.
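The scaling argument in item 5 can be checked numerically. The sketch below applies a constant gain to a few energies and compares the mean-normalized output of the log, a shifted log and a power law. The energies, the gain, the shift of 1.0 and the 1/15 exponent are all hypothetical values picked for illustration, and `cmn` here is plain mean subtraction over time on a single channel.

```python
import math


def cmn(values):
    """Mean normalization over time: subtract the average value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def shifted_log(energy, shift=1.0):
    """Log for high energies, flattened by a constant shift at the low end."""
    return math.log(energy + shift)


def power_law(energy, exponent=1.0 / 15.0):
    """Power-law compression with an assumed exponent of 1/15."""
    return energy ** exponent


energies = [1e4, 5e4, 2e5, 8e5]  # hypothetical channel energies over time
gain = 10.0                       # hypothetical channel or volume change


def max_diff(transform):
    """Largest change in the normalized feature caused by the gain."""
    original = cmn([transform(e) for e in energies])
    scaled = cmn([transform(gain * e) for e in energies])
    return max(abs(a - b) for a, b in zip(original, scaled))


print("log:        ", max_diff(math.log))
print("shifted log:", max_diff(shifted_log))
print("power law:  ", max_diff(power_law))
```

With the log, the gain becomes a constant offset that mean subtraction removes, so the difference is zero up to rounding. With the power law, the gain becomes a multiplicative factor, which mean subtraction alone cannot remove; that is why variance normalization would be needed to compensate.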
Still, PNCC seems to provide quite good robustness in noise and I think PNCC will provide improved performance for the default models. The current plan is to import PNCC into pocketsphinx and sphinx4 as the default features and train the models for them.
Unfortunately, it turned out that it's very hard to build a model which is relatively "generic". Language models are very domain-dependent, and it's almost impossible to build a good language model for every possible text. Books are almost useless for conversational transcription, no matter how much book text you have. And you need terabytes of data in order to improve accuracy by just 1%.
Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more or less widely available, so it will be possible to extend and improve the model in the future.
I find this problem of domain dependence quite interesting to solve. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:
Two Decades of Statistical Language Modeling: Where Do We Go from Here? by Ronald Rosenfeld
Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...
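The effect in the quote is easy to reproduce on a toy scale. The sketch below trains two add-one smoothed unigram models, one on a news-like sentence and one on a chat-like sentence, and measures perplexity on a chat-like test sentence. The texts are invented, and a unigram model stands in for the trigram models discussed above, so treat the numbers as an illustration of the direction of the effect only.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def perplexity(model, tokens):
    """exp of the average negative log-probability per token."""
    log_prob = sum(math.log(model[word]) for word in tokens)
    return math.exp(-log_prob / len(tokens))


# Invented toy corpora standing in for two domains.
news = "the market fell as the bank cut rates and the market closed lower".split()
chat = "yeah i know so uh what did you do yeah i mean you know".split()
vocab = set(news) | set(chat)

news_model = train_unigram(news, vocab)
chat_model = train_unigram(chat, vocab)

chat_test = "so yeah you know what i mean".split()
in_domain = perplexity(chat_model, chat_test)
out_of_domain = perplexity(news_model, chat_test)
print(f"in-domain: {in_domain:.1f}, out-of-domain: {out_of_domain:.1f}")
```

The model trained on the matching domain gets a clearly lower perplexity even though both training texts are about the same size.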
Besides the Nuance, Lumenvox and SpeechPro backends, it supports Pocketsphinx, which is amazing and free. The bad side is that it's not going to work out of the box. The decoder integration is just not done right:
- VAD does not properly strip silence before it passes audio frames to the decoder, and because of that recognition accuracy is essentially zero.
- The decoder configuration is not optimal.
- The result is not retrieved as it should be.
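To illustrate what "stripping silence" means in the first point, here is a minimal energy-threshold sketch that drops silent frames before they would reach a decoder. It is not the code of the integration discussed here or of Pocketsphinx itself; the function names, the threshold and the hangover length are assumptions for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def strip_silence(frames, threshold, hangover=2):
    """Keep speech frames plus a few trailing frames after each speech run.

    A frame counts as speech when its energy reaches the threshold. The
    hangover keeps a short tail so that word endings are not cut off.
    """
    kept = []
    remaining = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            remaining = hangover
            kept.append(frame)
        elif remaining > 0:
            remaining -= 1
            kept.append(frame)
    return kept


# Hypothetical input: 5 silent frames, 3 loud frames, 5 silent frames.
silence = [0, 0, 0, 0]
speech = [1000, -1000, 1000, -1000]
frames = [silence] * 5 + [speech] * 3 + [silence] * 5

to_decoder = strip_silence(frames, threshold=100.0)
print(len(frames), "frames in,", len(to_decoder), "frames passed on")
```

Only the frames in `to_decoder` would be handed to the recognizer; a real VAD would use a more robust decision than a fixed energy threshold.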
In order to make an informed decision you need to consider a set of features specifically required to run speech recognition in a low-resource environment. Without them your system will probably be accurate, but it will also consume too many resources to be useful. Some of them are:
This year it's going to be very interesting. The data to create the voices is taken from audiobooks, and one part of the test includes synthesis of paragraphs. That means that you can actually estimate how a TTS system built from public data can perform.
The links to register are:
For speech experts:
For other volunteers:
The challenge was created in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.
Please distribute the second URL as widely as you can - to your colleagues, students, friends, other appropriate mailing lists, social networks, and so on.