nsh - Speech Recognition With CMU Sphinx

A blog about speech technologies - recognition, synthesis, identification. Mostly it is about the scientific side: the core design of the engines, new methods, machine learning; and also about the technical side, such as the architecture of the recognizer and the design decisions behind it.


System Combination WER

There is one thing I usually wonder about while reading yet another conference paper on speech recognition. The usual paper limit is 4 pages, and the authors usually want to write exactly 4 pages. What do you do if you don't have enough material? Right, you can build exactly the same systems with PLP features, MFCC features and probably some other features, add one more table with system combination WER and probably a graph too, or you can mix two types of LM and report another nice graph.

This practice started a long time ago, I think during the NIST evaluations, when participants reported system combination WER. NIST even invented the ROVER algorithm for better combination.
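For readers who haven't met it, the heart of ROVER is word-level voting over aligned hypotheses. Below is a minimal Python sketch of just the voting step, assuming the hypotheses have already been aligned slot by slot; the real algorithm builds that alignment as a word transition network with dynamic programming.

    # Voting step of ROVER-style combination (alignment is assumed to be done).
    # "@" marks a null arc, i.e. a system that proposed no word in this slot.
    from collections import Counter

    def rover_vote(aligned_hyps):
        """aligned_hyps: list of equal-length word lists, one per system."""
        combined = []
        for slot in zip(*aligned_hyps):
            word, _ = Counter(slot).most_common(1)[0]  # simple majority vote
            if word != "@":
                combined.append(word)
        return combined

    print(rover_vote([["the", "cat", "sat", "@"],
                      ["the", "bat", "sat", "down"],
                      ["the", "cat", "sat", "down"]]))
    # -> ['the', 'cat', 'sat', 'down']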

For me personally such content reduces the quality of the paper significantly. System combination WER was never a meaningful addition. Yes, it is well known that if you combine MFCC with PLP you can reduce WER by 0.1% and probably win the competition. From a scientific point of view this result adds zero new information; it is just filler for the rest of the paper. Also, to get a combination result for 5 systems you usually spend 5 times more compute on the individual results. That is not worth a 0.1% improvement; you can usually get the same with slightly wider beams.

So instead, consider doing something else: try to cover the algorithms you used and explain why they work, describe the troubles you have solved, raise new questions you consider interesting. At least try to collect more references and write a good overview of previous research. That will save your time, the reader's time, and the computing power needed to build yet another model.


Mixer 6 database release by LDC & Librivox

LDC has recently announced the availability of a very large speech database for acoustic model training. The database, named Mixer 6, contains an incredible 15000 hours of transcribed speech data from a few hundred speakers. While commercial companies have access to significantly bigger sets, Mixer is the biggest dataset ever made available for research. The previously available Fisher database has only around 2000 hours.

It would be really interesting to see the results obtained with this database; such an amount of data should improve the performance of existing systems. However, I expect this dataset will pose some critical challenges to the research and development community. Essentially, such a data size means it will be very hard to train a model using conventional software and accessible hardware. For example, it takes about a week and a decent cluster to train a model on 1000 hours; with 15000 hours you would have to wait several months unless more efficient algorithms are introduced. So, it is not easy.

On the other hand, we have access to a similar amount of data - the Librivox archive contains even more high-quality recordings with the text available. Training a model on Librivox data certainly should be a focus of development. Such training is not going to be straightforward either - new algorithms and software must be created. A critical issue is to design an algorithm which will improve the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.

By the way, Librivox accepts donations, and they are definitely worth it.



Around noise-robust PNCC features

Last week I worked on PNCC, the well-known noise-robust features for speech recognition by Chanwoo Kim and Richard Stern. I ran quite a few experiments with the parameters and did some research around PNCC. Here are some thoughts on that.

The fundamental paper on PNCC is C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients for Robust Speech Recognition", IEEE Trans. Audio, Speech, Lang. Process., but for a detailed explanation of the process and the experiments one can look at C. Kim, "Signal Processing for Robust Speech Recognition Motivated by Auditory Processing", Ph.D. thesis. Octave code is available too. A C implementation is also available in the bug tracking system, thanks to Vyacheslav Klimkov, and will be committed soon after some cleanup. I hope a Sphinx4 implementation will follow.

However, quite a lot of important information is not contained in the papers. The main PNCC pipeline is similar to that of conventional MFCC except for a few modifications. First, a gammatone filterbank is used instead of the triangular filterbank. Second, the filterbank energies are filtered to remove noise and reverberation effects. And third, a power-law nonlinearity together with power normalization is applied instead of the logarithm. Most of the pipeline design is inspired by research on the human auditory system.
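To make the contrast with MFCC concrete, here is a rough Python sketch of the nonlinearity stage only, assuming the filterbank energies are already computed; the gammatone filterbank, the medium-time noise filtering and the power normalization steps are left out, and the 1/15 exponent is the value used in the PNCC papers.

    # MFCC-style vs PNCC-style compression of filterbank energies (sketch).
    import numpy as np
    from scipy.fftpack import dct

    def mfcc_style_cepstra(fb_energies, n_cep=13):
        # conventional pipeline: log compression followed by a DCT
        return dct(np.log(fb_energies + 1e-10), type=2, axis=-1, norm='ortho')[..., :n_cep]

    def pncc_style_cepstra(fb_energies, n_cep=13, exponent=1.0 / 15.0):
        # PNCC replaces the log with a power-law nonlinearity y = x**(1/15);
        # in the full pipeline this follows asymmetric noise suppression and
        # temporal masking and is accompanied by mean power normalization.
        return dct(fb_energies ** exponent, type=2, axis=-1, norm='ortho')[..., :n_cep]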

There is a lot of research on using auditory ideas (power-law nonlinearities, gammatone filterbanks and so on) in speech recognition, and the PNCC papers do not cover it fully. Important works are the fundamental paper on RASTA and some recent research on auditory-inspired features like "Gammatone Features and Feature Combination" by Schlüter et al.

The PNCC design based on auditory features raises quite fundamental questions that are not discussed in the papers above. One very important paper here is "Spectral Signal Processing for ASR" (1999) by Melvyn Hunt from Dragon. The idea from the paper is:
The philosophical case for taking what we know about the human auditory system as an inspiration for the representation used in our automatic recognition systems was set out in the Introduction, and it seems quite strong. Unfortunately, there does not seem to be much solid empirical evidence to support this case. Sophisticated auditory models have not generally been found to be better than conventional representations outside the laboratories in which they were developed, and none has found its way into a major mainstream system. Certainly, there are successful approaches and features that are generally felt to have an auditory motivation—the use of the mel-scale, the cube-root representation, and PLP. However, this paper has sought to show that they have no need of the auditory motivation, and their properties can be better understood purely signal processing terms, or in some cases in terms of the acoustic properties of the production process. Other successful approaches, such as LDA made no pretense of having an auditory basis.
This idea is very important because the PNCC paper is a very experimental one and doesn't really cover the theory behind the design of the filterbank. There are good things in the PNCC design and not so clear things too. Here are some observations I had:

1. PNCC is a really simple and elegant feature extraction; all the steps can be clearly understood, and that makes PNCC very attractive. The noise robustness properties are really great too.

2. Noise filtering does reduce accuracy in clean conditions; usually this reduction is noticeable (about 5% relative) but can be justified since we get quite a good improvement in noise. Although there is a claim that PNCC is better than MFCC on clean data, my experiments do not confirm that. The PNCC papers never provide exact numbers, only graphs, which makes it very hard to verify their findings.

3. Band bias subtraction and temporal masking are indeed very reasonable stages to apply in a feature extraction pipeline. Given that the noise is mostly additive with a slowly changing spectrum, it is easy to remove it using long-term integration and an analog of Wiener filtering (a toy sketch of this appears after the list).

4. The gammatone filterbank doesn't improve significantly over the triangular filterbank, so essentially its complexity is not justified. Moreover, the default PNCC filterbank is suboptimal compared to a well-tuned MFCC filterbank. The filterbank starts only from 200Hz, so for most broadcast recordings it has to be changed to 100Hz (the filterbank sketch after the list shows how the filter centers shift).

5. The power-law nonlinearity is mathematically questionable since it doesn't transform a channel modification into a simple addition that can later be removed with CMN. The tests were done on a normalized database like WSJ, while every real database will show a reduction in performance due to the complex power-law effects. The overall power normalization with a moving average makes things even worse and reduces the ability to normalize scaled audio at the training and decoding stages; for example, for very short utterances it is really hard to estimate the power properly. The power nonlinearity could be compensated with variance normalization, but there is no sign of that in the PNCC papers. So my personal choice is a shifted log nonlinearity, which is a log for high energies and has a shift at the low end to deal with noise (see the sketch after this list). The log is probably a bit less accurate in noise, but it is stable and has good scaling properties.

6. For raw MFCC a lifter has to be applied to the coefficients for best performance, or LDA/MLLT has to be applied to make the features more Gaussian-like (the standard sinusoidal lifter is sketched below). Unfortunately, the PNCC paper doesn't say anything about liftering or LDA/MLLT. With LDA the results could be quite different from the ones reported.
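Below are a few toy Python sketches illustrating observations 3 through 6. They are simplified illustrations of the ideas under stated assumptions, not the exact processing used in the PNCC code or in SphinxBase.

First, bias subtraction by long-term integration (observation 3): a slowly updated running average of the per-band power serves as a noise estimate and is subtracted with flooring. The real PNCC stages (asymmetric filtering, temporal masking) are more elaborate, and the smoothing constant here is arbitrary.

    import numpy as np

    def subtract_slow_bias(power, alpha=0.98, floor=0.01):
        """power: array of shape (frames, bands) with filterbank power values."""
        noise = power[0].copy()
        out = np.empty_like(power)
        for t, frame in enumerate(power):
            noise = alpha * noise + (1.0 - alpha) * frame      # slow noise estimate
            out[t] = np.maximum(frame - noise, floor * frame)  # floor keeps values positive
        return out

Second, the effect of moving the lower filterbank edge (observation 4), shown here for a mel-spaced triangular filterbank as used in MFCC: the first filter centers move down noticeably when the edge goes from 200 Hz to 100 Hz.

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def filter_centers(n_filters, low_hz, high_hz):
        mels = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), n_filters + 2)
        return mel_to_hz(mels)[1:-1]

    print(filter_centers(40, 200.0, 8000.0)[:3])  # default 200 Hz lower edge
    print(filter_centers(40, 100.0, 8000.0)[:3])  # edge moved down to 100 Hz

Third, the shifted log nonlinearity from observation 5: for large energies it behaves like a plain log, so a channel gain still becomes an additive offset that CMN can remove, while the shift keeps low-energy, noise-dominated values bounded. The shift value is an arbitrary illustration, not a tuned constant.

    def shifted_log(energy, shift=1.0):
        # ~log(energy) for energy >> shift, ~log(shift) near the noise floor
        return np.log(energy + shift)

And the standard sinusoidal cepstral lifter from observation 6, as used in HTK-style front ends, which rescales the higher-order cepstral coefficients:

    def lifter(cepstra, L=22):
        # c'_n = (1 + (L/2) * sin(pi * n / L)) * c_n
        n = np.arange(cepstra.shape[-1])
        return cepstra * (1.0 + (L / 2.0) * np.sin(np.pi * n / L))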

Still, PNCC seems to provide quite good robustness in noise, and I think it will improve the performance of the default models. The current plan is to bring PNCC into pocketsphinx and sphinx4 as the default features and train models for them.


Building a Generic Language Model

I recently spent some time building a language model from the open Gutenberg texts; it has been released today:

http://cmusphinx.sourceforge.net/2013/01/a-new-english-language-model-release/

Unfortunately, it turned out to be very hard to build a model which is relatively "generic". Language models are very domain-dependent; it is almost impossible to build a good language model for every possible text. Books are almost useless for conversational transcription, no matter how much book text you have. And you need terabytes of data to improve accuracy by just 1%.
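As a toy illustration of this brittleness, here is a small Python sketch that estimates a bigram model with add-one smoothing on a scrap of book-like text and compares its perplexity on in-domain and conversational text. Real experiments of course use proper toolkits such as SRILM or KenLM and much larger corpora; the snippet only demonstrates the effect.

    import math
    from collections import Counter

    def train_bigram(words):
        unigrams, bigrams = Counter(words), Counter(zip(words, words[1:]))
        vocab = len(unigrams)
        def logprob(w1, w2):
            # add-one smoothed bigram probability
            return math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
        return logprob

    def perplexity(logprob, words):
        lp = sum(logprob(w1, w2) for w1, w2 in zip(words, words[1:]))
        return math.exp(-lp / (len(words) - 1))

    book = "it was the best of times it was the worst of times".split()
    chat = "yeah i mean it was like you know really good".split()
    model = train_bigram(book)
    print(perplexity(model, book))   # low: the model has seen this style
    print(perplexity(model, chat))   # noticeably higher out of domain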

Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more or less widely available, so it will be possible to extend and improve the model in the future.

I find the problem of domain dependence quite interesting. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:


Two Decades of Statistical Language Modeling: Where Do We Go From Here? by Ronald Rosenfeld


Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...
Recent advances in language modeling for speech recognition include discriminative language models, which are impractical to build unless you have unlimited processing power.


And recurrent neural network language models, implemented in the RNNLM toolkit:

Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, JP

RNNLMs are becoming very popular and they can provide significant gains in perplexity and decoding WER, but I don't believe they can solve the domain-dependency issue and enable us to build a truly generic model.

So, research on the subject is still needed. Maybe, once the domain-dependent variation can be properly captured, a far more accurate language model can be built.




Recent state of UniMRCP

A cool project in the CMUSphinx ecosystem is UniMRCP. It implements the MRCP protocol and backends which make speech functions available to the most common telephony frameworks like Asterisk or FreeSWITCH.

Besides the Nuance, Lumenvox and SpeechPro backends, it supports Pocketsphinx, which is amazing and free. The bad side is that it's not going to work out of the box. The decoder integration is just not done right:

  • VAD does not properly strip silence before it passes audio frames to the decoder; because of that, recognition accuracy is essentially zero (see the sketch after this list for what the decoder expects).
  • The decoder configuration is not optimal.
  • The result is not retrieved as well as it could be.
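For reference, here is a rough Python sketch of how the decoder expects to be driven: only speech audio should be fed between start_utt() and end_utt(), and the result should be read after the utterance ends. The paths, the frame source and the is_speech() energy check are placeholders standing in for the configuration and the VAD decision UniMRCP already has; the real plugin is of course written in C against the same pocketsphinx API.

    from pocketsphinx.pocketsphinx import Decoder

    def is_speech(frame):
        # placeholder for the VAD decision the MRCP plugin already makes
        return max(abs(s) for s in memoryview(frame).cast('h')) > 500

    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/acoustic/model')      # placeholder paths
    config.set_string('-lm', '/path/to/language/model.lm')
    config.set_string('-dict', '/path/to/dictionary.dict')
    decoder = Decoder(config)

    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:                    # 16 kHz, 16-bit mono PCM
        while True:
            frame = f.read(2048)
            if not frame:
                break
            if is_speech(frame):                               # strip silence before decoding
                decoder.process_raw(frame, False, False)
    decoder.end_utt()

    if decoder.hyp() is not None:
        print(decoder.hyp().hypstr)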
Also, UniMRCP is not going to work with recent Asterisk releases like 11; it works with 1.6 as far as I can see. The new API is not supported.

So, a lot of work is needed to make it actually work and stop confusing the users who build their models with CMUSphinx. However, the prospects are amazing, so one might want to spend some time finalizing the Pocketsphinx plugin in UniMRCP. I hope to see it soon.


How To Choose Embedded Speech Recognizer

There are quite a few solutions around for building an open source speech recognition system for a low-resource device, and it is quite hard to choose. For example, you need a speech recognition system for a platform like the Raspberry Pi and you are choosing between HTK, CMUSphinx, Julius and many other implementations.

In order to make an informed decision you need to consider a set of features specifically required to run speech recognition in a low-resource environment. Without them your system will probably be accurate, but it will also consume too many resources to be useful. Some of them are:


Blizzard Challenge 2012

This year it is a little bit late, but it is great that the Blizzard Challenge 2012 evaluation is now online.

This year it is going to be very interesting. The data used to create the voices is taken from audiobooks, and one part of the test includes synthesis of paragraphs. That means you can actually estimate how a TTS built from public data can perform.

The links to register are:

For speech experts:
http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerexperts.html

For other volunteers:
http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerweb.html

The challenge was created in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

Please distribute the second URL as widely as you can - to your colleagues, students, friends, other appropriate mailing lists, social networks, and so on.