Dither is considered harmful

MFCC features used in speech recognition are still a reasonable choice if you want to recognize generic speech. With tunings like frequency warping for VTLN and MLLT they can still deliver reasonable performance. Although there are many parameters to tune, like the upper and lower frequencies or the shape of the mel filters, the default values mostly work fine. Still, I had to spend this week on one issue related to zero-energy frames.

Zero-energy frames are quite common in telephone-recorded speech. Due to noise cancellation or VAD-based speech compression, telephony recordings are full of frames with zero energy. The issue is that MFCC computation involves taking the log of the energies, so you end up with the undefined value log 0. There are several ways to overcome this.

The one used in HTK or SPTK, for example, is to floor the energy before taking the log, usually at a value like 1e-5, which is quite large in magnitude in the log domain. This solution is actually quite bad, at least in its Sphinx implementation, because it strongly affects CMN computation: the mean goes down and bad things happen. A single silent frame can affect the result of the whole phrase.

Another one is dither: you apply random 1-bit noise to the whole waveform and use this modified waveform for training. Such a change is usually enough to make the log take acceptable values around -1.
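The difference between the two workarounds can be sketched in a few lines of Python (numpy assumed; the floor value, frame size, and sample rate here are just illustrative):

```python
import numpy as np

def log_energy(frame, floor=1e-5):
    """Frame log-energy with an HTK/SPTK-style floor to avoid log(0)."""
    energy = float(np.sum(np.asarray(frame, dtype=np.float64) ** 2))
    return np.log(max(energy, floor))

def add_dither(samples, rng):
    """Add random 1-bit (+/-1 LSB) noise to integer PCM samples."""
    return samples + rng.integers(-1, 2, size=samples.shape)

rng = np.random.default_rng(0)          # a fixed seed makes runs repeatable
silent = np.zeros(400, dtype=np.int16)  # a 25 ms zero-energy frame at 16 kHz

floored = log_energy(silent)                    # hits the floor: log(1e-5) ~= -11.5
dithered = log_energy(add_dither(silent, rng))  # small but well-defined energy
```

The floored value is a large outlier relative to normal frames, which is exactly what skews the cepstral mean; the dithered frame gets a small but ordinary energy instead.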

There have been complaints about dither; the best-known one is that it affects recognition scores, so results can differ from run to run. That's a bad thing, but not so bad if you start with a predefined seed. So I used to think that dither was fine, and by default it's applied both in training and in the decoder. But recently, while testing the sphinxtrain tutorial, I came across a more important issue.

See the results on the an4 database from run to run, without any modifications:

TOTAL Words: 773 Correct: 645 Errors: 139
TOTAL Percent correct = 83.44% Error = 17.98% Accuracy = 82.02%
TOTAL Insertions: 11 Deletions: 17 Substitutions: 111
TOTAL Words: 773 Correct: 633 Errors: 149
TOTAL Percent correct = 81.89% Error = 19.28% Accuracy = 80.72%
TOTAL Insertions: 9 Deletions: 23 Substitutions: 117
TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115
TOTAL Words: 773 Correct: 650 Errors: 133
TOTAL Percent correct = 84.09% Error = 17.21% Accuracy = 82.79%
TOTAL Insertions: 10 Deletions: 17 Substitutions: 106
TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115

If you are lucky you can even get a WER of 15.95%. That's certainly unacceptable, and it remains unclear why training is so sensitive to the dither applied. Clearly it makes any testing impossible. I checked these results on a medium-vocabulary, 50-hour database and they are the same: accuracy varies widely from run to run. The interesting thing is that only training is affected that much; in testing you see only a very slight difference of about 0.1%.

So far my solutions are:
  • Disable dither during training
  • Apply a patch to drop frames with zero energy (this seems useless, but it helps to be less nervous about the warnings)
  • Decode with dither
I hope I'll be able to provide more information about the reasons for this instability in the future, but for now that's all I know.

Text summarization low hanging fruit

Actually, all the data required for quite precise text summarization is almost in place: one just needs to add WordNet support from nltk to the Open Text Summarizer, calculate frequencies, and present highlighted sentences to the user. Or it's possible to do the same in Python with nltk itself.
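As a sketch of the frequency approach in plain Python (nltk would supply proper tokenization and the WordNet lookups; the scoring here is just raw word frequency, an intentional simplification):

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Score sentences by average word frequency and return the top n, in order."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sent):
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    top = set(sorted(sents, key=score, reverse=True)[:n])
    return [s for s in sents if s in top]  # keep original sentence order
```

For mail processing one could run each message through summarize() and show only the highlighted sentences.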

It would help in many cases, for example in mail processing. Getting 200 mails a day, it's really hard to read them all. Or maybe it's just time to unsubscribe from some mailing lists.

My vote for voxforge

I just voted for VoxForge in the category: "Most Likely to Change the Way You Do Everything". You might want to do the same :)

Go to http://sourceforge.net/community/cca09/nominate/?project_name=VoxForge&project_url=http://voxforge.org/

Blizzard Challenge 2009

CSTR and others are pleased to announce that the listening tests for the Blizzard Challenge 2009 are now running. The Blizzard Challenge is an annual open speech synthesis evaluation in which participants build voices using common data, and a large listening test is used to compare them. Participants include some of the leading commercial and academic research groups in the field.

I would appreciate your help in getting as many listeners to participate as possible, by forwarding this message on to other lists, colleagues, students, and of course taking part yourself.

The listening test should take 30-60 minutes to complete, and can be done in stages if you wish. You do not need to be a native speaker of the language in order to take part. There are 4 different start pages for the listening test, as follows:



Speech Experts:

Mandarin Chinese:

Speech Experts:

Whether you consider yourself a 'speech expert' is left to your own judgement.

Training the large database trick

Training on a large database requires a cluster. SphinxTrain supports training on Torque/PBS, for example; to use it you need to set the following configuration variables:


and set the number of parts to train with. The issue is guessing the right number of parts. I previously thought that accuracy depends on the number of parts, based on results like these:

1 part:

TOTAL Words: 773 Correct: 660 Errors: 126
TOTAL Percent correct = 85.38% Error = 16.30% Accuracy = 83.70%
TOTAL Insertions: 13 Deletions: 9 Substitutions: 104

3 parts:

TOTAL Words: 773 Correct: 583 Errors: 262
TOTAL Percent correct = 75.42% Error = 33.89% Accuracy = 66.11%
TOTAL Insertions: 72 Deletions: 17 Substitutions: 173

10 parts:

TOTAL Words: 773 Correct: 633 Errors: 168
TOTAL Percent correct = 81.89% Error = 21.73% Accuracy = 78.27%
TOTAL Insertions: 28 Deletions: 10 Substitutions: 130

20 parts:

TOTAL Words: 773 Correct: 619 Errors: 181
TOTAL Percent correct = 80.08% Error = 23.42% Accuracy = 76.58%
TOTAL Insertions: 27 Deletions: 13 Substitutions: 141

But it turned out that all of the above is not true. One potential source of problems was that the norm.pl script grabs all the subdirectories under the bwaccum one indiscriminately. So if there are some old bwaccum dirs left over (e.g. if you train on 20 parts first and then start again with 10, without deleting the directories in between), the norm script will screw up (thanks to David Huggins-Daines for pointing that out to me). In this particular test there was another problem: I forgot to update the mdef after the model rebuild, and the old scripts didn't do that automatically. With multipart training the order of senones in the mdef is different, which is why there was a regression, even though the set of senones is the same.

So the testing and statements above are completely wrong: accuracy doesn't depend on the number of parts used, as expected. This confirms the ground truth that a correctly stated experiment is the most important thing in research.

Now only one issue is left: the accuracy drop from the old tutorial to the new one. But that is a completely different issue, discussed in my mails on cmusphinx-sdmeet now.

Bad prompts issue

After quite a lot of training of a model on a small part of the database to test things, I came to the conclusion that the main issue is bad prompts. Indeed, the accuracy on the training set for 4 hours of data, with the language model trained on the same training prompts, is only 85%; usually it should be around 93%. The issue here is that the real testing prompts are also bad, and they should stay that way, otherwise we'll be restricted to high-quality speech only. I remember I tried forced alignment with the Communicator model before, but it didn't improve things much, precisely because of the testing-set issue. Another try was to use a skip state, but that was not fruitful either.

So the plan for now is to choose a subset with forced alignment again and train the model to check whether the hypothesis is true and bad prompts in the acoustic database are indeed the main issue. It feels like we are going in circles.
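The subset-selection step could look roughly like this; the per-utterance score format (utterance id plus an average acoustic log-likelihood) is an assumption here, since the actual aligner output format differs between setups:

```python
def select_utterances(score_lines, threshold=-8.0):
    """Keep utterances whose forced-alignment score clears the threshold.

    Each line is assumed to look like "uttid score". Utterances with a low
    average likelihood probably have prompts that don't match the audio,
    so they get dropped from the training subset.
    """
    kept = []
    for line in score_lines:
        uttid, score = line.split()
        if float(score) >= threshold:
            kept.append(uttid)
    return kept
```

The threshold itself would have to be tuned on held-out data; -8.0 is just a placeholder.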

I ended up reading the article titled "Lightly supervised model training".

Speech AdBlock

Inspired by Daniel Holth's application that removes the word "twitter" from podcasts:


I think it's a very good idea to implement a keyword filter to block advertising in podcasts. Although keyword spotting is not easily done with CMU Sphinx right now, it should be a rather straightforward thing to implement. In the end it could be just a binary application that takes a list of keywords to block and filters an mp3 file, giving the user the same file with the advertising blocked.
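The filtering step itself is simple once a keyword spotter has produced time stamps. In this sketch the (start, end) hits in seconds are assumed to come from some detector, which is the hard part left out here:

```python
def mute_regions(samples, rate, hits, pad=0.2):
    """Zero out detected keyword regions (plus padding) in a list of PCM samples."""
    out = list(samples)
    for start, end in hits:
        lo = max(0, int((start - pad) * rate))
        hi = min(len(out), int((end + pad) * rate))
        for i in range(lo, hi):
            out[i] = 0  # silence instead of the blocked keyword
    return out
```

The padding hides co-articulation around the keyword; 0.2 s is a guess, not a measured value.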

Cepwin Features Training

Recently the option to bypass delta and delta-delta feature extraction and directly apply an LDA transform matrix to the cepstral coefficients of sequential frames was added to SphinxTrain. To use it you need to adjust the training config and the decoder as well:

  • Set the feature type to 1s_c
  • Add $CFG_FEAT_WINDOW=3; to the config file
  • Train with MLLT
  • Apply the attached patch cepwin.diff to sphinxbase
  • Decode

  • You can use these models in sphinx4 now; the following config should do the work:

    <component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor">
        <property name="windowSize" value="3"/>
    </component>
    <component name="lda" type="edu.cmu.sphinx.frontend.feature.LDA">
        <property name="loader" value="sphinx3Loader"/>
    </component>

    I haven't found the optimal parameters yet, but it seems that something like cepwin=3 and a final dimension around 40 should work. I hope to get results on this soon.