CMU Sphinx Users and Developers Workshop 2010

I'm happy to announce

The First CMU Sphinx Workshop

20 March 2010, Dallas, TX, USA

Event URL:

Papers are solicited for the CMU Sphinx Workshop for Users and Developers (CMU-SPUD 2010), to be held in Dallas, Texas as a satellite to ICASSP 2010.

CMU Sphinx is one of the most popular open source speech recognition systems. It is currently used by researchers and developers in many locations world-wide, including universities, research institutions and industry. CMU Sphinx's liberal license terms have made it a significant member of the open source community and have provided a low-cost way for companies to build businesses around speech recognition.

The first SPUD workshop aims at bringing together CMU Sphinx users, to report on applications, developments and experiments conducted using the system. This workshop is intended to be an open forum that will allow different user communities to become better acquainted with each other and to share ideas. It is also an opportunity for the community to help define the future evolution of CMU Sphinx.

We are planning a one-day workshop with a limited number of oral presentations, chosen for breadth and stimulation, held in an informal atmosphere that promotes discussion. We hope this workshop will expose participants to different perspectives and that this in turn will help foster new directions in research, suggest interesting variations on current approaches and lead to new applications.

Papers describing relevant research and new concepts are solicited on, but not limited to, the following topics. Papers must describe work performed with CMU Sphinx:
  • Decoders: PocketSphinx, Sphinx-2, Sphinx-3, Sphinx-4
  • Tools: SphinxTrain, CMU/Cambridge SLM toolkit
  • Innovations / additions / modifications of the system
  • Speech recognition in various languages
  • Innovative uses, not limited to speech recognition
  • Commercial applications
  • Open source projects that incorporate Sphinx
  • Novel demonstrations
Manuscripts must be between 4 and 6 pages long, in standard ICASSP double-column format. Accepted papers will be published in the workshop proceedings.

Important Dates

Paper submission: 30 November 2009
Notification of paper acceptance: 15 January 2010
Workshop: 20 March 2010


Bhiksha Raj - Carnegie Mellon University
Evandro Gouvêa - Mitsubishi Electric Research Labs
Richard Stern - Carnegie Mellon University
Alex Rudnicky - Carnegie Mellon University
Rita Singh - Carnegie Mellon University
David Huggins-Daines - Carnegie Mellon University
Nickolay Shmyrev - Nexiwave
Yannick Estève - Laboratoire d'Informatique de l'Université du Maine


To email the organizers, please send email to

Using HTK models in sphinx4

As of yesterday, the long-awaited patch by Christophe Cerisara has landed in svn trunk with the help of the super-fast Yaniv Kunda. You can now use HTK models directly from sphinx4. It's not entirely easy, though: I spent a few hours today figuring out what is required, so here is a little step-by-step howto:

1. Update to sphinx4 trunk

2. Download a small model. Binary loading is unfortunately not supported yet, and loading a model from a huge text file takes a lot of resources. Get a model from Keith Vertanen

3. Convert model to text format with HTK HHEd

mkdir out
touch empty
HHEd -H hmmdefs -H macros -M out empty tiedlist

4. Replace the model in the Lattice demo configuration file:

<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="wsjLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.HTKLoader">
    <property name="logMath" value="logMath"/>
    <property name="modelDefinition" value="/home/shmyrev/sphinx4/wsj/out/hmmdefs"/>
    <property name="unitManager" value="unitManager"/>
</component>

Please note that the modelDefinition property points to the location of the newly created hmmdefs file.

5. Replace the frontend configuration to load HTK features from a file. Unfortunately it's impossible to create HTK features with the sphinx4 frontend right now, but I hope this will be implemented soon. Some bits are already present, like the DCT-II transform in frontend.transform.DiscreteCosineTransform2; some are easy to set up, like proper filter coefficients; some are missing. So for now we'll recognize an MFC file instead.

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item> streamHTKSource </item>
    </propertylist>
</component>
<component name="streamHTKSource" type="edu.cmu.sphinx.frontend.util.StreamHTKCepstrum">
    <property name="cepstrumLength" value="39"/>
</component>

Then change the Java file:

StreamHTKCepstrum source = (StreamHTKCepstrum) cm.lookup ("streamHTKSource");
InputStream stream = new FileInputStream(new File ("input.mfc"));

6. Now extract the MFC features. Create a config file for HCopy:

TARGETRATE = 100000.0
WINDOWSIZE = 250000.0

and run it

HCopy -C config 10001-90210-01803.wav input.mfc

Make sure input.mfc is located in the top-level sphinx4 folder, since that is where the demo will read it from.
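A quick sanity check on the extracted file can save some debugging. The sketch below is not part of sphinx4 or HTK. It assumes the standard 12-byte HTK parameter file header (frame count, sample period in 100 ns units, bytes per frame, parameter kind) stored big-endian, and uncompressed 4-byte float data. The name parse_htk_header is made up for this illustration, and the header bytes are synthetic.

```python
import struct

# Assumed layout: nSamples (int32), sampPeriod (int32), sampSize (int16), parmKind (int16)
HTK_HEADER = struct.Struct(">iihh")


def parse_htk_header(data):
    """Decode the 12-byte header at the start of an HTK parameter file."""
    n_samples, samp_period, samp_size, parm_kind = HTK_HEADER.unpack(
        data[:HTK_HEADER.size]
    )
    return {
        "frames": n_samples,
        "frame_shift_ms": samp_period / 10000.0,  # header time unit is 100 ns
        "vector_length": samp_size // 4,          # assumes 4-byte floats
        "parm_kind": parm_kind,                   # raw code, left undecoded here
    }


# Synthetic header for illustration: 300 frames, 10 ms shift, 39 floats per frame.
example = HTK_HEADER.pack(300, 100000, 39 * 4, 6)
info = parse_htk_header(example)
print(info)

# For a real file one could read the first 12 bytes instead:
# with open("input.mfc", "rb") as f:
#     info = parse_htk_header(f.read(HTK_HEADER.size))
```

With TARGETRATE = 100000.0 the frame shift should come out as 10 ms, and the vector length should match the cepstrumLength of 39 from the frontend configuration.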

7. Now everything is ready

ant && java -jar bin/LatticeDemo.jar

Check the result

I heard: once or a zero zero one nine oh to one oh say or oil days or a jury

It's not very precise, but still ok for such a small model and limited language model.

This is still a work in progress and a lot of things are still pending. The most important are reading binary HTK files, frontend adaptation, cleanup and unification. But I really look forward to the results, since it's a promising approach. There are not so many BSD-licensed HTK decoders out there.

Speech Recognition As Experimental Science

It's well known that there are two types of physics: theoretical and experimental. At school I always liked the latter: measuring the speed of a ball or a voltage, plotting graphs and so on. Unfortunately, later on I was mostly doing math or programming. Only recently, when I started to spend a lot of time on speech recognition, did I find out why I like it so much: it's also an experimental science.

When you build a speech recognition system, your time is mostly spent on all these beautiful things: setting up the training database, running the learning process, tracking the results. You are trying to understand nature and find its laws; you want to find the best feature set, the best phoneset, the best beams and more. You have experimental material, and sometimes it turns out there are things you forgot to take into account. It's a really encouraging activity.

Of course there are important drawbacks; issues like proper design of experiments arise. Unfortunately this is not widely described in the speech literature, but speech recognition experiments are just one example of experiments, so all the general issues apply to them. To list a few:
  • Reproducibility
  • Connection of the theory and the practice
  • Estimation of the results and their validity
For example, the last point is very important. Currently, when we run a database test we just get a number. We rely on it without even estimating the deviation and the other important attributes of every scientific measurement. As a result we make unreliable decisions, like I did with the MLLT transform. I now think that we should be more careful about that.
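To make the point about deviation concrete, here is a minimal sketch of a rough error bar for a single test result. It treats every word as an independent trial and uses the normal approximation to a binomial proportion. That is an assumption that real word errors only loosely satisfy (errors cluster within utterances, and insertions are not bounded by the word count), so the interval is indicative at best. The function name and the counts are hypothetical.

```python
import math


def error_rate_interval(errors, total, z=1.96):
    """Return (rate, low, high) from the normal approximation to a proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    rate = errors / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return rate, max(0.0, rate - half_width), rate + half_width


# Hypothetical test set: 100 word errors out of 1000 reference words.
rate, low, high = error_rate_interval(100, 1000)
print(f"error rate {rate:.1%}, rough 95% interval {low:.1%} to {high:.1%}")
```

On a test set of this size the interval spans a few percentage points, so a difference of half a point between two systems says very little on its own.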

So that's why I started with the aforementioned Wikipedia page, trying to find a good book on experiment design. Of course it would also be nice to find appropriate software for managing the experiment workflow.

The First Glance On The Interspeech 2009 Papers

Interspeech 2009 in Brighton ended today. Unfortunately I wasn't able to participate for various reasons. Still, it was very interesting to review the list of sessions and abstracts and to read some of the available articles. The current activity in speech research is amazing and the number of articles and groups is enormous: in total I counted 459 abstracts with grep. It was enjoyable to process them all. So far I have reduced the list to 50% of its original size, so I still need a few more passes to find the most interesting items. A few random thoughts:

Sphinx is mentioned 2 times and HTK only once :), that's a win. Of course many researchers use HTK for experiments, so it's more a win in being more open.

A lot of machine learning research. Quite a significant amount of it is dedicated to yet another target space representation, classifier or cost function adjustment. A first glance didn't show anything interesting here, unfortunately. Discriminative training is probably the most recent advance in ASR.

Still an enormous amount of old-style phonetic research. Is vowel length a feature? How do Zulu people click? Sometimes it's interesting to read though.

Almost all TTS is about HMM-based speech synthesis. The quality of audio for TTS is a problem. I've recently read the good and very detailed review by Dr. Zen; even adepts of the approach know that a hybrid of HMM and unit selection is better.

A surprisingly short section on new methods and paradigms, unfortunately.

New trends include emotions, machine speech-to-speech translation and language acquisition. The combination of visual and speech recognition is surprisingly common.

No Russians at all. Well, not strange, Russian speech technology doesn't exist in fact.

The RWTH Aachen University Open Source Speech Recognition System is terrific news. The source is available, downloaded and ready for investigation.

"Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate", no link available yet unfortunately. Should be a very interesting reading. The only problem that arises here is that someone should do the merge. The issue is that source is available but really it's very hard to integrate with the research-oriented system.

I'm also waiting for the Blizzard 2009 results, which should have been presented but are still not available.

A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game - we have wanted that for Voxforge for a long time.

In the next few posts I'll probably cover some interesting topics in more detail. If you were at the conference or saw something interesting, comments are appreciated.

Modern ASR practices review

I was never able to completely join the scientific world, most probably because engineering tasks are more attractive. Though I graduated as a mathematician, my merits aren't worth mentioning. For example, the thing I never liked is writing, in particular writing a scientific article. That's the cornerstone of science now, but to me it seems a very dated practice. Most articles are never read, a huge percentage have errors, and many are completely wrong or repeat other sources. Of course there are brilliant ones.

From my point of view, knowledge should probably be organized in a different way, something like a software project. A theory could be built up over the ages in wiki style, with all changes tracked, and could contain complementary information like technical notes, software implementations, test results, formalized proofs and so on. Of course software projects also have issues like forks, bad maintenance and bugs, but they seem to be more organized.

That's why I really like projects that keep knowledge in a structure like Wikipedia, PlanetMath for example. Reviews of the state of the art are of course invaluable too. Today I spent some time processing my library and again found the wonderful review by Mark Gales:

The Application of Hidden Markov Models in Speech Recognition

I would really recommend this book as a basic introduction to modern speech recognition methods. Though written by an HTK author, it has little that is HTK-specific and really focuses on best practices in ASR systems.

P.S. Is there a personal library management software, web-based, able to store and index PDF? I used to install Dspace at work, but it's so heavy and the UI is really outdated.

Initial value problem for MLLT

With the help of mchammer2007, I recently discovered a problem with the estimation of the initial matrix for MLLT training. MLLT, or Maximum Likelihood Linear Transform, was suggested by R. A. Gopinath in "Maximum Likelihood Modeling with Gaussian Distributions for Classification", in the proceedings of ICASSP 1998, and is implemented in SphinxTrain.

The idea is that a matrix that transforms the feature space is trained so that the covariance matrices look more diagonal. The optimization is quite simple gradient descent, but unfortunately it suffers from an initial value problem: if you choose a proper initial value you can get much better results. Right now a random matrix is used:

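from numpy import eye
from numpy.linalg import det
from numpy.random import random

# Fragment of the training method; self.cov is the list of covariance matrices
if A is None:
    # Initialize it with a random positive-definite matrix of
    # the same shape as the covariances
    s = self.cov[0].shape
    d = -1
    # Redraw until the determinant is positive
    while d < 0:
        A = eye(s[0]) + 0.1 * random(s)
        d = det(A)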

And depending on your luck you could get better or worse recognition results, sometimes even worse than the usual training without LDA/MLLT.

SENTENCE ERROR: 55.4% (72/130) WORD ERROR RATE: 17.5% (135/773)
SENTENCE ERROR: 51.5% (66/130) WORD ERROR RATE: 16.6% (128/773)
SENTENCE ERROR: 50.0% (65/130) WORD ERROR RATE: 15.5% (119/773)
SENTENCE ERROR: 56.2% (73/130) WORD ERROR RATE: 16.9% (130/773)
SENTENCE ERROR: 62.3% (80/130) WORD ERROR RATE: 22.3% (172/773)

So the recipe for training is the following: train several times and monitor the accuracy, choose the best MLLT matrix and use it in the final training runs. If you have a large database, find the best MLLT for a subset of it and use it as the initial value for MLLT estimation. There is no easier way until we find a better method for initial value estimation; a quick look at the articles didn't turn up any.
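Selecting the best of several runs is easy to script. The sketch below takes the five word error rates reported above, picks the best run and shows how wide the spread is. It only post-processes the numbers; the training and decoding runs themselves are not shown, and the variable names are invented for this example.

```python
import statistics

# Word error rates (percent) of the five runs listed above, in order.
wer_percent = [17.5, 16.6, 15.5, 16.9, 22.3]

# Index of the run with the lowest word error rate.
best_run = min(range(len(wer_percent)), key=wer_percent.__getitem__)

mean_wer = statistics.mean(wer_percent)
spread = statistics.stdev(wer_percent)  # sample standard deviation across runs

print(f"best run: #{best_run + 1} with WER {wer_percent[best_run]:.1f}%")
print(f"mean WER {mean_wer:.2f}%, standard deviation {spread:.2f} points")
```

A standard deviation of more than two points from nothing but the random initial matrix is a good reason not to judge MLLT from a single training run.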

From recent articles I also got quite a significant collection of LDA derivatives: discriminative ones, HLDA and so on. Some of them seem to be free from this initial value problem. It would be nice to get a proper review of this large topic.

By the way, you can see in the chunk of code above that the comment is not quite correct. The positive-definiteness of the matrix should be checked differently, with the Sylvester criterion for example. Though I think that since the condition det(A) > 0 seems to be enough for the feature space transform, the comment should simply be removed. But possibly a positive-definite matrix is required for the optimization.
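To illustrate the difference between the two conditions, here is a small self-contained sketch. It is not SphinxTrain code and uses plain Python lists rather than numpy so that it runs anywhere. Since Sylvester's criterion is stated for symmetric matrices, the check is applied to the symmetric part (A + A^T) / 2, which defines the same quadratic form x^T A x. The function names are made up for this example.

```python
def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def is_positive_definite(m):
    """Sylvester's criterion: all leading principal minors of the symmetric part are > 0."""
    n = len(m)
    sym = [[(m[i][j] + m[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    return all(
        determinant([row[:k] for row in sym[:k]]) > 0 for k in range(1, n + 1)
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
flipped = [[-1.0, 0.0], [0.0, -1.0]]  # det is +1, but x^T A x < 0 for every nonzero x

print(determinant(flipped), is_positive_definite(flipped))
print(determinant(identity), is_positive_definite(identity))
```

The flipped matrix passes the det(A) > 0 test yet is negative definite, so the loop condition in the training code does not guarantee what its comment claims.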

Adaptation Methods

It's really hard to collect information on the practical application of speech recognition tools. Take, for example, this wonderful quote from Andrew Morris on htk-users about what to update during MAP adaptation:

Exactly what it is best to update depends on how much training data you have, but in general it is important to update means and inadvisable to update variances. Only testing on held out test data can decide which is best, but if you are training on data from many speakers and then adapting to data from just one speaker, I expect updating just means should give best results, with variance adaptation reducing performance and transition probs or mix weights adaptation making little difference.

After a few experiments I can only confirm this statement. You should never adapt the variances. So the HOWTO in our wiki is not as good as it could be. Another bit can be taken from this document: it's really better to combine MAP and MLLR this way, and the best method for offline adaptation is:
  • Run bw to collect statistics
  • Estimate the MLLR transform
  • Update the means with MLLR
  • Run bw again with updated means
  • Apply MAP adaptation with a fixed tau greater than 100 (try to select the best value). Unfortunately, in my experience automatic tau selection is broken in map_adapt. This way you'll update the variances a bit, but only slightly.
No book could tell you that!
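To see why a large fixed tau keeps the update conservative, here is a minimal sketch of the textbook MAP update for a single Gaussian mean component, where tau weights the prior mean against the adaptation data. Whether map_adapt computes exactly this is an assumption on my part, and the function name and the numbers are invented for the illustration.

```python
def map_mean(prior_mean, weighted_sum, occupancy, tau):
    """Textbook MAP mean: (tau * prior + sum of weighted observations) / (tau + occupancy).

    weighted_sum and occupancy stand for the statistics accumulated over the
    adaptation data for one Gaussian.
    """
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)


# Hypothetical Gaussian: prior mean 0.0, adaptation data averaging 1.0
# with a total occupancy of 50 frames (so the weighted sum is 50.0).
for tau in (10.0, 100.0, 1000.0):
    adapted = map_mean(0.0, 50.0, 50.0, tau)
    print(f"tau={tau:6.0f} -> adapted mean {adapted:.3f}")
```

The larger tau is, the more frames a Gaussian needs to see before it moves away from the prior, which in this procedure already contains the MLLR-updated means.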