Pocketsphinx Success Story

I was pleased to discover the renovated website of Keith Vertanen and an amazing real-life example of building a pocketsphinx application: Parakeet, a dictation app with correction for the Nokia N800:


Keith's website and models are an invaluable resource for Sphinx developers; in particular, his lm_giga models are still the models I would recommend taking for adaptation. But seeing this application in action and reading about its development should give a really good insight into the process of building a speech recognition application, with all the good practices described.

Core Ideas Behind Speech Recognition

While tuning the acoustic model I again got 40% WER, and in the log the following:

***     HAD  BARELY  LEAD CANDY a   CLASSIC  (a100)
Words: 7 Correct: 1 Errors: 6 Percent correct = 14.29% Error = 85.71% Accuracy = 14.29
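
The per-utterance numbers in a log like this come from a standard edit-distance alignment of the reference against the hypothesis. A minimal sketch of the computation (not the actual SphinxTrain/sclite implementation):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein alignment of two word lists."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i            # deletions
    for j in range(len(h) + 1):
        d[0][j] = j            # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("this is a test", "this was test"))  # 2 errors / 4 words = 0.5
```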

If you check this recognition error, you'll find it's almost impossible to find its reason and fix it. Maybe some senone was trained incorrectly, maybe CMN introduced an error, or clipping made the MFCCs wrong. Maybe some noise in the middle broke the search. There is nothing you can do about it. That made me think about the foundations of ASR.

Considering a speech recognition engine like sphinx4, one could extract the set of core ideas that lie behind it. The same ideas are usually described in speech recognition textbooks. Basically they are:
  • MFCC feature extraction from periodic frames (or PLP, doesn't matter)
  • HMM classifier for acoustic scoring (with state tying)
  • Trigram word-based language model (higher-order n-grams aren't effective, lower-order ones aren't precise enough)
  • Dynamic search with pruning
Surely commercial systems have a lot of improvements over this baseline, but the core is still the same. Such foundations are certainly reasonable and have been checked in practice over the years. It's hard to argue against them. Often newbies claim that something is wrong here, but usually it's because they don't really understand how it works. Criticism comes from old-school linguists, who do everything with rules and are mostly interested in usual cases like the pronunciation of "schedule" rather than in theory.
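
The second and fourth bullets can be illustrated together with a toy time-synchronous Viterbi pass that keeps only hypotheses within a beam of the best score. This is a sketch with made-up scores and a three-state left-to-right topology, not sphinx4's actual search:

```python
import math

def viterbi_beam(obs_scores, trans, beam=10.0):
    """Toy time-synchronous Viterbi with beam pruning.
    obs_scores[t][s]: log-likelihood of state s at frame t (made-up numbers).
    trans[s]: list of (next_state, log_prob) arcs."""
    active = {0: 0.0}  # state -> best log score; start in state 0
    for frame in obs_scores:
        nxt = {}
        for s, score in active.items():
            for s2, tp in trans[s]:
                cand = score + tp + frame[s2]
                if cand > nxt.get(s2, -math.inf):
                    nxt[s2] = cand
        best = max(nxt.values())
        # prune every hypothesis that falls outside the beam
        active = {s: v for s, v in nxt.items() if v > best - beam}
    return max(active.values())

# three states, left-to-right topology with self-loops
trans = {0: [(0, -0.1), (1, -2.0)],
         1: [(1, -0.1), (2, -2.0)],
         2: [(2, -0.1)]}
obs = [{0: -1.0, 1: -5.0, 2: -9.0},
       {0: -4.0, 1: -1.0, 2: -5.0},
       {0: -9.0, 1: -4.0, 2: -1.0}]
print(viterbi_beam(obs, trans))  # best final score, about -7.1
```

A tighter beam drops more hypotheses per frame, trading accuracy for speed, which is exactly the beam tuning mentioned later in this blog.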

The only issue is that a growing number of unsolvable, unexplainable problems like the accuracy problem above breaks this theory. Quite an unusual fact for me as a mathematician, since in mathematics a theory rarely becomes invalid. Theories transform and grow, but usually they are stated once and forever. In natural sciences like physics it's usual. The aether theory and the mechanical explanation of gravitation are the good examples that come to my mind. So there is nothing wrong in the idea that this ideology of speech recognition could be reviewed and modified according to recent findings.

What would I put into such modified theory:
  • Multiresolution feature extraction, from RASTA to fMPE and spikes. The idea is that signals are sparse and nonperiodic, ranging from 10 ms to more than 10 seconds, and they all need to be passed into the classifier.
  • An acoustic classifier without preselected states. The idea of a phone is probably natural in slow speech or in teaching, but I've heard so many complaints about it. Dropping it seems promising indeed, since speech is a process, not a sequence of states. Unfortunately I haven't found any article on this yet. Another promising idea here is margins, which could help with out-of-model sounds.
  • A subword stage. I increasingly think that languages with developed morphology like Turkish are the rule rather than the exception. Being able to recognize a large set of words in the language is a core capability of a usable recognizer, and that forces it to operate on subword units. Even an English recognizer could benefit from this.
  • A language model without backoff. I recently had a discussion with David about this and would like to thank him for the idea. Indeed, the counts of the model seem to be a reasonable statistic one could keep and use, but the further calculation of the language weight should be modified completely. Again, there must be a margin to strip combinations that will never appear in the language. This idea of using prohibitive rules has stayed in my mind for a long time. It would also be nice to find any recent articles on this. In any case, there must be a component that will invalidate output like "barely lead candy".
  • Machine learning for backoff calculations. In continuation of the previous point, the backoff weight should have a much more complex structure. Not only the trigrams containing the words need to be taken into account; a semantic class should be counted, and trigrams with a similar class of words ought to be considered. Today I even had the idea of applying machine learning to calculate the backoffs. I'm sure someone did this before; I also need to look at articles about using machine learning methods to restrict the search.
As for the tree search, it will luckily stay as is; there is nothing to argue against it right now. I'm not sure such modifications really break the initial theory; one could say they aren't that different. I still think they could explain speech better and help to build a better recognizer.

New CMUSphinx Website Alpha

Most CMU Sphinx websites are outdated. The problems with the one at sourceforge are:
  • Not so modern style
  • No interactivity
  • Loosely organized outdated information
  • Hard to manage/update
  • No CMS/search
Also, there is a general problem with the quality of the available documentation. A lot of it is quite outdated and just confusing.

So I have wanted to build a new website for a long time. This site is supposed to be the central point for all Sphinx tools, including pocketsphinx, sphinx4, cmuclmtk and sphinxtrain.

The new website is supposed to be interesting. It is going to bring more interactivity (sharing, blog posts, voting, comments). It looks a little bit bloggish, but I think that's even better. It will be hard to write interesting posts alone, so I invite everyone to participate. I'm sure you have something to say.

So here is the proposed demo version

We are in the process of transferring the information to the new website, so I really hope to see it running very soon.

How to create a speech recognition application for your needs

Sometimes people ask: why are there no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-captions alignment, language acquisition and so on)? The answer, obviously, is that nobody wrote them and made them public. It's often noted, for example by Voxforge, that we lack the database for the acoustic model. I admit Voxforge has its reasons to state that we need a database. But that's only a small part of the problem, not the problem as a whole.

And, as always happens, this statement of the question doesn't allow a constructive answer. To get a constructive answer you need the following question: how do I create a speech recognition application?

To answer this, let me provide an example. Suppose we want to develop a flash-based dictation website. The dictation application consists of the following parts, which should be created:

  • Website, user accounting, user-dependent information storage 
  • Initial acoustic and language models trained with Voxforge audio and other free sources transmitted through Flash codecs
  • Recognizer setup to convert incoming streams into text. Distributed computation framework for the recognizer
  • Recognizer frontend with noise cancellation and VAD
  • Acoustic model adaptation framework to let user adapt the generic acoustic model to their pronunciation 
  • Language model adaptation framework
  • Transcription control package that will process commands during dictation, such as error correction and punctuation commands
  • Post-processing package for punctuation and capitalization, plus date and acronym post-processing
  • Test framework for dictation with dictation recordings and ability to check dictation effectiveness
Everything above could be done with open source tools; the parts have approximately equal complexity and require a minimum of specialized knowledge. Performance-wise, this system should be usable for large vocabulary dictation for a wide range of users. The core components are:
  • Red5 streaming server
  • Adobe Flex SDK
  • Sphinx4
  • Sphinxtrain
  • Language model toolkit
  • Voxforge acoustic database
So you see, mostly it's just an implementation of existing algorithms and technologies. No rocket science. This makes me think that such an application is just a matter of time.

Multiview Representations On Interspeech

From my experience, in every activity it's important to have a multilevel view of it; interestingly, this is both part of Getting Things Done and just a good practice in software development. Multiple models of the process, or just different views, help to understand what's going on. The only problem is to keep those views consistent. That reminds me of the Russian model of the world.

So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here come some topics from the plenary talks at Interspeech. Surprisingly, they are rather consistent with each other, and I hope they really represent trends, not just selected topics.

Speech To Information
by Mari Ostendorf

A multilevel representation gets more and more important, in particular in speech recognition. The most complicated task, transcription of spontaneous meetings, requires unification of the recognition efforts on all levels, from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information": as a result of speech recognition, not just the words are recovered but also the syntactic and semantic structure of the talk. One of the interesting tasks, for example, is restoration of punctuation and capitalization, something that SRILM does.

The good thing is that a testing database for such material is already available for free download. It's a very uncommon situation to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.

Single Method
by Sadaoki Furui

The WFST-based T3 decoder looks quite impressive. A single method of data representation used everywhere, which more importantly allows combination of the models, gives wonderful opportunities. For example, consider building a high-quality Icelandic ASR system by combining the WFST for an English one and a very basic Icelandic one. I imagine the decoder is really simple, since basically all structures, including G2P rules, the language model and the acoustic model, can be weighted finite-state automata.
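
The "single method" idea, that models combine through standard automata operations, can be sketched with a toy intersection of two weighted acceptors in the tropical (min, +) semiring. The two acceptors and their weights are invented for illustration; this is nothing like T3's optimized implementation:

```python
import heapq

def intersect(fsa1, fsa2):
    """Intersect two weighted acceptors. Arcs are (src, label, dst, weight);
    weights add (tropical semiring) and states become pairs."""
    return [((s1, s2), a, (d1, d2), w1 + w2)
            for s1, a, d1, w1 in fsa1
            for s2, b, d2, w2 in fsa2
            if a == b]

def best_path(arcs, start, final):
    """Cheapest path cost from start to final (Dijkstra over the arcs)."""
    heap, seen = [(0.0, start)], set()
    while heap:
        cost, state = heapq.heappop(heap)
        if state == final:
            return cost
        if state in seen:
            continue
        seen.add(state)
        for src, _, dst, w in arcs:
            if src == state:
                heapq.heappush(heap, (cost + w, dst))
    return None

# toy "language model" and "pronunciation model" acceptors over shared labels
lm = [(0, "hi", 1, 0.5), (0, "ho", 1, 1.5)]
pron = [(0, "hi", 1, 1.0), (0, "ho", 1, 0.2)]
combined = intersect(lm, pron)
print(best_path(combined, (0, 0), (1, 1)))  # min(0.5 + 1.0, 1.5 + 0.2) = 1.5
```

The point is that the combination step is generic: swapping in a better language model acceptor changes nothing in the decoder itself.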

Bayesian Learning
by Tom Griffiths

Hierarchical Bayesian learning and things like compressed sensing seem to be hot topics in machine learning. Google does that. There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed audio to the recognizer and make it understand you.

Though the probabilistic point of view was always questionable compared to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. Even though a huge amount of audio is required (I remember estimates of about 100,000 hours), I think it's feasible nowadays. For example, it already recognizes written digits, so success looks really close. And again, it's also multilevel!

A few open source speech projects

It's interesting how much activity around speech software has happened recently. I'm probably too impatient, trying to track everything interesting. Even though ISCA-students added a twitter feed recently, their website still needs a lot of care. Hopefully, Voxforge will become such a resource one day. There is a growing number of packages, tools, projects and events.

For example, I've recently got in touch with the SEMAINE project led by DFKI, an effort to build a multimodal dialogue system which can interact with humans through a virtual character, sustain an interaction with a user for some time and react appropriately to the user's non-verbal behaviour. The sources are available and the new release is expected in December as far as I understood, so I'm definitely looking forward to it. The interesting thing is that SEMAINE incorporates an emotion recognition framework with libSVM as a classifier; such a framework would be useful in sphinx4, for example. Actually, a lot of news comes now from the European research institutes; projects from RWTH or TALP promise a lot.

Another example: I was pleased to find out that in 2009 there was a rich transcription evaluation. It's interesting why the results still aren't available, and what the progress on the meeting transcription task has been since 2007.

Probably I would sleep better if I didn't know all of the above :)

Using SRILM server in sphinx4

Recently I've added support for the SRILM language model server to sphinx4, so it's possible to use much bigger models during the search while keeping the same memory requirements and, more importantly, during lattice rescoring. Lattice rescoring is still in progress, so here is how to use a network language model during search.

SRILM has a number of advantages: it implements a few interesting algorithms, and even for simple tasks like trigram language model creation it's way better than cmuclmtk. At the very least, model pruning is supported.

To start, first dump the language model vocabulary, since it's required by the linguist:

ngram -lm your.lm -write-vocab my.vocab

Then start the server with:

ngram -server-port 5000 -lm your.lm

Configure the recognizer:

<component name="rescoringModel"
           type="edu.cmu.sphinx.linguist.language.ngram.NetworkLanguageModel">
    <property name="port" value="5000"/>
    <property name="location" value="your.vocab"/>
    <property name="logMath" value="logMath"/>
</component>

And start the lattice demo. You'll see the result soon.

Adjust the cache according to the size of your model. It shouldn't be large for a simple search; typically the cache size isn't more than 100000.

Still, using a large n-gram model is not reasonable for a typical search because of the large number of word trigrams that would have to be tracked. It's more efficient to use a trigram or even bigram model first and make a second recognizer pass, rescoring with the large language model. More details on rescoring in the next posts.
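
The two-pass idea can be sketched on an n-best list: decode with the small model, keep each hypothesis's score, then re-rank with the big model. All the scores and hypotheses below are invented for illustration; this is not the actual sphinx4 rescoring code:

```python
def rescore(nbest, big_lm_score, lm_weight=1.0):
    """Re-rank first-pass hypotheses: combine the acoustic score kept
    from the first pass with a better language model's score."""
    rescored = [(acoustic + lm_weight * big_lm_score(text), text)
                for text, acoustic in nbest]
    return max(rescored)[1]

# first pass with a small model: (hypothesis, acoustic log score)
nbest = [("barely lead candy", -10.0), ("had barely led canada", -11.0)]
# stand-in for the SRILM server: the big model heavily penalizes nonsense
big_scores = {"barely lead candy": -20.0, "had barely led canada": -8.0}
print(rescore(nbest, big_scores.get))  # "had barely led canada"
```

Lattice rescoring works the same way, except that the big model scores edges of the whole lattice rather than a flat n-best list.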

Rhythm of British English in Festival

It's interesting how ideas arise from time to time in seemingly unrelated places. Recently I read a nice post in John Wells's blog about the proper RP English rhythm, and now that issue was raised again in the gnuspeech mailing list, where Dr. Hill cited his work:

JASSEM, W., HILL, D.R. & WITTEN, I.H. (1984) Isochrony in English speech: its statistical validity and linguistic relevance. Pattern, Process and Function in Discourse Phonology (collection ed. Davydd Gibbon), Berlin: de Gruyter, 203-225 (J)

I spent some time thinking about how this rhythm is handled in Festival and came to the conclusion that there is no such entity there. Probably it's somehow handled by the CART for duration and intonation prediction, but not as a separate entity. Though many voices are supposed to be US English, I still think they could benefit from proper rhythm prediction. Try the example from the movie, "This is the house that Jack built", with the arctic voices. Check if Jack gets enough stress.

Blizzard 2009 results available

It was pleasant to find out that the results of the Blizzard Challenge 2009 are now available. Thanks a lot to the organizers and participants!

Reading the articles took me half a day, trying to solve the usual Einstein-type puzzle of figuring out who gave the best results and what was changed. Unfortunately it takes too much time to read everything in detail. There is no summary of the methods/systems used this year, the achievements since last year, or explanations of the results. I could only start with the following:

  1. iFlytek Speech Lab and IVO Software are still the best. Unit selection systems win.
  2. DFKI, which I was a fan of, unfortunately can't jump to a commercial level even with unit selection. That probably means that unit selection is not the only key issue.
  3. I like the progress muXac and Mike have been making over the years.
  4. The ES3 task of building a voice from a small amount of speech is kind of senseless. Don't we want to use voice adaptation in this case?
  5. It's interesting that machine learning for join and target cost optimization is popular nowadays.
  6. Though there was a telephone TTS task, it seems to me that nobody did anything related to TTS over telephone lines. The differences shouldn't be large (only the 8kHz sampling is an issue, or even an advantage), but even this point is not covered in any article, or at least I didn't notice it.
A short summary of the systems:
  • Aholab - unit selection; they spent one day on building the voice, so there was nothing good to expect
  • WISTON - Mandarin prosody is a key feature, but the article doesn't describe the challenge entry
  • Cereproc - an experiment with combining HTS and unit selection; bad results for an unknown reason, 4 man-days spent
  • CMU - article is not available, but you can try clustergen yourself in stock festival
  • CSTR - CSTR has started investigations on HTS methods. Good start, no results yet.
  • DFKI - spent year on adding Turkish TTS and Mary 4.0 implementation
  • Edinburgh/Idiap - interesting unsupervised entry, results are obviously lower
  • I2R - good TTS, unit selection
  • Ivona - unit selection with pitch modifications by interestingly named algorithm, best English one together with iFlytek
  • CircumReality - unit selection with pitch modification by TD-PSOLA, best progress over years
  • NICT - HTS, GV, MGE and a lot of math
  • NIT - HTS with STRAIGHT, best HTS here, best Mandarin as well
  • NTUT - Mandarin HTS, not so interesting
  • PKU - another Mandarin HTS with STRAIGHT
  • Toshiba - Good unit selection system, interesting method about fuzzy combining units.
  • iFlytek - HMM-driven unit selection, best English one together with Ivona.
  • VUB - unit selection with WPSOLA, average, though with an interesting link to the SPRAAK open source recognition toolkit, which is not completely open but has an interesting description.
Still, the challenge itself is very interesting and I'm looking forward to the next challenge results.

Another cool bit of hardware for database training

It's sometimes hard to quickly adopt the new opportunities the world provides. I'm now reading Innovator's Dilemma by Clayton M. Christensen. Thanks to Ellias for the advice; it really seems like a good book.

The interesting thing is that the author starts with a description of the hard drive industry as the fastest one, with innovations going faster than customer needs. And what do you think? The hard drive industry strikes back with SSD drives. Well, I read that they exist but didn't understand their value for acoustic model training. Even without profiling it's clear they will be extremely useful.

Say you have a medium-sized acoustic database: 60 hours, a few gigabytes. If you want to process it fast you need to use an 8-core machine. Here comes the bottleneck: imagine 8 processes reading feature vectors from a disk in an almost random way. No need to guess that the hard drive will be very busy trying to fetch all the data required. An SSD could definitely help here; I really need to try it soon.

CMU Sphinx Users and Developers Workshop 2010

I'm happy to announce

The First CMU Sphinx Workshop

20 March 2010, Dallas, TX, USA

Event URL: http://www.cs.cmu.edu/~sphinx/Sphinx2010

Papers are solicited for the CMU Sphinx Workshop for Users and Developers (CMU-SPUD 2010), to be held in Dallas, Texas as a satellite to ICASSP 2010.

CMU Sphinx is one of the most popular open source speech recognition systems. It is currently used by researchers and developers in many locations world-wide, including universities, research institutions and industry. CMU Sphinx's liberal license terms have made it a significant member of the open source community and have provided a low-cost way for companies to build businesses around speech recognition.

The first SPUD workshop aims at bringing together CMU Sphinx users, to report on applications, developments and experiments conducted using the system. This workshop is intended to be an open forum that will allow different user communities to become better acquainted with each other and to share ideas. It is also an opportunity for the community to help define the future evolution of CMU Sphinx.

We are planning a one-day workshop with a limited number of oral presentations, chosen for breadth and stimulation, held in an informal atmosphere that promotes discussion. We hope this workshop will expose participants to different perspectives and that this in turn will help foster new directions in research, suggest interesting variations on current approaches and lead to new applications.

Papers describing relevant research and new concepts are solicited on, but not limited to, the following topics. Papers must describe work performed with CMU Sphinx:
  • Decoders: PocketSphinx, Sphinx-2, Sphinx-3, Sphinx-4
  • Tools: SphinxTrain, CMU/Cambridge SLM toolkit
  • Innovations / additions / modifications of the system
  • Speech recognition in various languages
  • Innovative uses, not limited to speech recognition
  • Commercial applications
  • Open source projects that incorporate Sphinx
  • Novel demonstrations
Manuscripts must be between 4 and 6 pages long, in standard ICASSP double-column format. Accepted papers will be published in the workshop proceedings.

Important Dates

Paper submission: 30 November 2009
Notification of paper acceptance: 15 January 2010
Workshop: 20 March 2010


Bhiksha Raj - Carnegie Mellon University
Evandro Gouvêa - Mitsubishi Electric Research Labs
Richard Stern - Carnegie Mellon University
Alex Rudnicky - Carnegie Mellon University
Rita Singh - Carnegie Mellon University
David Huggins-Daines - Carnegie Mellon University
Nickolay Shmyrev - Nexiwave
Yannick Estève - Laboratoire d'Informatique de l'Université du Maine


To email the organizers, please send email to sphinx+workshop@cs.cmu.edu

Using HTK models in sphinx4

As of yesterday, the long-awaited cool patch by Christophe Cerisara, with the help of the super fast Yaniv Kunda, has landed in svn trunk. Now you can use HTK models directly from sphinx4. It's not trivial though: I spent a few hours today figuring out the required steps, so here is a little step-by-step howto:

1. Update to sphinx4 trunk

2. Download a small model, because binary loading is currently not supported unfortunately, and it takes a lot of resources to load a model from a huge text file. Get a model from Keith Vertanen.


3. Convert the model to text format with HTK's HHEd:

mkdir out
touch empty
HHEd -H hmmdefs -H macros -M out empty tiedlist

4. Replace model in Lattice demo in configuration file:

<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="wsjLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.HTKLoader">
    <property name="logMath" value="logMath"/>
    <property name="modelDefinition" value="/home/shmyrev/sphinx4/wsj/out/hmmdefs"/>
    <property name="unitManager" value="unitManager"/>
</component>

Please note that the modelDefinition property points to the location of the newly created hmmdefs file.

5. Replace the frontend configuration to load HTK features from a file. Unfortunately it's impossible to create HTK features with the sphinx4 frontend right now, but I hope this will be implemented soon. Some bits are already present, like the DCT-II transform in frontend.transform.DiscreteCosineTransform2; some are easy to set up, like proper filter coefficients; some are missing. So for now we'll recognize an MFC file instead.

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>streamHTKSource</item>
    </propertylist>
</component>
<component name="streamHTKSource" type="edu.cmu.sphinx.frontend.util.StreamHTKCepstrum">
    <property name="cepstrumLength" value="39"/>
</component>

and let's change the Java file

StreamHTKCepstrum source = (StreamHTKCepstrum) cm.lookup ("streamHTKSource");
InputStream stream = new FileInputStream(new File ("input.mfc"));

6. Now let's extract the MFC file. Create a config file for HCopy:

TARGETRATE = 100000.0
WINDOWSIZE = 250000.0

and run it

HCopy -C config 10001-90210-01803.wav input.mfc

Make sure input.mfc is located in the top sphinx4 folder, since that's where we'll read it from.

7. Now everything is ready

ant && java -jar bin/LatticeDemo.jar

Check the result

I heard: once or a zero zero one nine oh to one oh say or oil days or a jury

It's not very precise, but still OK for such a small model and a limited language model.

This is still a work in progress and a lot of things are still pending. The most important are reading binary HTK files, frontend adaptation, cleanup and unification. But I really look forward to the results, since it's a promising approach. There are not so many BSD-licensed HTK decoders out there.

Speech Recognition As Experimental Science

It's well known that there are two types of physics: theoretical and experimental. In school I always liked doing the latter: measuring the speed of a ball or a voltage, plotting the graphs and so on. Unfortunately, in later days I was mostly doing math or programming. Only recently, when I started to spend a lot of time on speech recognition, did I find why I like it so much: it's also an experimental science.

When you build a speech recognition system, your time is mostly spent on all these beautiful things: setting up the database training, running the learning process, tracking the results. You are trying to understand nature and find its laws; you want to find the best feature set and phoneset, find the beams, and more and more. You have experimental material, and sometimes it turns out there are things you forgot to take into account. It's an activity that's really encouraging.

Of course there are important drawbacks; issues like proper design of the experiments arise. Unfortunately this is not widely described in the literature, but speech recognition experiments are just examples of experiments, so all the usual issues apply to them. To list a few:
  • Reproducibility
  • Connection between theory and practice
  • Estimation of the results and their validity
For example, the last point is very important. Currently, when we run a database test, we just get a number. We try to rely on it without even estimating the deviation and other very important attributes of every scientific measurement. As a result we make unreliable decisions, like I did with the MLLT transform. I now think we should be more careful about that.
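
One standard remedy is a bootstrap estimate: resample the test utterances with replacement and look at the spread of WER across resamples, instead of trusting a single number. A minimal sketch, with toy per-utterance error counts in place of real decoder output:

```python
import random

def bootstrap_wer(per_utt, n_resamples=1000, seed=42):
    """per_utt: list of (errors, words) pairs, one per test utterance.
    Returns a 95% bootstrap interval for the corpus WER."""
    rng = random.Random(seed)
    wers = []
    for _ in range(n_resamples):
        # resample utterances with replacement and recompute corpus WER
        sample = [rng.choice(per_utt) for _ in per_utt]
        errors = sum(e for e, w in sample)
        words = sum(w for e, w in sample)
        wers.append(errors / words)
    wers.sort()
    return wers[int(0.025 * n_resamples)], wers[int(0.975 * n_resamples)]

# toy per-utterance counts; a real run would take these from the decoder log
rng = random.Random(0)
per_utt = [(rng.randint(0, 3), 10) for _ in range(130)]
lo, hi = bootstrap_wer(per_utt)
print("95%% interval: %.3f - %.3f" % (lo, hi))
```

If the intervals of two training setups overlap heavily, the test set simply cannot tell them apart, which is exactly the situation with a single WER number that looks "better".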

So that's why I started with the aforementioned wikipedia page, trying to find a good book on experiment design, and of course it would be nice to find appropriate software for an experiment management workflow.

A First Glance At The Interspeech 2009 Papers

Interspeech 2009 in Brighton ended today. Unfortunately I wasn't able to participate, for various reasons. Still, it was very interesting to review the list of sessions and abstracts and to read some of the available articles. The modern activity in speech research is amazing; the number of articles and groups is enormous: in total I counted 459 abstracts with grep. It was enjoyable to process them all. Currently I've reduced the list to 50% of the original size, so I still need a few more passes to find the most interesting items. A few random thoughts I've got:

Sphinx is mentioned 2 times and HTK only once :), that's a win. Of course many researchers use HTK for experiments, so it's more a win in being more open.

A lot of machine learning research, and quite a significant amount of it is dedicated to yet another target space representation/classifier/cost function adjustment. The first glance didn't show anything interesting here, unfortunately. Discriminative training is probably the most recent advance in ASR.

Still an enormous amount of old-style phonetic research. Is vowel length a feature? How do Zulu people click? Sometimes it's interesting to read though.

Almost all TTS is about HMM-based speech synthesis. The quality of audio for TTS is a problem. I recently read the very detailed and good review by Dr. Zen; even adepts of the approach know that a hybrid of HMM and unit selection is better.

A surprisingly short section on new methods and paradigms, unfortunately.

New trends include emotions, machine speech-to-speech translation, and language acquisition. The combination of visual and speech recognition is surprisingly common.

No Russians at all. Well, not strange, Russian speech technology doesn't exist in fact.

The RWTH Aachen University Open Source Speech Recognition System is terrific news. The source is available, downloaded and ready for investigation.

"Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate", no link available yet unfortunately. It should be very interesting reading. The only problem that arises here is that someone will have to do the merge: the source is available, but it's really hard to integrate with a research-oriented system.

I'm also waiting for Blizzard 2009 results that should be presented but still not available.

A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game - we've wanted that for Voxforge for a long time.

In the next few posts I'll probably cover some interesting topics in more detail. If you were at the conference or saw something interesting, comments are appreciated.

Modern ASR practices review

I was never able to completely join the scientific world, most probably because engineering tasks are more attractive to me. Though I graduated as a mathematician, my merits aren't worth mentioning. For example, the thing I never liked is writing, in particular writing scientific articles. It's the cornerstone of science now, but to me it seems a very dated practice. Most articles are never read, a huge percentage contain errors, and many are completely wrong or just repeat other sources. Of course, there are brilliant ones.

From my point of view, knowledge should probably be organized in a different way, something like a software project. The theory could be built over the ages in a wiki style, with all changes tracked, and probably contain complementary information like technical notes, software implementations, test results, formalized proofs and so on. Of course, software projects also have issues like forks, bad maintenance and bugs, but it seems they are better organized.

That's why I really like the projects that keep knowledge in a structure, like wikipedia or planetmath. Also, reviews of the state of the art are of course invaluable. Today I spent some time processing my library and found again the wonderful review by Mark Gales:

The Application of Hidden Markov Models in Speech Recognition

I would really recommend this review as a basic introduction to modern speech recognition methods. Though written by an HTK author, it has little HTK-specific content and is really focused on best practices in ASR systems.

P.S. Is there personal library management software, web-based, able to store and index PDFs? I used to run DSpace at work, but it's so heavy and the UI is really outdated.

Initial value problem for MLLT

Recently, with the help of mchammer2007, I discovered a problem with the estimation of the initial matrix for MLLT training. MLLT, or Maximum Likelihood Linear Transform, was suggested in R. A. Gopinath, "Maximum Likelihood Modeling with Gaussian Distributions for Classification", in proceedings of ICASSP 1998, and is implemented in Sphinxtrain.

The idea is that a matrix modifying the feature space is trained to optimize the covariances, making the covariance matrices look more diagonal. The optimization is a quite simple gradient descent, but unfortunately it suffers from an initial value problem: if you choose a proper initial value, you can get much better results. Right now a random matrix is used:

if A is None:
    # Initialize it with a random positive-definite matrix of
    # the same shape as the covariances
    s = self.cov[0].shape
    d = -1
    while d < 0:
        A = eye(s[0]) + 0.1 * random(s)
        d = det(A)

And depending on your luck you can get better or worse recognition results, sometimes even worse than the usual training without LDA/MLLT:

SENTENCE ERROR: 55.4% (72/130) WORD ERROR RATE: 17.5% (135/773)
SENTENCE ERROR: 51.5% (66/130) WORD ERROR RATE: 16.6% (128/773)
SENTENCE ERROR: 50.0% (65/130) WORD ERROR RATE: 15.5% (119/773)
SENTENCE ERROR: 56.2% (73/130) WORD ERROR RATE: 16.9% (130/773)
SENTENCE ERROR: 62.3% (80/130) WORD ERROR RATE: 22.3% (172/773)

So the recipe for training is the following: train several times while tracking the accuracy, choose the best MLLT matrix and use it in the final trainings. If you have a large database, find the best MLLT for a subset of it and use that as an initial value for the MLLT estimation. There is no easier way until we find a better method for initial value estimation; a quick look at the articles didn't turn up any.

From recent articles I also collected quite a number of LDA derivatives - discriminative ones, HLDA and so on. Some of them seem to be free from this initial value problem. It would be nice to get a proper review of this large topic.

By the way, you can see in the chunk of code above that the comment is not quite correct. Positive-definiteness should be checked differently, with Sylvester's criterion for example. Though I think that since the condition det(A) > 0 seems to be enough for the feature space transform, the comment should simply be removed. But it may be that a positive-definite matrix is required for the optimization.
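To illustrate, here is a pure-Python sketch (not the Sphinxtrain code) of checking a perturbed-identity initializer with Sylvester's criterion. Note the criterion strictly applies to symmetric matrices, so for a non-symmetric random matrix it is only a heuristic:

```python
import random

def minor_det(m, k):
    # Determinant of the leading k x k principal minor, by Laplace
    # expansion (fine for the small matrices used here).
    if k == 1:
        return m[0][0]
    det = 0.0
    for j in range(k):
        sub = [[m[i][c] for c in range(k) if c != j] for i in range(1, k)]
        det += ((-1) ** j) * m[0][j] * minor_det(sub, k - 1)
    return det

def is_positive_definite(m):
    # Sylvester's criterion: all leading principal minors must be positive
    # (valid for symmetric matrices; a heuristic otherwise).
    return all(minor_det(m, k) > 0 for k in range(1, len(m) + 1))

random.seed(42)
n = 3
# Identity plus a small random perturbation, as in the Sphinxtrain initializer.
A = [[(1.0 if i == j else 0.0) + 0.1 * random.random() for j in range(n)]
     for i in range(n)]
print(is_positive_definite(A))  # True: the matrix is diagonally dominant
```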

Adaptation Methods

It's really hard to collect information on the practical application of speech recognition tools. For example, here is a wonderful quote from Andrew Morris on htk-users about what to update during MAP adaptation:

Exactly what it is best to update depends on how much training data you have, but in general it is important to update means and inadvisable to update variances. Only testing on held out test data can decide which is best, but if you are training on data from many speakers and then adapting to data from just one speaker, I expect updating just means should give best results, with variance adaptation reducing performance and transition probs or mix weights adaptation making little difference.

After a few experiments I can only confirm this statement: you should never adapt the variances. So the HOWTO in our wiki is not as good as it could be. Another bit can be taken from this document: it's really better to combine MAP and MLLR, and the best method for offline adaptation is:
  • Run bw to collect statistics
  • Estimate mllr transform
  • Update means with mllr
  • Run bw again with updated means
  • Apply MAP adaptation with a fixed tau greater than 100 (try to select the best value). Unfortunately, in my experience, automatic tau selection is broken in map_adapt. This way you'll update the variances a bit, but only slightly.
No book could tell you that!
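To see why tau matters, here is a toy 1-D sketch of the standard MAP mean update (an illustration of the formula, not Sphinxtrain's map_adapt code): the adapted mean is an occupation-count-weighted interpolation between the prior mean and the adaptation data, and a large tau keeps the model close to the speaker-independent prior.

```python
def map_update_mean(prior_mean, obs_sum, obs_count, tau):
    # Classic MAP mean update: interpolate the prior mean with the
    # observed (adaptation data) mean, weighted by the occupation count.
    # Large tau keeps the mean near the prior; small tau trusts the data.
    return (tau * prior_mean + obs_sum) / (tau + obs_count)

prior = 0.0        # speaker-independent mean of one Gaussian (toy, 1-D)
data = [1.0] * 50  # adaptation frames assigned to this Gaussian
obs_sum = sum(data)
print(map_update_mean(prior, obs_sum, len(data), tau=100.0))  # stays near prior
print(map_update_mean(prior, obs_sum, len(data), tau=10.0))   # moves toward data
```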

Time to buy the new video card

Everybody plays with training and recognition on a GPU now. A 200x improvement on NVidia CUDA is worth the time and money, moreover the sample code is already available.

Thanks to prym on the #cmusphinx IRC channel for the link, and of course huge thanks to Chuan Liu for the article and a new project. It would be great to see similar patches for Sphinxtrain/sphinx4; it would be the killer feature.

How to improve accuracy

People very often ask "how do I improve accuracy?". Since I've got three questions like this today, I decided to write a more or less extensive description of the ways to solve this problem. It will probably be a bit sketchy, but I hope it will be helpful. Corrections are very welcome as well.

1) First of all, let me mention that the problem is complex. It requires an understanding of Hidden Markov Models, beam search, language modelling and every other technology involved. I really recommend you read a book on speech recognition first. This one is very good:


In theory it should be possible to build a speech recognition system without the extensive knowledge listed above, but that's not the case now. We are working on an easy-to-use system, but we are still at the very beginning.

Probably you don't have time to read and study all this. Then consider whether you have time to implement a speech recognition system at all. At the very least, learn the basic concepts.

Please also learn how to program before you start. I think it's obvious that you need software development experience; we can't really teach you Java.

2) The next step is to set up a basic example of the system and estimate its accuracy. While the first part is quite obvious, the second is often ignored. Don't do that: it's critical to test accuracy under realistic conditions during the whole development process.

Decide what kind of system you are going to implement - large vocabulary dictation, medium vocabulary name recognition, small vocabulary command and control, or some other task; perhaps you need an IVR system. For each kind there is already a demo - use it as a base. Please don't try to build dictation from a command and control demo; it's just not suitable.

3) Now the important task: the estimation of the accuracy. Examples of how to do that can be found in the decoder sources, in the tutorial and in many other places. Collect a test database, recognize it and compute the exact accuracy.
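If you don't want to use the bundled scoring scripts, WER is easy to compute yourself; a minimal sketch using Levenshtein alignment on word lists:

```python
def word_error_rate(ref, hyp):
    # Standard Levenshtein alignment on word lists:
    # WER = (substitutions + insertions + deletions) / len(ref).
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution / match
    return d[len(r)][len(h)] / float(len(r))

print(word_error_rate("this is a test", "this is test"))  # 0.25
```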

Typical error rates are roughly the following:

Command and control: 5% WER (word error rate)
Medium vocabulary: 15% WER
Large vocabulary: 30% WER
Large vocabulary, short utterances: 50% WER

If you have noisy audio or some accent, multiply these numbers by 2.

4) Compare the actual accuracy with the expected value. If the accuracy is roughly as expected, proceed to the next step. If not, search for the bug. Most likely you've made a mistake in the system setup. Check whether the configuration is suitable for the task, check speech quality in a sound editor, find out the spectrum range, check the sampling rate, accent, dictionary and other parameters. If your WER is 90%, you made a mistake for sure.

As an example of task-dependent training, consider the acoustic database. If you train a small vocabulary acoustic model, you need a word-based acoustic model with a word-based phoneset in the Sphinxtrain case. If you are training a large vocabulary database, make sure your phoneset is not large and that you have selected the proper number of senones/mixtures.

5) Once you've reached the baseline, it will take a lot of work to improve it. Think about whether it's enough for you and whether you can build your application with such accuracy; it's unlikely you'll get significantly more. But if you are brave enough:

  • Use MLLT/VTLN feature space adaptation
  • Use MLLR and other type of online speaker adaptation
  • Adapt the language model, use context-sensitive language models
  • Tune beams - try different values and experiment with them
  • Implement the rejection for OOV (out-of-vocabulary) words and other noise sounds
  • Implement noise cancellation.
  • Adapt acoustic models and dictionary to your speakers/their accent.
  • ....
Out-of-vocabulary words are the most frequent issue here. Contrary to what users expect, most demos don't do any OOV filtering out of the box, while it's critical for applications. Unfortunately, though unlimited vocabulary systems exist, they are quite complex; most systems have a limited vocabulary. That means you need to implement OOV detection and/or confidence scoring in order to filter the garbage. This is doable and described in the demos too, for example in the confidence demo.
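The idea of confidence filtering is simple; a toy sketch (the words and scores below are made up - real confidence scores come from the decoder's posterior computation, as in the confidence demo):

```python
def filter_by_confidence(hypotheses, threshold=0.5):
    # Reject words whose confidence score falls below the threshold;
    # low-confidence words are often OOV words or noise that got mapped
    # onto the closest in-vocabulary word.
    return [w for w, conf in hypotheses if conf >= threshold]

# (word, confidence) pairs as a confidence-annotated result might look
result = [("call", 0.93), ("bob", 0.88), ("umbrella", 0.21)]
print(filter_by_confidence(result))  # ['call', 'bob']
```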

If you are short of ideas here, join the mailing list, we have a lot of features to implement.

Sphinx4-1.0beta3 is released

The best speech recognition engine is on its way to world domination. We are happy to announce the new sphinx4 release. This is still a development version, so bug reports and testing are much appreciated.



New Features and Improvements:
  • BatchAGC frontend component
  • Complete transition to defaults in annotations
  • ConcatFeatureExtractor to cooperate with cepwin models
  • End of stream signals are passed to the decoder for end of stream handling
  • Timer API improvement
  • Threading policy is changed to TAS
Bug fixes:
  • Fixed reading UTF-8 from the language model dump
  • Huge memory optimization of the lattice compression
  • More stable frontend work with DataStart and DataEnd and optional SpeechStart/SpeechEnd

Yaniv Kunda, Michele Alessandrini, Holger Brandl, Timo Baumann, Evandro Gouvea

Release of the Polish voice for Festival

A very remarkable and long-awaited release happened recently: the Polish multisyn voice for the Festival TTS system was made available. This is the best multisyn voice available nowadays, both in terms of speech material (several hours, much more than any arctic database, around 500 Mb of audio) and label quality (it has manually corrected segment labels). It also uses some unique synthesis method modifications, like target f0 prediction for multisyn combined with a ToBI/APML-based intonation module. The Scheme code also has some important modifications. I really encourage you to try this voice even if you don't understand Polish. I also look forward to the HTS voices.

Thanks a lot to Krzystof for his hard work.

Training language model with fragments

Sequitur G2P by M. Bisani and H. Ney is a cool package for letter-to-phone translation: quite accurate and, most importantly, open. But there are also hidden gems in this package :)

One of them is the phone-oriented segmenter that splits words into chunks - graphones. A graphone is a joint unit consisting of letters and the corresponding phones; sequences of graphones make up words. Graphones are used in g2p internally, but they are also very useful, for example, in the construction of open vocabulary models. The system as a whole is described here:

Open Vocabulary Spoken Term Detection Using Graphone-Based Hybrid Recognition Systems by M. Akbacak, D. Vergyri and A. Stolcke

and the details of the language model in the original article:

Open Vocabulary Speech Recognition with Flat Hybrid Models by Maximilian Bisani and Hermann Ney

The interesting thing is that all required components are already available; the issue is to find the correct options and build the system. So the quick recipe is:

1. Get Sequitur G2p
2. Patch it to support Python 2.5 (replace elementtree with xml.etree, since elementtree is deprecated now)
3. Convert the cmudict lexicon to the XML-based Bliss format (I'm not sure exactly what it is; I failed to find information about it on the web)

import sys
import string

print "<lexicon>"
file = open(sys.argv[1], "r")
for line in file:
    toks = line.strip().split()
    # Skip malformed entries without a pronunciation
    if len(toks) < 2:
        continue
    word = toks[0]
    phones = string.join(toks[1:], " ")
    print "<orth>"
    print word
    print "</orth>"
    print "<pron>"
    print phones
    print "</pron>"
print "</lexicon>"

4. Train the segmenter model. The most complicated thing is to figure out the options to train a multigram model with several phones. The default one used in g2p consists of 1 phone and 1 letter; it's not suitable for an OOV language model.

g2p.py --model model-1 --ramp-up --train cmudict.0.7a.train --devel 5% --write-model model-2 -s 0,2,0,2

5. Ramp up the model to make it more precise
6. Build the language model; here you need the dictionary in XML format. As the article above describes, the original lexicon should be around 10k and the subliminal training lexicon around 50k.

makeOvModel.py --order=4 -l cmudict.xml --subliminal-lexicon=cmudict.xml.test -g model-2 --write-lexicon=res.lexicon --write-tokens=res.tokens

After that you get tokens for the LM and, with additional options, even counts for a language model you can train with SRILM. I haven't finished the previous step yet, so this post should have a follow-up.

I'm going to ClueCon

This August I'm going to the US again, to Chicago, to ClueCon, where I'll give a talk titled "The use of open source speech recognition". Here is a small outline:

The most complicated thing in modern ASR is to make user expectations agree with the actual capabilities of the technology. Although the technology itself is able to provide a number of potentially very useful features, they are not exactly what average user expects.

Many specialized tasks require a huge amount of customization, for example speaker adaptation needs to be accurately embedded into the accounting system in order to let recognizer improve the accuracy.

Open source solutions could help here because of the much greater flexibility they offer. But although many companies provide speech recognition services, only several open projects exist, and most of them are purely academic. They often require a lot of tuning by the end user, and many parts of a complete system are just missing.

Luckily the situation has been improving in recent years: the core components are getting a more or less stable release schedule and active support, including commercial support.

The purpose of this talk is to cover the trends in the development of open source speech recognition in conjunction with telephony systems and suggest ways it can reach enterprise level.

I'll also visit Boston for two days

Update: Here is the presentation

Gran Canaria

I'm on Gran Canaria!

Left Zyxel

Recently I left Zyxel, where I worked for three years on a Linux-based home class router (CPE). It was a nice place where I met wonderful friends and learned a lot. It was also encouraging and interesting work.

I still have a few ideas for the CPE market that are not completely covered by the current offerings of various competitors. Things like a modern web-2.0 dynamic UI, better error reporting, overall performance optimization, testing, centralized management and so on promise a lot for a vendor able to handle such a surprisingly complicated product as a CPE. Cell phones are a much more active market compared to routers, although the class of home devices seems no less important than cell phones. For example, I was amazed by the overall quality of the T-Mobile G1 and haven't seen anything comparable on the router market. Probably the US market is different though. The near future, with 4G, extensibility and wideband access everywhere, promises a lot.

Let's hope one day there will be a team that could implement it. Also I hope to buy a router made by me very soon.

Java bits in Sphinx4

I spent some time converting sphinx4 to Java 5, mostly rewriting loops to drop explicit iterators. I hope it will not just make the code cleaner but also give us a few bits of performance.

I also tried to profile the sphinx4 lattice code. It seems that I broke it by changing the default value of keepAllStates to true: with all HMM states kept, it's very hard to traverse the tree to create a lattice. Unfortunately the TPTP profiler in Eclipse turned out to be very slow, so I'm now looking both for a better profiler and for a way to solve this keepAllStates issue.

That change had another drawback: now we do very unnatural work when we decide whether the stream is over. Currently the scorer returns null on every SpeechEnd signal. It should also return null on the DataEnd signal, and that null should be different from the first one, since we should stop recognition only after DataEnd in the case of long wav file transcription and continue after SpeechEnd. Now we distinguish them by the presence of data frames which are kept due to the keepAllTokens setting - a very unnatural dependency which has drawbacks as well, at the very least it eats memory. But I haven't decided what to do with this not-so-perfect API yet. Most probably we'll need to introduce something like EOF in C to find out whether the stream is over.

Dither is considered harmful

MFCC features used in speech recognition are still a reasonable choice if you want to recognize generic speech. With tunings like frequency warping for VTLN and MLLT they can still give reasonable performance. Although there are many parameters to tune, like the upper and lower frequencies, the shape of the mel filters and so on, the default values mostly work fine. Still, I had to spend this week on one issue related to zero energy frames.

Zero energy frames are quite common in telephony recorded speech: due to noise cancellation or VAD speech compression, telephony recordings are full of frames with zero energy. The issue is that the calculation of MFCC features involves taking the log of energies, so you get the undefined value log 0. There are several ways to overcome this issue.

The one used in HTK or SPTK, for example, is to assign some floored value to the energy, say 1e-5, which is quite a big value in the log domain. This solution is actually quite bad, at least in its Sphinx implementation, because it largely affects the CMN computation: the mean goes down and bad things happen. A silent frame can affect the result of the whole phrase.

Another one is dither: you apply random 1-bit noise to the sound as a whole and use this modified waveform for training. Such a change is usually enough to make the log take acceptable values around -1.
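A minimal sketch of such dithering (assuming integer PCM samples; the real front end applies it before framing): with a fixed seed the noise, and thus the scores, are reproducible from run to run.

```python
import random

def apply_dither(samples, seed=12345):
    # Add +/-1 (1-bit) noise to every sample. A fixed seed keeps the
    # results reproducible, which addresses the common complaint that
    # dither makes recognition scores non-deterministic.
    rng = random.Random(seed)
    return [s + rng.choice((-1, 1)) for s in samples]

silence = [0] * 10            # a zero-energy frame
print(apply_dither(silence))  # no sample is exactly zero anymore
```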

There have been complaints about dither; the most well-known is that it affects recognition scores - results can differ from run to run. That's a bad thing, but not that bad if you start with a predefined seed. So I used to think dither was fine, and by default it's applied both in training and decoding. But recently, when I started testing the sphinxtrain tutorial, I came to a more important issue.

See the results on an4 database from run to run without any modifications:

TOTAL Words: 773 Correct: 645 Errors: 139
TOTAL Percent correct = 83.44% Error = 17.98% Accuracy = 82.02%
TOTAL Insertions: 11 Deletions: 17 Substitutions: 111
TOTAL Words: 773 Correct: 633 Errors: 149
TOTAL Percent correct = 81.89% Error = 19.28% Accuracy = 80.72%
TOTAL Insertions: 9 Deletions: 23 Substitutions: 117
TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115
TOTAL Words: 773 Correct: 650 Errors: 133
TOTAL Percent correct = 84.09% Error = 17.21% Accuracy = 82.79%
TOTAL Insertions: 10 Deletions: 17 Substitutions: 106
TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115

If you are lucky you can even get a WER of 15.95%. That's certainly unacceptable, and it remains unclear why training is so sensitive to the dither applied. Clearly it makes any testing impossible. I checked these results on a medium vocabulary 50-hour database and they are the same - the accuracy is very different from run to run. The interesting thing is that only training is affected that much; in decoding you get a very slight difference of 0.1%.

So far my solutions are:
  • Disable dither on training
  • Apply a patch to drop frames with zero energy (this seems useless, but it helps to be less nervous about warnings)
  • Decode with dither
I hope I'll be able to provide more information in the future about the reasons for this instability, but for now that's all I know.

Text summarization low hanging fruit

Actually, all the data required for quite precise text summarization is almost in place: one should just add support for WordNet from nltk into the Open Text Summarizer, calculate frequencies and present highlighted sentences to the user. Or it's possible to do the same in Python with nltk itself.
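The frequency-based core is only a few lines; a toy sketch without WordNet (which would additionally fold synonyms into one frequency bucket):

```python
from collections import Counter

def summarize(text, n=1):
    # Score each sentence by the summed corpus frequency of its words
    # and return the top-n sentences - the core idea behind OTS-style
    # frequency summarizers.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()),
                    reverse=True)
    return scored[:n]

text = ("Speech recognition needs data. Data quality matters a lot. "
        "The weather was nice today.")
print(summarize(text))  # ['Data quality matters a lot']
```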

It would help in many cases, for example in mail processing. Getting 200 mails a day, it's really hard to read them all. Or probably it's just time to unsubscribe from some mailing lists.

My vote for voxforge

I just voted for VoxForge in the category: "Most Likely to Change the Way You Do Everything". You might want to do the same :)

Go to http://sourceforge.net/community/cca09/nominate/?project_name=VoxForge&project_url=http://voxforge.org/

Blizzard Challenge 2009

CSTR and others are pleased to announce that the listening tests for the Blizzard Challenge 2009 are now running. The Blizzard Challenge is an annual open speech synthesis evaluation in which participants build voices using common data, and a large listening test is used to compare them. Participants include some of the leading commercial and academic research groups in the field.

I would appreciate your help in getting as many listeners to participate as possible, by forwarding this message on to other lists, colleagues, students, and of course taking part yourself.

The listening test should take 30-60 minutes to complete, and can be done in stages if you wish. You do not need to be a native speaker of the language in order to take part. There are 4 different start pages for the listening test, as follows:



Speech Experts:

Mandarin Chinese:

Whether you consider yourself a 'speech expert' is left to your own judgement.

Training the large database trick

Training a large database requires a cluster. SphinxTrain supports training on Torque/PBS, for example; to do this you need to set the following configuration variables:


and set the number of parts to train. The issue is to guess the number of parts. I previously thought the number of parts affects the accuracy, based on results like these:

1 part:

TOTAL Words: 773 Correct: 660 Errors: 126
TOTAL Percent correct = 85.38% Error = 16.30% Accuracy = 83.70%
TOTAL Insertions: 13 Deletions: 9 Substitutions: 104

3 parts:

TOTAL Words: 773 Correct: 583 Errors: 262
TOTAL Percent correct = 75.42% Error = 33.89% Accuracy = 66.11%
TOTAL Insertions: 72 Deletions: 17 Substitutions: 173

10 parts:

TOTAL Words: 773 Correct: 633 Errors: 168
TOTAL Percent correct = 81.89% Error = 21.73% Accuracy = 78.27%
TOTAL Insertions: 28 Deletions: 10 Substitutions: 130

20 parts:

TOTAL Words: 773 Correct: 619 Errors: 181
TOTAL Percent correct = 80.08% Error = 23.42% Accuracy = 76.58%
TOTAL Insertions: 27 Deletions: 13 Substitutions: 141

But it appeared that all of the above is not true. One potential source of problems was that the norm.pl script grabs all the subdirectories under the bwaccum one indiscriminately. So if there are some old bwaccum dirs left over (e.g. if you train on 20 parts first, then start again with 10 without deleting the directories in between), the norm script will screw up (thanks to David Huggins-Daines for pointing that out to me). In this particular test there was another issue: I forgot to update the mdef after the model rebuild, and the old scripts didn't do that automatically. With multipart training the order of senones in the mdef is different, which is why there was a regression, though the set of senones is the same.

So the testing and statements above are completely wrong - accuracy doesn't depend on the number of parts used, as expected. This confirms the ground truth that a correctly stated experiment is the most important thing in research.

Now only one issue is left - the accuracy drop from the old tutorial to the new one. But that is a completely different issue, discussed in my mails on cmusphinx-sdmeet now.

Bad prompts issue

After quite a lot of training of the model on a small part of the database to test things, I came to the conclusion that the main issue is bad prompts. Indeed, the accuracy on the training set for 4 hours of data, with the language model trained on the same training prompts, is only 85%; usually it should be around 93%. The issue here is that the real testing prompts are also bad and they should stay that way, otherwise we'll be limited to high quality speech only. I remember I tried forced alignment with the communicator model before, but it didn't improve much, just because of the testing set issue. Another try was to use a skip state; that was not fruitful either.

So the plan for now is to choose a subset with forced alignment again and train the model, to check whether the hypothesis is true and bad prompts in the acoustic database are indeed the main issue. It looks like we are walking in circles.

I ended up reading the article titled "Lightly supervised model training".

Speech AdBlock

Inspired by Daniel Holth's application to remove the word "twitter" from podcasts:


I think it's a very good idea to implement a keyword filter to block advertising in podcasts. Though keyword spotting is not easily done with CMUSphinx right now, it should be a rather straightforward thing to do. In the end it could just be a binary that takes a list of keywords to block and filters an mp3 file, giving the user the same file with the advertising blocked.

Cepwin Features Training

Recently the option to bypass the delta and delta-delta feature extraction process and directly apply an LDA transform matrix to the cepstral coefficients of sequential frames was added to sphinxtrain. To use it you need to adjust the training config and the decoder as well:

  • Set feature type to 1s_c
  • Add $CFG_FEAT_WINDOW=3; to the config file
  • Train with MLLT
  • Apply the attached patch to sphinxbase cepwin.diff.
  • Decode

  • You can use these models in sphinx4 now; the following config should do the work:

    <component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor">
        <property name="windowSize" value="3"/>
    </component>
    <component name="lda" type="edu.cmu.sphinx.frontend.feature.LDA">
        <property name="loader" value="sphinx3Loader"/>
    </component>

I haven't found the optimal parameters yet, but it seems that something like cepwin=3 and a final dimension around 40 should work. I hope to get results on this soon.

Looking back on Free Software

I've read some books on business recently:
They sometimes repeat each other, but actually have a few interesting moments; at least I started to look at all this from a bit different point of view. Unfortunately this domain is covered quite differently by Free Software community people, who tend to be idealistic but promote their point of view actively. Words like "community", "leadership" or "cool people" don't bring much in the end, and the most interesting thing is that such words are mostly spoken by corporate people.

Anyhow, it would be nice to have a project with a clear mission and a set of reachable goals, like product plans, each with a design and both technical and non-technical documents. It would be nice to have a test set with 90% coverage, a build without warnings, and a tracking system for user requests. Things like a slick UI are also important. After all, it's easier to get this than to build an LVCSR with 95% accuracy, I think :)

Frama-C Eclipse plugin

I decided to finally go forward and publish my modifications of the Frama-C Eclipse plugin that I'm doing at work. Moreover, I decided to try git/github. Let's see how it goes. The project is here:


The future plans include:
  • better graphics
  • more cleanup
  • off-the-shelf support for recent Frama-C versions

Quest in a configuration file

After almost a year of wondering, I finally discovered what this means in sphinx4 config files:

    <component name="activeListManager"
    <propertylist name="activeListFactories">

Actually, it's even described in the docs:

The SimpleActiveListManager is of class edu.cmu.sphinx.decoder.search.SimpleActiveListManager. Since the word-pruning search manager performs pruning on different search state types separately, we need a different active list for each state type. Therefore, you see different active list factories being listed in the SimpleActiveListManager, one for each type. So how do we know which active list factory is for which state type? It depends on the 'search order' as returned by the search graph (which in this case is generated by the LexTreeLinguist). The search state order and active list factory used here are:

State Type - ActiveListFactory

There are two types of active list factories used here, the standard and the word. If you look at the 'frequently tuned properties' above, you will find that the word active list has a much smaller beam size than the standard active list. The beam size for the word active list is set by 'absoluteWordBeamWidth' and 'relativeWordBeamWidth', while the beam size for the standard active list is set by 'absoluteBeamWidth' and 'relativeBeamWidth'. The SimpleActiveListManager allows us to control the beam size of different types of states.

It's hard to guess, isn't it? Well, I hope soon we'll be able to make configuration easier. The idea of annotated configuration came to my mind today. Together with the older idea of using task-oriented predefined configurations, it could really save a lot of effort.

Today, with the help of my chief, I found FindBugs, a nice static analyzer that tries to find issues in Java code and report them. It's a very useful tool: I've already fixed a few bad things in sphinx4 and in other projects, and the number of false positives is acceptable. A similar tool for C, for example, is splint, though the Java tools as usual are much more useful. And there is an Eclipse plugin that lets you apply the tool with a single mouse click.

This makes me think about what can be counted as a development platform. Although it's well known that scripting languages like Python speed up development, they largely lack tools like static analyzers, debuggers, profilers, documentation and testing frameworks and so on. There is some effort to create a common framework for quickly building development tools along with a DSL language, but the result is not so advanced, I suppose. Basically it seems that today there is no real choice of language for development, and in light of this it seems very strange that GNOME development goes in the completely opposite direction, stepping into the domain of JavaScript and naive programming. I hope the desktop will not become a collection of bugs after that.

Sphinx4 migrated to git

This change started some time ago, but now it's mostly finished and announced. The tree can be found here:


The discussion is here.

I'm glad to see the progress; big thanks to everyone involved - Joe, Piter and others.

About git itself I have mixed feelings. The advantages of DVCS aren't obvious to me, and in the past I even gave up my participation in one project after its migration to mercurial (it was http://linuxtv.org). The distributed nature increases complexity and confuses at least me: it's hard to understand where the latest changes are, what the real state of things is and where change happens. Developers tend to add their changes to their own branches, and little effort is made to create a common branch. Also, among all DVCSs git is the worst in terms of usability. Sadly, GNOME also migrates to git in the near future.

Every change has its black and white sides. There are many things I like in the new sphinx4, such as the clear split of the tests one can run. Some things are hard to understand, like the Rakefile migration. I'm worried about Windows users - how will they build sphinx4 now? Anyhow, let's hope the issues will be resolved and a new shiny release will appear very soon.

Russian GNOME 2.26

Russian GNOME 2.26 is 100% translated. Congratulations to the team for their hard work.

GNOME Summer of Code tasks

I spent some time today trying to invent some interesting tasks for GNOME Summer of Code 2009. My favorite list for now is:
  • Text summarizer in Epiphany
  • Improved spell check for GEdit
  • Doxygen support for gtk-doc
  • Desktop-wide services for activity registration
  • Automatic workstation mode detection and more AI tasks the desktop can benefit from
  • Cleanup of the Evolution interface, where sent and received mail are grouped together
The list is probably too boring, but one should note that a summer is usually too short to implement something serious, and students are not as experienced as one would want them to be. Some of the tasks were rejected already, though it's not a big deal. I just find it discouraging that the list of officially proposed tasks is even more tedious.

    The overview of this issue makes me think again about GNOME as a product on the market and the possible ways of its development. It seems we are now at a point where the feature sets of competitors have stabilized and it's hard to invent something new in the market: the so-called mature product stage, where it's important to polish and lower costs. A big step is required to shift the product to a new level. Probably I need to investigate the research desktops that completely change the way users work with the system. For example, I'd love to see better AI support everywhere, like adaptive preferences; better stability and security with proper IPC and a service-based architecture; self-awareness services; a modern programming language. I'm not sure I'm brave enough for that, though.

    HTK 3.4.1 is released

    Amazing news, really. The new features of this release include:

    1. The HTK Book has been extended to include tutorial sections on HDecode and discriminative training. An initial description of the theory and options for discriminative training has also been added.
    2. HDecode has been extended to support decoding with trigram language models.
    3. Lattice generation with HDecode has been improved to yield a greater lattice density.
    4. HVite now supports model-marking of lattices.
    5. Issues with HERest using single-pass retraining with HLDA and other input transforms have been resolved.
    6. Many other smaller changes and bug fixes have been integrated.

    The release is available on the HTK website.

    Building interpolated language model

    Yo, with a little poking around I managed to get a combined model from the database prompts and a generic model. The accuracy jumped significantly.

    Sadly, cmuclmtk requires a lot of magic passes over the models to get lm_combine to work. Many thanks to Bayle Shanks from voicekey for writing a recipe. So if you want to give it a try:

    • Download voice-keyboard
    • Unpack it
    • Train both language models
    • Process them with the scripts lm_combine_workaround
    • Process both with lm_fix_ngram_counts
    • Create a weight file like this (the weights could be different of course):
    first_model 0.5

    second_model 0.5
    • Combine models with lm_combine.
    After all these steps you can enjoy a good language model suitable for dialog transcription.

    Building the language model for dialogs

    I'm searching for a way to build a combined language model suitable for dialog decoding. I have quite a lot of dialog transcriptions, but they aren't comparable with a generic model built from large corpora in terms of coverage. It would be nice to combine them somehow, to get the structure of the first model and the diversity of the second. One article I read says it's possible to just interpolate them linearly, so probably I need to get in closer touch with the SRILM toolkit.
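    Linear interpolation is just a per-word mixture of the two models' probabilities. Here is a toy sketch; the vocabularies and probability values are made up for illustration, and a real toolkit such as SRILM does this at the full n-gram level rather than on unigrams:

```python
# Toy linear interpolation of two unigram models (probabilities invented).
dialog_lm = {"hello": 0.30, "yes": 0.40, "no": 0.30}
generic_lm = {"hello": 0.05, "yes": 0.05, "no": 0.05, "the": 0.85}

def interpolate(p1, p2, lam):
    """p(w) = lam * p1(w) + (1 - lam) * p2(w); words absent from a model get 0."""
    vocab = set(p1) | set(p2)
    return {w: lam * p1.get(w, 0.0) + (1 - lam) * p2.get(w, 0.0)
            for w in vocab}

mixed = interpolate(dialog_lm, generic_lm, lam=0.5)
print(round(mixed["yes"], 3))   # 0.225 (strong in the dialog model)
print(round(mixed["the"], 3))   # 0.425 (comes only from the generic model)
```

    Note that the mixture of two proper distributions is itself a proper distribution, so no renormalization is needed as long as the weights sum to one.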

    It's discouraging that sphinx4 doesn't support higher-order n-grams. Another article mentions a workaround for that: joining frequent word combinations into compound words.
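    The compound-word workaround can be sketched as a preprocessing pass that joins chosen word pairs into single tokens before training, so a trigram over compounds effectively spans more words. The pair list below is invented for illustration:

```python
def join_compounds(tokens, compounds):
    """Replace known word pairs with single joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compounds:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Frequent pairs we decide to treat as single words (illustrative).
compounds = {("of", "course"), ("thank", "you")}

print(join_compounds("thank you very much".split(), compounds))
# ['thank_you', 'very', 'much']
```

    The same joined tokens would then need pronunciation entries in the dictionary, and the decoder output has to be split back afterwards.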

    Btw, the generic model gives 40% accuracy while the home-grown dialog model gives 60%, so it's a promising direction anyhow.

    Cleanup strategies for acoustic models

    An interesting discussion is going on at Voxforge about cleanup of the acoustic database. It seems to me that we are really different from the usual research acoustic databases, which are mostly properly balanced: we have a load of unchecked contributions, non-native accents and so on. But we still have to work with such a database and get sensible models from it. The Fisher experience showed that even loosely checked data can be useful for training. Although our data isn't transcribed as nicely as Fisher's, it can still be useful if we apply a training method that accounts for the nature of the collected data.

    I tried to find some articles about training acoustic models on incomplete data, but it seems that most of such research is devoted to other domains, like web classification. Web data is by definition incomplete and contains errors. We could reuse their unsupervised learning methods, but I failed to find information on this. Links are welcome.

    Another interesting read today was about performance on the Fisher database. Articles mention that the baseline is around 22% WER at 20xRT speed. 20xRT is unacceptably slow, I think, but even at 5xRT we are close to this barrier. The thing that makes me wonder is that in sphinx4 beams make decoding slower but don't improve accuracy. It must be a bug, I think.

    Nexiwave in MIT100k

    Congratulations to Ben and the others on seeing Nexiwave in the MIT100k semifinal.

    A great article about architecture management:

    IEEE Software, March/April 2005 (Vol. 22, No. 2), pp. 19-27: Architecture Decisions: Demystifying Architecture

    Behaviour guideline

    Although the GNOME HIG nicely specifies how a GNOME application should look, it says nothing about how an application should behave. It's strange that our usability guys don't take such an important thing into account. Even a small difference in program behaviour affects user satisfaction. For example, the Open dialog saves the location in one application but makes me browse to the same directory every time in another. The main point here is that programs should behave consistently, not only look consistent.

    And some consequences.

    It's impossible to get consistent behaviour without sharing a codebase. One can make an application written in Qt or with the Mozilla suite look like a Gtk application, but users easily see the difference in the way such applications do things. Once you open the settings dialog, the mirage of consistency disappears. The HIG should not be a set of recommendations everyone tries to follow, but documentation of hardcoded rules that anyone using the library follows automatically.

    Integration with other toolkits doesn't make any sense, just like supporting software on a different platform. If an application uses another codebase, it will behave differently. Take, for example, gecko or gtk-mozembed applications. They all have problems with keyboard focus and accessibility. It's impossible to make them work the way a GNOME user expects. Even if you get them to look similar, it's impossible to maintain such consistency every time something changes in gtk.

    Digging the acoustic model bug

    It's so useful to sleep a lot and care about the details. Today I found a bug in the model that I think I could have been searching for for years. Accidentally, the topology of the model was wrong: I was using a 5-state HMM without the transition from state n to state n+2. The result was quite bad recognition accuracy. This issue was so well hidden that I wonder whether it's possible to discover such things at all.
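    A minimal sketch of the kind of sanity check that would have caught this: verifying that a left-to-right (Bakis) topology really contains the state n to n+2 skip arcs. The matrix values below are illustrative, not taken from the actual model:

```python
import numpy as np

def has_skip_transitions(trans, tol=0.0):
    """Check that every state i that can skip has a nonzero i -> i+2 arc."""
    n = trans.shape[0]
    return all(trans[i, i + 2] > tol for i in range(n - 2))

# Illustrative 5-state Bakis topology: self-loop, next-state and skip arcs.
good = np.array([
    [0.6, 0.3, 0.1, 0.0, 0.0],
    [0.0, 0.6, 0.3, 0.1, 0.0],
    [0.0, 0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# The buggy topology from this post: the i -> i+2 skip arcs are missing.
bad = good.copy()
for i in range(3):
    bad[i, i + 1] += bad[i, i + 2]  # keep each row summing to 1
    bad[i, i + 2] = 0.0

print(has_skip_transitions(good))  # True
print(has_skip_transitions(bad))   # False
```

    Running a check like this over every model's transition matrices after training is cheap, and it turns a silent accuracy degradation into a loud failure.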

    The method that helped is comparative debugging. By checking the performance of pocketsphinx and sphinx4, I found that they differ. But that wasn't enough. The critical point that helped was an unrelated change in pocketsphinx: previously it assigned very small non-zero transition probabilities even if the model had zero transitions. Happily, David changed that about a month ago, and the difference in recognition rate between the recent and older versions of pocketsphinx helped to find the problem. I was really lucky today.

    I wonder how diagnostics could be extended here to help resolve issues like that. It seems extremely important to me to build a recognition system that allows verification. A similar problem came up on another front today, btw: our router had an issue in its DNS relay, but it was almost impossible to discover the reason due to the limited diagnostic output. We really need to rethink the current way of error reporting.

    Release of msu_ru_nsh_clunits

    I don't want to be boring and make this feed a clone of SourceForge, but I've recently released a new Russian voice for Festival.


    The new release contains a lot of previous updates that were distributed unofficially. The most important feature is that the labels were updated automatically with SphinxTrain / MLLT models, which improved the performance a bit. I haven't debugged the current problems though, so it's not clear what the next steps to improve the quality are. Still, it's rather clear that better join algorithms and HMM-based cost functions will improve accuracy. I also want to look at the pending transcription algorithm for Russian used in academic Russian synthesizers.

    I finally followed the industry mainstream and abandoned hand-made labels. It's much easier to keep them automatic, since it brings more flexibility, and I don't believe in hand work, which turned out to be error-prone as well. I hope that better algorithms like Minimum Segmentation Error training will do their job of creating a perfect segmentation for the TTS database. I also want to think about processing algorithms that are robust to segmentation errors; they are more reasonable to apply in a situation where errors are present by design.

    I shifted my day schedule to US time again, which is not ideal. It had returned to normal last week, when I had to stay awake a whole night and day; that actually turned out to be productive, but now it has shifted back again. I hope I'll be able to fix it soon.

    Sphinxtrain Release

    It seems this will be a month full of releases: SphinxTrain was released today.

    SphinxTrain is the acoustic model training system for the Sphinx family of continuous speech recognition systems. After years of not having an actual release of SphinxTrain, it was time to make one, in anticipation of potentially restructuring the training code. This trainer can produce acoustic models for all versions of Sphinx, and supports VTLN, speaker adaptation and dimensionality reduction. Future releases will support discriminative and speaker-adaptive training, and will be more closely integrated with the Sphinx decoders.

    Asterisk and pocketsphinx

    Today I'm happy to see long-awaited proper support for the Asterisk Speech API in pocketsphinx. Now, taking into account the FreeSWITCH support, we provide competitive support for speech recognition applications on two popular PBX platforms.

    Get the details here

    New release of sphinx4

    I'm really happy to announce the new release of sphinx4, the speech recognition engine written in Java. I wouldn't say it is the greatest release ever, but it's supposed to breathe life into this project, which has quite a long history. A huge number of developers and companies were involved in the development of this product and many products rely on it, but unfortunately the project itself has kind of stalled. This new release mostly packages what was done over the last five (!) years since the previous release, and the major features are:
    • Rewritten XML-based configuration system
    • Frontend improvements like better VAD or MLLT transformation matrix support
    • Cleanup of the properties
    • Batch processor with accuracy estimation for testing
    The project itself still clearly lacks well-organized documentation and better usability, but I think that's a subject for the next release, due in half a year or so. I really want to see it become a more reliable platform for speech recognition development and hope that new features, active community support, regular releases and a stable API will do their work.

    Blog Archive