Mary TTS 4.3.0 released

With Russian voice from Voxforge DB. Yay! Try it on the web:

Mary is definitely a superior system compared to Festival. A graphical UI, modern design and support for things like automatic dictionary creation make it really easy to build support for a language. And thanks to the modular and stable codebase one can easily add support for a new feature, integrate with an external package as was done with OpenNLP, or just fix a bug. And your fix will be accepted!

By the way, there are two Voxforge TTS datasets pending, German and Dutch, and there is also a Polish voice. If anyone wants to try, it should be really easy to add another language to Mary.

Phoneset with stress

So I finally finished testing the stress-aware model. It took me a few months, and in the end I can say that lexical stress is definitely better. It provides better accuracy and, more importantly, more robustness than a model with a non-stressed phoneset.

I hope we retrain all the other models we have with the stressed phoneset. It's great that CMUDict provides enough information to do that. The story of my testing was quite interesting. I believed in stress for a long time but wasn't able to prove it. In theory it's clear why it helps: when speech speed changes, stressed syllables remain less corrupted than unstressed ones, and we get better control over the data. Additional information like lexical stress is important. Of course the issue is the increased number of parameters to train in the model. That's why, I think, early investigations concluded that a phoneset without stress is better. A discussion about it on cmusphinx-devel this summer also confirmed that Nuance moved to a model with stress in their automotive decoder.

It's interesting how long I tested that. I made numerous attempts and each one had bugs:

  • First attempt used bad features (adapted for 3gp) and didn't show any improvement
  • Number of senones in the second training was too small, since I didn't know the reason for the first failure
  • Third attempt had an issue with the automatic questions, which were accidentally used instead of the manual ones I wrote, and it went unnoticed
  • Fourth attempt was rejected because there were issues with the dictionary format in Sphinx4. By the way, never use FastDictionary, use FullDictionary. FastDictionary expects a specific dictionary format with variants numbered like (2) (3) (4) and not (1) or (2) and (4).
  • Only the fifth attempt was good, but it showed improvement only on the big test set and not on the small one
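
The variant-numbering pitfall from the fourth attempt can be caught before training with a small check. This is only a sketch in Python, assuming a plain-text dictionary where each line starts with the word, the base entry has no suffix, and alternative pronunciations are written as WORD(2), WORD(3) and so on. The function name is made up for illustration and is not part of Sphinx4.

```python
import re
from collections import defaultdict

# Matches "WORD" or "WORD(N)" as the first token of a dictionary line.
ENTRY = re.compile(r"^(?P<word>\S+?)(?:\((?P<n>\d+)\))?$")

def find_bad_variants(lines):
    """Return words whose variant numbers are not exactly 2, 3, 4, ..."""
    variants = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        match = ENTRY.match(parts[0])
        word, number = match.group("word"), match.group("n")
        if number is not None:
            variants[word].append(int(number))
    bad = []
    for word, numbers in variants.items():
        expected = list(range(2, 2 + len(numbers)))
        if sorted(numbers) != expected:
            bad.append(word)
    return sorted(bad)
```

Running it over the dictionary file line by line and getting a non-empty list back is a hint that the dictionary will confuse a loader that expects consecutive variants.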

So basically, to check any fact you need to be very careful and double- or triple-check everything. Bugs are everywhere: in language model training, the decoder, the trainer, the configuration. From run to run bugs can lead to different results, and even a small change can break everything. I think the optimal way to do research could be to check the same proposition in independent teams using independent decoders and probably different data. Not sure if it's doable in the short term.


I've rebuilt the Nexiwave language models and met some issues which would be nice to solve one day. The CMU language model toolkit is a nice simple piece of software, but it definitely lacks many features required to build a good language model. So, thinking about the features a language modelling toolkit could provide, I created a list.

Decoding of Compressed Low-Bitrate Speech

I've spent some time optimizing accuracy for 3gp speech recordings from mobile phones. 3gp is a container format used on most mobile devices nowadays, with speech compressed using AMR-NB inside. I converted audio to AMR-NB and back, extracted PLP features and then trained a few models on that. The result is not encouraging: accuracy is worse than the stock model both on original and on compressed/decompressed audio. Not much worse, but significantly worse.

It looks like traditional HMM issues such as the frame independence assumption come into play here, which is confirmed by the papers I found. This paper is quite useful, for example:

Vladimir Fabregas Surigué de Alencar and Abraham Alcaim. On the Performance of ITU-T G.723.1 and AMR-NB Codecs for Large Vocabulary Distributed Speech Recognition in Brazilian Portuguese

And this paper is good too:

Patrick Bauer, David Scheler, Tim Fingscheidt. WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network

I need to research the subject more. Surprisingly, there are only a few papers on it, way fewer than on reverberation. It looks like we have to build a specialized frontend specifically targeted at decoding low-bitrate compressed speech. Or we need to move to features more robust than PLP.

For now I would state the problem as developing a speech recognition framework that provides good accuracy on:
  • Unmodified speech
  • Noise-corrupted speech
  • Music-corrupted speech
  • Codec-corrupted speech
  • Long-distance speech
A good system should decode well in all cases.

Updates in SphinxTrain

Tired of explaining build issues over and over, I found the passion to step in and start a sequence of major changes in SphinxTrain:

  • Ported sphinxtrain to automake, development branch you can try is here:
  • Will increase SphinxTrain dependency on sphinxbase, unifying the duplicated sources.
  • Will make training use external SphinxTrain installation, no setup in training folder will be required, only configuration. All scripts will be in share and in libdir, they will be installed systemwide. To try a new version one will just need to change path to sphinxtrain.
  • Will modify scripts to be able to build and test the database using a single command. No possibility to miss anything!
  • Will include automation for language weight optimization on a development set, better model training scripts will do everything required.
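
The language weight optimization in the last item boils down to a search over candidate values on a development set. Here is a minimal sketch in Python, not the actual SphinxTrain script: `decode_wer` is a hypothetical callable that decodes the development set with a given language weight and returns the word error rate.

```python
def tune_language_weight(decode_wer, candidates):
    """Return (best_weight, best_wer) over the candidate language weights.

    decode_wer is a hypothetical callable: language weight -> WER on the
    development set. In practice it would run the decoder and score the result.
    """
    best_weight, best_wer = None, float("inf")
    for weight in candidates:
        wer = decode_wer(weight)
        if wer < best_wer:
            best_weight, best_wer = weight, wer
    return best_weight, best_wer
```

The tuned value should then be checked on a separate test set, otherwise the reported accuracy is optimistic.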

I know Autotools aren't the best build system, but they are pretty straightforward. More importantly, the tools will follow common Unix practices and thus will be easier to use and understand.

Comments are welcome!


We've made great progress on Nexiwave as well. Check it out!

Backward Compatibility Issues

Just today I spent a few hours trying to figure out why a change in makeinfo version output broke the binutils build. Well, it's an old bug, but we all get mad when backward compatibility breaks. Especially when it affects our software. Especially when we have neither the time nor the passion to fix it. My complaints went through the roof, or probably even higher.

Life is a strange thing. Right after that I went ahead and broke sphinx4 backward compatibility in model packaging (again!). Now models distributed with sphinx4 follow the Sphinxtrain output format: all files are in a single folder, the model definition is named simply "mdef" and there is a feat.params file. Things are very

[shmyrev@gnome sphinx4]$ ls models/acoustic/wsj

dict license.terms means noisedict transition_matrices
feat.params mdef mixture_weights README variances

It will certainly help to avoid confusion when new developers change the model, adapt the model or train their own one.

In the future I hope to get feat.params used better in order to automatically build frontend, derive feature extraction properties, hold metadata about model and similar things. Shiny future is getting closer.

I also removed RM1 model from the distribution. I don't think anybody is using it.

So please don't complain; let's fix it before it's too late to fix. One day we'll get everything in place and release the final version, sphinx4-1.0. After that we'll certainly be backward-compatible. I really like Java and Windows because of their long-term backward-compatibility policy. We can do even better.

Speech Like WWW

Talking about custom speech application development, I've got a thought. There are quite a lot of speech companies already. Speech application development is actually quite similar to UI design or web design in the sense that you need to be a specialized expert to create a speech interface. What if speech developers become like web designers, thousands of whom build customized websites all over the world every day? What if the market is so huge that it will be possible to run hundreds of shops, each working on customer needs?

To be honest I don't quite like web development. It sounds very strange to me that you MUST pay at least $1000 to build something that looks pleasant. And for big websites it's way more. Whoever created this market didn't think about business, he designed HTML in order to drain money from small and big companies. I tried to create a few websites myself, for example the CMUSphinx website. Even with all the modern tools, CMS platforms, themes and stuff, the reality is that you need to be an expert. Otherwise the result will not be satisfactory. Menus will overlap, regions will not be aligned, pictures will be blurry and colors will not match. Can it be different? Certainly it can, but not in this world. I can understand that creativity can't be automated; I can't understand that creativity is required for every company.

There are similar things in software development. For example, if you want to develop a telco app you probably want to hire an Asterisk developer, and there are thousands of Asterisk developers out there. Or if you want JBoss you can find JBoss experts. But I think that if you know how to develop applications, you can configure Asterisk properly and you can create a bean to access the database.

Are we interested in creating a huge and diverse speech industry? Can CMUSphinx be the foundation for it? No definite answer for now.

Recent issues

Heh, this month I discovered a few critical issues in CMUSphinx.
  1. Pocketsphinx doesn't properly decode short silences in FSG/JSGF mode
  2. Sphinx4 doesn't really work with OOV loop in grammar
  3. Pocketsphinx n-best lists are useless because of too many repeated entries
  4. Pocketsphinx accuracy is way lower than sphinx3 one
  5. Supposedly-working sphinxbase LM stuff doesn't work with 32-bit DMP, thus no MMIE training for very large vocabulary
  6. MMIE itself doesn't improve accuracy (tested on Voxforge and Fisher)
  7. It's impossible to extract mixture_weights from recent sendumps in pocketsphinx models, python scripts in SphinxTrain are outdated
  8. PTM model adaptation doesn't work
  9. TextAligner demo from sphinx4 requires way more work to align properly

That's getting crazy, I wonder if I'll be able to find the time to fix all that.

Do You Want To Talk To Your Computer?

Thanks everyone who voted at

To be honest I was surprised, because my opinion is just the reverse of this result. I strongly disagree that command and control will ever be usable. Dictation probably will, but definitely not command and control. I have a hobby: collecting complaints about voice control. Here are a few.

As an article in PCWorld says:

If so, you'll love what Microsoft is offering: voice recognition over the air, in which your commands are processed by a server in the clouds and converted into action on your smartphone. Boy, let's burn up those minutes and data plans! And waaait for the slow, usually incorrect response. Android has a similar capability for search, and it's amazingly frustrating to use, not to mention inaccurate.

The one good thing about Microsoft's fantasy about voice-command interfaces: You'll be able to identify a Windows Phone 7 user easily. Just listen for the person pleading with the phone to do what he asked. While the rest of us are quietly computing and communicating, he'll be hard to miss.

Another post from CNET

One of the major reasons why speech recognition has not caught on or been seriously looked at in terms of major finances is because people, if given the option of accurate speech recognition too, would still not wanna go for the voice commands, but would rather just use touchscreen. This is because voicing out takes more energy off a person than smoothly running yer fingers on the screen in your hand. Intuitive touchscreens and cleaner interfaces are far better a tool to invest in than making people accustomed to say out words that computers need to understand, process and then implement. Its way easier for the user (in terms of energy used to say it out) to just press the button on touchscreen. There will be certain exceptions, but I'm talking on a mass consumer adoption assumption.

I strongly believe that when we want to communicate with a computer, there are better ways than giving it voice commands. Yes, speech is a natural way of communication, but it's communication between people. When you communicate with a machine you don't necessarily need to speak to it; there are more efficient ways. Even if you are driving.

On the other hand, I think that analytics, speech mining and similar stuff have a very bright future. According to DMG Consulting the growth of this market will be 42 percent in 2011. That's true potential. Speech recognition should seamlessly plug into communication between people and extract value from it. Being non-intrusive, it doesn't break patterns but helps to create information. That's why we invest so much into mining and not into command and control. That's also the reason I don't want to invest too much time in gnome-voice-control.

No More Word Error Rate


What he said:

Hi Brad, it's Mike. I had a lunchtime appointment go long and I am bolting back to Evans. I'll be there shortly. See you soon. Thanks.

What Google Voice heard:

That it's mike. I had a list of women go a long and I am old thing. Back evidence. I'll be there for me to you soon. Thanks.

The interesting thing is that it got 17 out of the 26 words right--but those 17 words convey almost none of the information in the message...
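
For reference, word error rate itself is just the word-level edit distance divided by the number of reference words, which is exactly why it treats "lunchtime appointment" and "thanks" as equally important. A minimal sketch in Python (plain whitespace tokenization is an assumption here, real scoring tools also normalize case and punctuation):

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate relative to the reference length."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / ref_len
```

Nothing in this number knows which words carry the meaning of the message.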

I found this paper

Is Word Error Rate a Good Indicator for Spoken Language Understanding Accuracy
Ye-Yi Wang and Alex Acero

It is a conventional wisdom in the speech community that better speech recognition accuracy is a good indicator for better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than word error rate reduction, the language model for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. The model was obtained with an example-based learning algorithm that optimized the understanding accuracy. Although the speech recognition word error rate is 46% higher than the trigram model, the overall slot understanding error can be reduced by as much as 17%.

We definitely need to address it in sphinx4.

Reading Interspeech 2010 Program

Luckily speech people don't have so many conferences. In the machine learning world it seems to be getting crazy: you can have a conference every month, and researchers travel more than sales managers. In speech there are ASRU and ICASSP, but they don't really matter. It's enough to track Interspeech. Since Tokyo is too far, I'm just reading the abstract list from the program. First impressions are:

  • Keynotes are all boring
  • Interesting rise of the subject "automatic error detection in unit selection". At least three (!) papers are presented on the subject, while I hadn't seen any before. Looks like the idea appeared in less than a year! Are they spying on each other?
  • RWTH Aachen presented enormous amount of papers, LIUM is also quite fruitful
  • Well, IBM T. J. Watson Research Center is active as well, but that's more of a tradition
  • I came across this in one paper: "yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%". Was it worth writing an article?
  • Cognitive status assessment from speech is important in dialogs. SRI is doing that
  • Strange that reverberation issues are treated as a separate class of problems and are so widely covered. The problem as a whole looks rather generic: create features stable to noise and corruption. Not sure how reverberation is special here
  • WFST is loudly mentioned
  • Andreas Stolcke on SRILM noted that pruning doesn't work with a KN-smoothed model! Damn, I was using it
  • Only 2 Russian papers at the whole conference. Well, it's 50% growth over the previous year. And one of them is on speech recognition, that's definitely progress
  • Surprisingly, not so much research on confidence measures! Confidence is a REALLY IMPORTANT THING

Reading the abstracts, I also selected some papers which could be interesting for Nexiwave. You'll probably find this list easier to read than the 200 papers from the original program. Let's hope this list will be useful for me as well. To be honest, I didn't manage to read the papers I selected last year from Interspeech 2009.

Voicemail transcription with Pocketsphinx and Asterisk

This is for admins who are aware that pocketsphinx exists and want to try it. It describes how to quickly set up voicemail transcription using pocketsphinx and Asterisk. The process is extremely simple, I promise it will not take more than 5 minutes.

We'll use an external shell command invoked when a voicemail arrives, and this command will transcribe the voicemails which aren't transcribed yet. We won't postprocess the result or clean it up. The goal is just to show how to do it quickly and how an Asterisk interface can be built.
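
The "transcribe whatever isn't transcribed yet" part can be sketched as follows. This is an illustration in Python rather than the shell command from the original setup, and it rests on two assumptions: recordings are .wav files in one spool directory, and a transcript is stored next to each recording with a .txt extension. The `transcribe` argument is a hypothetical callable that would wrap the actual pocketsphinx invocation.

```python
from pathlib import Path

def transcribe_pending(spool_dir, transcribe):
    """Transcribe every .wav in spool_dir that has no .txt next to it.

    transcribe is a hypothetical callable: Path to a wav file -> text.
    Returns the names of the files that were transcribed in this run.
    """
    done = []
    for wav in sorted(Path(spool_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            continue  # already transcribed on an earlier run
        txt.write_text(transcribe(wav))
        done.append(wav.name)
    return done
```

Because existing transcripts are skipped, the command is safe to run on every new voicemail.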

Next Routine Release

So today we released Sphinx4-Beta5. I'm not very pleased with the number of features that went in, I wanted to do more, but I'm very pleased with the increasing number of people who contributed to this release. From release to release the thanks list is definitely getting longer.

These six months I finally went into the deep blue area of the LexTree linguist. So the major feature of this release is a significantly reworked LVCSR search, which is supposed to be faster, at least sometimes. Careful testing would be welcome here.

Another important but less visible thing is the first bit of the application-oriented public API. There will be no public XML configs anymore! The Aligner demo that comes with sphinx4 is the first step towards that. The configs for typical applications will be stored inside the jar and are meant to be used by developers who want to tweak the engine. No demo will use XML anymore. This rework of the public API, together with optimization of the data flow inside the engine, is our top priority for the next six months. We shall discuss in February how it goes.

Patents in ASR

I was pleased to read this review on Nuance vs Vlingo litigation.

Based on this analysis, we can find no evidence supporting the “seven figure” price which Vlingo paid to acquire the ‘354 patent from Intellectual Ventures. It is unclear how, armed with a broad spectrum of information, Vlingo would have justified their purchase. The implications of the Vlingo v. Nuance litigation hold tremendous implications for corporate IP strategies.

Senone Tree Implementation For Sphinx4

I spent the last month working on a senone tree linguist for sphinx4 as part of Nexiwave's sphinx4 performance project. Well, mostly I was fixing bugs in my initial implementation. The core idea of the senone tree, which was suggested to me by Bhiksha, is the following. The lextree is a representation of all possible words in a dictionary, built from triphones. The lextree is used to explore the search space during decoding. The very good thing is that, since the number of HMMs is rather small compared to the number of triphones (40000 vs 100000), the lextree is a rather compact representation of the search space.
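
To see why a lextree is compact, here is a toy prefix tree over pronunciations in Python. It is only a sketch under simplifying assumptions: nodes are context-independent phones instead of triphone HMMs, and the class and function names are invented for the example, not taken from sphinx4.

```python
class LexNode:
    """One node of a toy lexical tree."""
    def __init__(self):
        self.children = {}  # phone -> LexNode
        self.words = []     # words whose pronunciation ends at this node

def build_lextree(dictionary):
    """Build a prefix tree from a mapping word -> list of phones."""
    root = LexNode()
    for word, phones in dictionary.items():
        node = root
        for phone in phones:
            # Words sharing a pronunciation prefix share the same nodes.
            node = node.children.setdefault(phone, LexNode())
        node.words.append(word)
    return root

def count_nodes(node):
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())
```

For "cat", "cab" and "can" the tree has six nodes including the root, while the three pronunciations listed separately take nine phones; the shared prefix K AE is stored once.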

Speech Decoding Engines, Part 2. SCARF, The Next Big Thing In Machine Learning

It seems that HMMs will not stay forever. If you aren't tied to speech and track the big things in machine learning, you should have heard about that new thing, Conditional Random Fields. According to the recently started but very promising Metaoptimize, it's one of the most influential ideas in machine learning.

And, surprisingly, you can already apply this thing to speech recognition, thanks to Microsoft Research, including Geoffrey Zweig and Patrick Nguyen. It's SCARF, a Segmental Conditional Random Field Speech Recognition Toolkit, which is at version 0.5 now. You can download its sources from the Microsoft Research website.

Testing CMUSphinx with Hudson

Like every high-quality product, CMUSphinx spends a lot on testing. It isn't a trivial task, because you need to make sure that all the important parameters are improved or at least not regressed. That includes decoding accuracy, speed and API specs. Sometimes changes improve one thing and make another worse. Things are going to change with the deployment of the continuous integration system Hudson.

A quite sophisticated system of tests was created to track changes. It included perl scripts, various shell bits, a mysql database and even commits to the CVS repository. It was also spamming the mailing list all day with long and unreadable emails. Another bad thing was that it was based on private commercial data like the WSJ or TIDIGITS databases, but now everything is changing with the Voxforge test set. Our goal is to let you test and optimize the system yourself.

HTK Competition Voting

Thanks everyone for your feedback, results are really interesting to see.

HTK competition is something that I was worrying about for a long time. One key issue I see is that the htk-users mailing list definitely has much deeper discussions about ASR than we have on our forum. Hopefully, the situation will change.

Anyway, our goal is still to provide very accurate speech recognition, and this is a task not yet solved, with many issues both in usability and accuracy. So we can definitely learn from each other and improve our projects.

Sphinx4 Powers Contemporary Art

Did you think that sphinx4 could only be used to build another keyboard, help you track a sales manager blaming the product, or transcribe medical dictation? Working with computers on a daily basis, one starts to consider them a tool. I was thinking this way, not taking into account the fact that the speech act itself, powered by computers, probably has a sacred meaning. Communication was the thing that created our mind, and keyboards aren't important when we create communication systems.

The thing that pushed me to this is Heather Dewey-Hagborg's blog. In particular it was the Listening Post, an artistic thing from the CEPA gallery. If you are interested, please also check Heather's interview on BTR Radio. Check also the gallery's site.

An important point here is that I think we should not consider this as some kind of futurism: talking computers, HAL and all that stuff. Instead, such things help us to change ourselves and change our vision of the world around us. Probably next time you'll look at the sphinx4 sources from a different point of view.

Great Overview Article

Today Dr. Tony Robinson gave me a present by mentioning this great article on comp.speech.research

Janet M. Baker, Li Deng,
James Glass, Sanjeev Khudanpur,
Chin-Hui Lee, Nelson Morgan, and
Douglas O’Shaughnessy

Research Developments and Directions in Speech Recognition and Understanding, Part 1
Research Developments and Directions in Speech Recognition and Understanding, Part 2

This article was the “MINDS 2006–2007 Report of the Speech Understanding Working Group,” one of five reports emanating from two workshops titled “Meeting of the MINDS: Future Directions for Human Language Technology,” sponsored by the U.S. Disruptive Technology Office (DTO). For me it was striking that spontaneous events are so important, I never thought about them from this point of view.

The whole state of things is also nicely described in Mark Gales' talk "Acoustic Modelling for Speech Recognition: Hidden Markov Models and Beyond?" The picture on the left is taken from it.

Blizzard Challenge 2010

Since I was in TTS for a long time and am still interested in it, I've been waiting a long time for this: the Blizzard Challenge team is ready to accept speech experts and volunteer listeners for the Blizzard Challenge 2010.

The challenge was devised in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

After evaluation participants submit papers where they describe the methods used and problems solved. You could find more information on the webpage

KISS Principle

Still think that you can take the sphinx4 engine and make a state-of-the-art recognizer? Check what the AMI RT-09 entry is doing for meeting transcription in the presentation from the RT'09 workshop, "The AMI RT’09 STT and SASTT Systems":

  1. Segmentation
  2. Initial decoding of full meeting with

    • 4g LM based on 50K vocabulary and weak acoustic model (ML) M1
    • 7g LM based on 6K vocabulary and strong acoustic model (MPE) M2
  3. Intersect output and adapt (CMLLR)
  4. Decode using M2 models and 4gLM on 50k vocabulary
  5. Compute VTLN/SBN/fMPE
  6. Adapt SBN/fMPE/MPE models M3 using CMLLR
  7. Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and output of previous stage
  8. Generate 4g lattices with adapted M4 models
  9. Rescore using M1 models and CMLLR + MLLR adaptation
  10. Compute Confusion networks
Click on image to check the details of the process.

Campaign For Decoding Performance

We spent some time to make speech recognition backend faster. Ben reports in his blog the results on moving scoring to GPU with CUDA/jCUDA, which reduced scoring time dramatically. That's an improvement we are happy to apply in our production environment.

We consider the GPU not just a computation speedup but a paradigm shift. Historically, search was optimized to keep the number of scored tokens small, since it affected accuracy. Now scoring is immediate, but that means other parts should be changed. There are a few issues to smash on the way:

We really aim to make it even faster; in particular, we would really like to solve the grow part problem.

Want To Learn How Sphinx4 Works? Help With Wiki!

A long time ago, when sphinx4 development was active, the team used a twiki hosted at CMU. Unlike in many open source projects, this wiki was not just a collection of random stuff but a complete project support system. It contained meeting notes, design decisions and prototyping results. You could find diagrams there and explanations of what grow skipping or skew pruning is.

This wiki died some time ago due to administrative issues, but that was even for the better, since its content was merged into our current main wiki at sourceforge, in the sphinx4 namespace. Unfortunately the formatting was lost during the transition, since dokuwiki formats aren't always the same as twiki ones. That's actually not so bad either, because the content needs to be renewed to fit the current state of sphinx4.

So right now there are about 170 pages in total about sphinx4. Some of them are useful, some aren't. They definitely contain deep knowledge of sphinx4 internals, something that will probably help you next time you optimize the performance of large vocabulary recognition with sphinx4. I'm in the process of slowly sorting them out, but that will take a lot of time. It's your chance to join, help and learn!

Not All Speaker Adaptations Are Equally Useful

Some time ago I was rather encouraged by VTLN, which is vocal tract length normalization. By so-called frequency warping it tries to unify the vocal tract length of all speakers and thus build a better model. It's done by shifting and adjusting mel filter frequencies. This is implemented in Sphinxtrain/Pocketsphinx/Sphinx4. Basically all you need is to enable it in sphinx_train.cfg

$CFG_VTLN = 'yes';

And run the training. It will extract features for all frequency warp parameters with some step (take care about disk space) and will find the best one by forced alignment of each utterance. Then it will create new fileids and transcription files referring to the file with the proper warp parameter.

To decode with a VTLN model you need to guess the warp parameter. There are several algorithms suggested for that. One analyses pitch, others employ a GMM for classification. Then you need to re-extract features with the predicted warp parameter. It gives some visible improvement in performance.
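
The search over warp parameters during training can be pictured like this. It is a sketch in Python, not the SphinxTrain code: `alignment_score` is a hypothetical callable that would extract features with the given warp factor, force-align the utterance and return its likelihood, and the range and step in the usage note are made-up values.

```python
def warp_candidates(start, end, step):
    """List warp factors from start to end inclusive with the given step."""
    count = int(round((end - start) / step))
    # Round to hide floating point noise such as 0.9000000000000001.
    return [round(start + i * step, 6) for i in range(count + 1)]

def best_warp(utterance, candidates, alignment_score):
    """Pick the warp factor with the highest forced-alignment score.

    alignment_score is a hypothetical callable: (utterance, warp) -> float,
    where a higher value means a better alignment.
    """
    return max(candidates, key=lambda warp: alignment_score(utterance, warp))
```

For example, `warp_candidates(0.8, 1.2, 0.1)` gives five candidates, so features are extracted five times per utterance, which is where the disk space goes.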

But recently a set of articles like


came into my sight thanks to antonsrv8. The simple idea is that any transform we apply to features, especially a smooth one, can mostly be replaced by a linear transform of MFCC coefficients, basically by an MLLR transformation. This rather obvious fact makes me wonder whether we really need other transformations if MLLR is generic enough. It's no harder to estimate MLLR than to estimate the warp factor, especially if there is enough data, which is usually the case. Any other transformation applied will just conflict with MLLR. On large data sets this is confirmed by the experimental results in the article above.
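
To make the "linear transform of MFCC coefficients" concrete, this is what applying such a transform looks like: y = A x + b for a feature vector x. The sketch below is plain Python with hand-picked numbers; in a real system A and b would be estimated from adaptation data, which is not shown here.

```python
def apply_affine(matrix, bias, features):
    """Return A*x + b for a square matrix A, bias b and feature vector x."""
    if len(matrix) != len(bias):
        raise ValueError("matrix and bias dimensions do not match")
    result = []
    for row, offset in zip(matrix, bias):
        if len(row) != len(features):
            raise ValueError("matrix row length must equal feature dimension")
        result.append(sum(a * x for a, x in zip(row, features)) + offset)
    return result
```

A frequency warp changes the same coefficients in a smooth way, which is why a matrix like A can absorb most of its effect.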

Of course a non-linear transform could be better than a linear one, but it seems it's certainly not VTLN. I hope the latest state of the art in voice conversion could suggest something better.

Update: this point was of course largely covered in research papers. Good coverage with math and results is provided in Luís Uebel's thesis:

Speaker Normalisation and Adaptation in Large Vocabulary Speech Recognition by Luís Felipe Uebel

Recognizer Language Voting Over

So, language voting is over. It seems that despite performance issues we currently face Java gets enough attention. Thanks for sharing your opinion, it's very important for us.

Speech Decoding Engines Part 1. Juicer, the WFST recognizer

ASR today is quite diverse. While in 1998 there was only the HTK package and some in-house toolkits like CMUSphinx, which was released in 2000, now there are a dozen very interesting recognizers released to the public and available under open source licenses. Today we are starting a review series about them.

So, the first one is Juicer, the WFST recognizer from IDIAP, University of Edinburgh and University of Sheffield.

Weighted finite state transducers (WFST) are a very popular trend in modern ASR, with very famous addicts like Google, the IBM Watson center and many others. Basically the idea is that you convert everything into the same format, which allows not just a unified representation but also more advanced operations, like merging to build a shared search space or reducing to make that search space smaller (with operations like determinization and minimization). The format also provides you interpolation properties; for example, you don't need to care about g2p anymore, it's done automatically by the transducer. For WFST itself, I found a good tutorial by Mohri, Pereira and Riley, "Speech Recognition with Weighted Finite-State Transducers".

Juicer can do very efficient decoding with a standard set of ASR tools: an ARPA language model (bigram due to memory requirements), a dictionary and cross-word triphone models, which can be trained with HTK. The BSD license makes Juicer very attractive. Juicer is part of the AMI project, which targets meeting transcription; other AMI deliverables are a subject for separate posts, though.

Face Recognizers, Bloom filters and Application to Speech Recognition

In the waterfall of scientific papers we have today, I continuously face the issue of selecting the important high-level approaches to a problem. Many ideas are definitely important and lead to accuracy improvements, but they certainly don't count as core ones. Like yet another feature extraction algorithm that could bring you a 2% performance improvement. I definitely miss high-level, up-to-date reviews that could lead into the world of possible approaches with their advantages and disadvantages. I was counting on books for that, but unfortunately they aren't as accessible as papers.

Some time ago I went into reading the core face detection paper by Viola and Jones about Haar cascades for object detection. It struck me that their method which appeared to be very fruitful in face and object detection didn't get into common practice in speech recognition.

Basically the idea of their method is that it's possible to reduce the search space significantly with a set of very weak classifiers. For example, you can easily find out that there is no face on the green grass, and thus you can skip this region. The fruitful idea here is that you can classify negatives much more accurately than positives. Putting such classifiers into a cascade makes the search space tiny and recognition fast and efficient. It's certainly not the only algorithm of this type; another one I met recently is the Bloom filter, which uses almost the same trick for efficient hash search.

Transferring this to ASR is rather straightforward. We need to train weak classifiers that reject phone hypotheses for a given set of frames. That's actually quite easy with an SVM or something built on top of an existing HMM segmentation. Next, we could also apply this to the language model and reject hypotheses that aren't possible in the language.
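
To make the cascade idea a bit more concrete, here is a minimal sketch in Python. The segment features (`energy`, `zcr`) and the thresholds are made up for illustration; a real system would use trained weak classifiers, as described above.

```python
def run_cascade(candidates, stages):
    """Pass candidates through cheap rejectors, cheapest first.

    Returns the survivors and the number of stage evaluations performed.
    """
    survivors, evaluations = [], 0
    for cand in candidates:
        for stage in stages:
            evaluations += 1
            if not stage(cand):
                break              # rejected early, later stages never run
        else:
            survivors.append(cand)  # no stage rejected this candidate
    return survivors, evaluations

# Hypothetical features for the hypothesis "this segment is a vowel".
segments = [
    {"id": 0, "energy": 0.01, "zcr": 0.05},   # silence
    {"id": 1, "energy": 0.80, "zcr": 0.10},   # plausible vowel
    {"id": 2, "energy": 0.60, "zcr": 0.70},   # fricative-like
    {"id": 3, "energy": 0.02, "zcr": 0.60},   # low-energy noise
]
stages = [
    lambda s: s["energy"] > 0.1,   # a vowel is not silent
    lambda s: s["zcr"] < 0.3,      # a vowel has few zero crossings
]
survivors, evaluations = run_cascade(segments, stages)
```

Only segment 1 survives, and two of the four segments are thrown away after a single cheap test. The expensive acoustic scoring would then run only on the survivors.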

I haven't seen any papers on that, probably I need to search more. This idea is certainly worth trying, and it should get into common ASR practice alongside discriminative training, adaptation with linear regression or multipass search.

Great Move To Nexiwave

We decided to move all our blogs, like Ben's blog, news about SearchMyMeetings and others, to Nexiwave. Such consolidation will help us manage our resources and will improve our presence on the web. Being placed more officially, we will be more responsible for the content as well, so I hope more useful materials about speech recognition, CMUSphinx and other related things will appear soon.

Sorry for the inconvenience.

Intelligent Testing In ASR

To continue the previous topic about testing, I want to share a nice paper I read some time ago, which I would like to bring into our daily practice.

The issue is that the current way we test our systems is far from optimal; at least there is no real theory behind it. In practice I usually apply the 1/10th rule, where I split the data into a 9/10 training set and a 1/10 testing set. This was done in Voxforge as well. It's not such a good thing, since with 70 hours of Voxforge data the test set grows to 7 hours and takes ages to decode. I took this rule from Festival's traintest script. That's more or less common practice in ASR, while things like 10-fold cross-validation aren't popular, mostly for computational reasons. Surprisingly, problems like that could be easily solved if only we focused on the real goal of testing: estimating the recognition performance. All our test sets are oversized, as one can easily see by looking at the decoder results during testing. They tend to stabilize very quickly unless there is some data inconsistency.

Speech recognition practice unfortunately doesn't cover this, even in scientific papers. Help comes from character recognition. The nice paper I found is:

Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir Vapnik
What Size Test Set Gives Good Error Rate Estimates?

The authors address the problem of determining what size of test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate. The paper is well written and rather easy to understand. There is no complex model behind the testing, nothing speech-specific. There are two valuable points there:
  1. The approach that puts reasoning behind the test process
  2. The formulae themselves
To put it simply:

The test set for a medium vocabulary task can be small. If the word error rate is expected to be around 10%, then by the table on page 9, to compare two configurations with a difference of 0.5% absolute you need only 13k words of data.

That's four times smaller than the current Voxforge test set. I think this estimate can be improved even further if we specialize it for speech. I really hope this result will be useful for us and will help us speed up the process of application testing and optimization.
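
To get a feeling for this kind of reasoning, here is a minimal sketch in Python. It is not the formula or the table from the paper. It is only the textbook normal approximation to the binomial, and it assumes word errors are independent, which is certainly not quite true for speech. The function names are mine.

```python
import math

def error_rate_margin(error_rate, n_words, z=1.96):
    """Half-width of the confidence interval for an error rate measured on
    n_words words (normal approximation, z=1.96 for about 95%)."""
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n_words)

def words_needed(error_rate, margin, z=1.96):
    """Smallest number of words for which the half-width is at most `margin`."""
    return math.ceil(z * z * error_rate * (1.0 - error_rate) / (margin * margin))

# How many words to pin down a 10% error rate to within 0.5% absolute?
n = words_needed(0.10, 0.005)
# And how tight is the estimate with a 13k-word test set?
m = error_rate_margin(0.10, 13000)
```

Under these assumptions `n` comes out a bit under 14k words, and a 13k-word set gives a margin of roughly 0.5% absolute. Comparing two configurations on the same data is a somewhat different question, so take this only as a sanity check of the order of magnitude.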

Testing ASR with Voxforge Database

In development and research the critical issue is proper testing. There was some buzz about that recently, for example at the MLoss blog, where the pros of using open data are considered. One interesting resource that started some time ago combines both open data and open algorithms, automatically selecting the best method for a common data set. I think it's not an easily implementable idea, because "best" is often different. Sometimes you need speed, sometimes generalization.

In our case by using open data you can easily solve the following problems:

  1. Test the changes you've made in the speech decoder and trainer on a practical large-vocabulary database
  2. Estimate how the recognition engine performs. It's not just about estimating the accuracy but also about other critical parameters like confidence score quality, decoding speed, lattice variability, noise robustness and so on.
  3. Share the bugs you've found. We can definitely fix minor problems that are easy to reproduce, but any serious problem ultimately requires a reproducible test example.

I actually wanted to describe how this works in practice right now. The solution we propose for CMUSphinx developers is the Voxforge database. It's not the only open data source out there, but I think it's the most permissive one. The old an4 is good for quick tests, but it definitely doesn't satisfy our needs, because anything except a large vocabulary recognizer makes little sense nowadays.

Finally Sorted Out Workshop Materials

Since they are more like official CMUSphinx documents, I posted the notes about the workshop and the meetings after it on the website:

Sphinx Users And Developers Workshop 2010 Results

Development Meeting Notes

I'm pleased that so many important things got planned and a few very important issues got cleared up. For example, before that I didn't completely understand why lattices in sphinx4 are so bad. I hope other participants had some productive results too.

It would be nice to get some pictures from the workshop, but unfortunately none are available now. Probably sometime later. So, just a Dallas one.


So, I'm back from ICASSP in Dallas, TX. It was a very impressive conference with lots of interesting and inspiring presentations, meetings and discussions. Amazingly, everyone was there, and I've finally met all the speech people who have guided me for such a long time. I've met ASR people Bhiksha Raj, David Huggins-Daines, Rita Singh, Richard Stern and TTS people Alan W. Black, Keiichi Tokuda, Simon King, Heiga Zen. I was pleased to meet wonderful guys like Evandro and Peter for a second time. It's worth mentioning that I was able to listen to talks by famous people like Hynek Hermansky.

We had the Sphinx Users and Developers Workshop there and also two CMU Sphinx development planning meetings, but they are a subject for another post. This one is just about interesting ideas presented at the conference by other people. I didn't have time to attend every presentation out there; I think it was impossible. You have to find time for sightseeing, and there were often two or three parallel lecture sessions plus poster presentations, which I liked most. I think a poster presentation is the best way to reach the author, ask questions and get feedback. Many posters were so popular it was almost impossible to get to the stand.

Anyway, the amount of talks I've got already exceeds what can be consumed in a week. It would be nice to one day get all the information about current research collected into a structured or wiki-like resource. That's a huge amount of work for the future though.

So here are some presentations and ideas I've met there and found worth attention:

Robust Speaking Rate Estimation Using Broad Phonetic Class Recognition

Authors: Jiahong Yuan, Mark Liberman (University of Pennsylvania)

The presented work is about using a simple classifier to get some specific data about speech, for example to estimate syllable deletions and thus speech quality. This is actually a very promising approach, which for some reason is ignored in most places where it seems practical. For example, it's not clear to me why speaker identification frameworks don't try to find phonetic classes first and build GMMs only after that. It seems to be a natural approach to improve SID performance.

Broad phonetic classes remind me of the idea from the famous face detection algorithm by Viola and Jones about applying a cascade for fast classification. I think this idea could be applied to speech in some form, like the authors suggest.

Clap Your Hands! Calibrating Spectral Subtraction for Reverberation

Authors: Uwe Zaeh, Korbinian Riedhammer, Tobias Bocklet, Elmar Noeth

Reverberation was a very popular topic at this conference, and it's especially important for meetings. Different speech systems require different noise cancellation. Far-field microphones need to deal with room reverberation, close-talking microphones need to deal with clicks, and so on. Far-field microphones sometimes do calibration to estimate reverberation. This defines the set of components sphinx4 would need in order to deal with various environment conditions. Right now they are simply missing.

Detecting Local Semantic Concepts in Environmental Sounds using Markov Model

Authors: Keansub Lee, Daniel Ellis, Alexander Loui

Interestingly, the classification database for this task is available online; this could be a base for non-speech recognition research.

Learning Task-Dependent Speech Variability In Discriminative Acoustic Model Adaptation

Authors: Shoei Sato, Takahiro Oku, Shinichi Homma, Akio Kobayashi, Toru Imai

Discriminative approaches are popular nowadays. Direct optimization of the cost function could serve at various stages of the training process. In this work, for example, the set of subword units is selected to minimize the decoding error rate.

An Improved Consensus-Like method for Minimum Bayes Risk Decoding and Lattice Combination

Authors: Haihua Xu, Daniel Povey, Lidia Mangu, Jie Zhu

This deals with specific criteria for lattice decoding. Not only the best path can be chosen; other criteria like consensus can apply as well. For me personally it would be very interesting to formalize and apply a criterion that ensures grammatical correctness of the result. I haven't found anything on this yet.

Discriminative training based on an integrated view of MPE and MMI in margin and error space

Authors: Erik McDermott, Shinji Watanabe, Atsushi Nakamura

Interesting to find out that real math goes into ASR. Basically it was a long-awaited thing, and it seems it was started by Georg Heigold with his work on MMI and other methods. It would be nice to review this area to get some idea of what the outcome is. Heigold was cited in almost every presentation, so it's really getting popular.

Balancing False Alarms and Hits in Spoken Term Detection

Authors: Carolina Parada, Abhinav Sethy, Bhuvana Ramabhadran

It's interesting to see what tools are used. WFSTs are very convenient and are used by everyone: IBM, Google, AT&T. This is also a topic for a separate post.

Bayesian Analysis of Finite Gaussian Mixtures

Authors: Mark Morelande, Branko Ristic

A rather old idea (there are similar papers from 1998): use Bayesian learning to estimate the number of mixtures in the model. I'm in favor of the approach that estimates all model parameters, including the number of mixtures, the language weight and the number of senones, at once.

Improving Speech Recognition by Explicit Modeling of Phone Deletions

Authors: Tom Ko

Modeling pronunciation variation by phone deletion looks very promising, since traditional linguists mostly complain about the sequential HMM model, which doesn't handle deletions correctly. Unfortunately, the effect seems to be small. The improvement cited is only from 91.5% to 92%.

An Efficient Beam Pruning With A Reward Considering The Potential To Reach Various Words.

Authors: Tsuneo Kato, Kengo Fujita, Nobuyuki Nishizawa

Beam pruning according to the number of reachable words or to another risk function. A good idea to implement in sphinx4 to speed up recognition. The speedup factor cited is 1.2 for a large vocabulary.

That's it. Unfortunately I missed Friday and most early mornings, so there could have been something interesting there too. I'm sure you could select your own set. It would be interesting to look at it.

Sphinx4 1.0 beta4 Is Released. What's next?

So, almost according to schedule, sphinx4 was released yesterday. Check the release notes.

The most notable improvements were already discussed here, so let me try to plan what the next release will be. Trying to be realistic in my plans, I don't want to promise everything at once. Here is an attempt to forecast the next release notes.

The biggest issue with sphinx4 is actually documentation. The current poll on the CMUSphinx website clearly shows that. Personally I sometimes think that perfect documentation will not help if the system doesn't work, but at least it will make the product attractive and easy to use. My idea is that we need more developer-level documentation: tutorials, examples, task-oriented howtos. It's unlikely we'll be able to write something good enough to be a textbook on speech technologies. But we need to prove the point that it's possible to build an ASR system without knowing who Welch is.

On the code side, we face the biggest challenge since sphinx4 was designed. We need to move to a multipass system. It's not just about rescoring; it's about plugging in the diarization framework from LIUM, and it's also about making sphinx4 suitable for both batch and live applications. That's a serious issue.

The reason is that the sphinx4 architecture is currently flow-oriented. It's built like a single pipe of components, each passing audio to the next. This is good for live applications, but not so good for batch ones. You get into trouble when you need to split the pipe or merge it later. In a batch application one could benefit hugely from looking at the recording as a whole and returning to it multiple times. For example, you could estimate the noise level properly on the first pass and just clean up the audio on the second. Such multipass decoding doesn't fit well into the pipe paradigm. On the other hand, changing it to purely batch will create issues for live applications.
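
The difference between the two styles can be sketched with a toy example in Python. I use removal of a constant offset as a stand-in for noise estimation, because it is the simplest thing that needs a statistic over the whole recording. The function names and the `alpha` parameter are illustrative only, none of this is sphinx4 code.

```python
def batch_normalize(samples):
    """Batch style: the first pass computes the mean over the whole
    recording, the second pass subtracts it."""
    mean = sum(samples) / len(samples)
    return [x - mean for x in samples]

def stream_normalize(samples, alpha=0.1):
    """Pipe style: one pass, the mean is estimated only from the samples
    seen so far, so the first outputs are still biased."""
    mean = 0.0
    for x in samples:
        mean += alpha * (x - mean)   # running estimate, cannot look ahead
        yield x - mean

recording = [5.0, 5.0, 5.0, 5.0]          # pure offset, no signal
batch_out = batch_normalize(recording)     # offset removed completely
stream_out = list(stream_normalize(recording, alpha=0.5))
```

The batch version returns all zeros, while the streaming version only converges towards zero as it sees more data. The price of the batch version is that nothing can be output until the whole recording has been read once.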

So we are in trouble. We probably have to invent some combined scheme, a hybrid of the pipe and batch approaches. I was thinking about a knowledge base scheme, where information about the stream is stored in some database as processing goes on. Database cleanup policies could emulate both the pipe approach (when the database is cleaned immediately) and the batch approach (when the database is kept even across sessions). By the way, Festival utterances remind me of such a data processing scheme. Anyway, this idea is not finalized yet.

We also expect to see a lot of movement from the CMUSphinx Workshop in Dallas and from Google Summer of Code participation. I hope the issues described above and some more interesting issues will be resolved by the next release in August. Let's discuss the rest then!

Speech Recognition in GSoC Done Right

From year to year many end-user projects try to push ASR with the help of Google and the students of the Summer of Code program. If the CMUSphinx team knows all about ASR, why should we stay away from that?

I have had mixed experience with Google Summer of Code before, but I still like the process and enjoy communicating with new people. I think we have a good chance to succeed here. So I started and filed an application proposal and the initial list of ideas.

I will submit this proposal on March 8 after the program starts. We need more ideas now, as many as you can generate.

We need to have a more or less representative list. If you want to be a mentor, don't hesitate to write down your IRC nick as well.

Noise reduction filtering in sphinx4

There is a huge gap between stock sphinx4 and a real ASR system, since critical parts like noise filtering, speaker diarization and postprocessing are missing, not to mention online adaptation. The default frontend is less than optimal for several reasons. For example, it doesn't handle DC offset at all, and it uses an energy-based endpointer in the time domain, which is not very robust to additive noise.

As of today sphinx4 includes an implementation of a Wiener filter that reduces noise and helps the voice activity detector as well. To try it, check out the latest trunk and change the frontend pipeline as follows:

<item>audioFileDataSource</item>
<item>dataBlocker</item>
<item>preemphasizer</item>
<item>windower</item>
<item>fft</item>
<item>wiener</item>
<item>speechClassifier</item>
<item>speechMarker</item>
<item>nonSpeechDataFilter</item>
<item>melFilterBank</item>
<item>dct</item>
<item>liveCMN</item>
<item>featureExtraction</item>

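Then define the wiener component (fill in the type attribute with the Wiener filter class from your sphinx4 checkout):

<component name="wiener" type="...">
    <property name="classifier" value="speechClassifier"/>
</component>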

This frontend is robust to DC offset and also handles noise better. To try it on noisy input, you can mix in white noise with sox:

 sox 10001-90210-01803.wav noise.wav synth white
 sox noise.wav smallnoise.wav vol -45d
 sox -m 10001-90210-01803.wav smallnoise.wav 10001-90210-01803-noisy.wav

It would be nice to try it with the Aurora database as well.

This filter is very simple and has a number of disadvantages. For example, it sometimes corrupts the spectrum with harmonic noise and thus makes recognition even worse. But it definitely helps in the presence of noise. Let's hope that one day more sophisticated implementations like the Ephraim-Malah filter, or even noise reduction with vector Taylor series, will be available in the default configurations.
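
For those who wonder what a Wiener-type filter does to the spectrum, here is a minimal sketch of the textbook per-bin gain in Python. It is not the sphinx4 implementation, just the basic formula G = SNR / (1 + SNR) with the SNR estimated as (noisy power - noise power) / noise power, which simplifies to (noisy - noise) / noisy. The `floor` parameter and the numbers are my own illustration. The noise power would normally be estimated on frames marked as non-speech, which is presumably why the component above needs a classifier.

```python
def wiener_gains(noisy_power, noise_power, floor=0.05):
    """Per-bin gain G = max(floor, (P_noisy - P_noise) / P_noisy).

    noisy_power: power spectrum of the current frame
    noise_power: noise power estimate for the same bins
    floor:       minimum gain, keeps bins from being zeroed completely
    """
    gains = []
    for p, n in zip(noisy_power, noise_power):
        if p <= 0.0:
            gains.append(floor)
        else:
            gains.append(max(floor, (p - n) / p))
    return gains

noisy = [10.0, 2.0, 1.0]     # strong speech bin, weak bin, noise-only bin
noise = [1.0, 1.0, 1.0]      # flat noise estimate
gains = wiener_gains(noisy, noise)
cleaned = [g * p for g, p in zip(gains, noisy)]
```

Bins dominated by speech are almost untouched, and bins that look like pure noise are pushed down to the floor. When the noise estimate is wrong for some bins, the same formula produces the kind of spectrum corruption mentioned above.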

All ideas are already generated

After seeing flash websites take an enormous amount of my CPU, I got a cool idea today about using flash for distributed computing. Basically everything is already in place. You set up a webserver and share content with flash; it runs on the client computer and does calculations, uploading the result from time to time. Certainly I wasn't the first to invent that.


Such ideas are rather recent though, and the question is how to make this framework widely used. Looking at the current load of my computer at sourceforge, it's most likely already used by some websites :)

Training process

What I really like in Sphinxtrain is that it provides a straightforward way to train an acoustic model. It remains unclear to me why everyone bothers with the HTKBook while there is a clean and easy way to train the model. One just has to define the dictionary and transcription and put the files in the proper folder. Anyway, I'm continuously thinking about how the sphinxtrain process could be improved. Currently it indeed lacks a lot of critical information on training, and that makes it look incomplete.

Basically here is what I would like to put into the next versions of sphinxtrain and the sphinxtrain tutorial:

  1. Description of how to prepare the data
  2. Building the database transcription. By the way, what has bothered me for the last month is the requirement to have fileids. I really think the fileids file could be silently dropped. What's the problem with getting the file ids from the transcription labels?
  3. Automatic splitting into training, testing and development data. I see the presence of development data as a hard requirement for the training process. Unfortunately, the current documentation lacks it. There could be code to do that, but for most databases it's automatic of course.
  4. Bootstrapping from hand-labelled data. I think of this as an important part of training; HTK results confirm that. In general it mirrors human language learning, so I think it's natural as well.
  5. Training
  6. Optimizing the number of senones and mixtures on a development set
  7. Optimizing the most important parameters, like language weight, on the development set. This part is complicated as I see it. First of all, the reasoning behind proper language weight scaling is still unclear to me; one day I could write a separate post on it. Basically it depends on everything, even on the decoder.
  8. Testing on the test set
If it's possible to keep this as straightforward as it is now, that would be just perfect. Probably, if I start writing the chapter within a week, it could be ready by summer.
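
Items 2 and 3 of the list above are easy to sketch in Python. This assumes the usual transcription layout where every line ends with the utterance label in parentheses, like `<s> hello world </s> (utt_001)`, and that the label can serve as the file id directly (no subdirectories). The split rule is just an example of mine, not something sphinxtrain does.

```python
import re

# The utterance label is the last parenthesized token on the line.
_LABEL = re.compile(r"\(([^()\s]+)\)\s*$")

def fileids_from_transcription(lines):
    """Collect file ids from transcription lines, skipping lines without a label."""
    ids = []
    for line in lines:
        match = _LABEL.search(line)
        if match:
            ids.append(match.group(1))
    return ids

def split_ids(ids):
    """Put every 10th id into test, the one after it into dev, the rest into train."""
    train, dev, test = [], [], []
    for i, fid in enumerate(ids):
        bucket = i % 10
        if bucket == 0:
            test.append(fid)
        elif bucket == 1:
            dev.append(fid)
        else:
            train.append(fid)
    return train, dev, test

transcription = [
    "<s> hello world </s> (utt_001)",
    "<s> good morning </s> (utt_002)",
    "",
]
ids = fileids_from_transcription(transcription)
```

For a real database one would rather split by speaker than by utterance index, so that the same voice doesn't end up in both training and test data.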

Moving Beyond the `Beads-On-A-String'

Recently I've got interested in quite a large domain of speech recognition research where old-school linguistics meets modern speech recognition. Basically the idea is that in spontaneous speech the variability is so huge that the phonetic transcription from the dictionary doesn't apply well. In a plain CMUSphinx setup the linguistic information about phones is almost lost; for example, we don't care whether a phone is labial or dental. It is used in decision tree building, but it's not clear if such usage helps. It's definitely not good to drop such a huge amount of information that could help with classification. So this idea is actively developed, and you can probably find there everything you miss: distinctive phone features, landmarks, spectrogram recognition.

I went through the following articles; the number of methods, approaches and implementations described there is really huge. In other articles it's going to be even bigger:

S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007.

Moving Beyond the `Beads-On-A-String' Model of Speech by M. Ostendorf

Speaking In Shorthand - A Syllable-Centric Perspective For Understanding Pronunciation Variation by Steven Greenberg

To be honest, the only idea from the articles that has grown in my mind is that reductions in fast speech are the root of the problem. I also noticed this in the early days and experimented with skip states. Skips didn't give any improvement, only reduced speed. It will probably help to automatically increase lexicon variability and use forced alignment to get the proper pronunciation, at least at the training stage. As I understand it, I just need to take a dictionary with syllabification and create a dictionary with a lot of reduced variants, where onsets are kept as is and codas are reduced in some form. Then we force align, then train. Probably the acoustic model will be better then.
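
Generating such variants could look roughly like this in Python. The sketch assumes the dictionary already gives each syllable as an (onset, nucleus, coda) triple of phone lists, and it uses the crudest possible reduction, dropping the whole coda. The function name and the example word are mine.

```python
from itertools import product

def reduced_variants(syllables):
    """Return all pronunciations where each coda is either kept or dropped.

    syllables: list of (onset, nucleus, coda) tuples, each a list of phones.
    Onsets and nuclei are always kept as is.
    """
    options = []
    for onset, nucleus, coda in syllables:
        full = onset + nucleus + coda
        if coda:
            options.append([full, onset + nucleus])   # kept or dropped coda
        else:
            options.append([full])
    # Every combination of per-syllable choices gives one pronunciation.
    return [sum(choice, []) for choice in product(*options)]

# "handbag" as HH AE N D . B AE G
handbag = [(["HH"], ["AE"], ["N", "D"]), (["B"], ["AE"], ["G"])]
variants = reduced_variants(handbag)
```

This gives four variants for the example, from the full HH AE N D B AE G down to HH AE B AE. The number of variants grows exponentially with the number of syllables, so forced alignment is really needed to select the ones that actually occur in the training data.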

Another striking point was that I haven't found any significant accuracy improvement reported in the articles I read. An improvement like the 20% from discriminative training could make any method widely adopted, but nothing like that is mentioned. Probably this research is at a very early stage.

Three Generation of IVR Systems

Recently I invented a nice new concept for marketing people. Basically there are three generations of IVR systems right now:
  • Generation 1.0 - Static systems based on VoiceXML. It was surprising to me that they are in wide use now and that a lot of products are dedicated to their optimization/development. There are IDEs and a lot of testing tools and recommendations on how to build proper VoiceXML. Come on, it's impossible to do that. It's something like the static HTML websites that were popular in 1995. I don't believe any changes like JavaScript inside VXML 3.0 will stop its slow death.
  • Generation 2.0 - Dynamic systems like Tropo from Voxeo. Much easier, much better. More control over content, more integration with the business logic. I really believe it's the next generation because it gives the developer much more control over the dialog. At least with the power of a real scripting language like Python you'll be able to implement something non-trivial with just several lines of code. That's the AJAX or RoR of the speech world.
  • Generation 3.0 - Semantic-based IVR. This consists of three components: a large vocabulary recognizer, a semantic recognizer on top of it and event-based actions on top of that. Probably also emotion recognition and more intelligent dialog tracking. As I see it, the developer has to define the structure of the dialog and provide handlers. Such a system was described and developed at CMU a long time ago already, and it's also described in all ASR textbooks. But I'm not aware of any widely known platform that allows this kind of IVR. Once again it shows how big the gap is between academia and software developers.
If you are planning to create an IVR application with CMUSphinx, please consider IVR generation 3 as your base technology ;) And don't forget to share the code.


Very much on the same topic from a wonderful Nu Echo blog:

PLP is going to be default soon

It looks like MFCC features are going to become history. Everyone is using 9 stacked PLP frames followed by an LDA projection to 40-50 values. A few examples include Google in its audio indexing system, IBM and BBN (see the system descriptions in the results), OGI/ICSI and many others.
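
The frame stacking part is simple enough to sketch in Python. I assume 13-dimensional PLP vectors here, which is only my guess at a typical size, and edge frames are repeated at the borders. The LDA projection itself is not shown, since it needs a trained matrix.

```python
def stack_frames(frames, context=4):
    """Concatenate each frame with `context` neighbours on each side.

    With context=4 every output vector covers 9 frames. At the edges the
    first or last frame is repeated.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), n - 1)])   # clamp to valid range
        stacked.append(window)
    return stacked

# 20 fake frames of 13 values, frame t is filled with the value t.
frames = [[float(t)] * 13 for t in range(20)]
stacked = stack_frames(frames)
```

Each stacked vector has 9 x 13 = 117 values under this assumption. The LDA step then multiplies it by a matrix that brings it down to the 40-50 values mentioned above.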

The issue right now is that the sphinx4 PLP implementation seems to be broken: it produces kind of garbage features which don't give enough accuracy after training. Luckily there is HTK. Once this issue gets fixed, I think I'll retrain a PLP + MLLT model for Voxforge. Unfortunately I don't have any definite plan for implementing PLP in sphinxbase.

Greetings and Random Thoughts

So 2010 is here, Happy New Year everyone. I wish you all success and happiness and of course increased decoder accuracy! Now we have a long 10-day vacation in Russia: time to travel, eat, drink, sort out bookmarks, read the books on the shelf and watch pending Google tech talks. Santa also promised me some great changes in sphinx4, so I'm waiting for that as well.

Though Ohloh doesn't confirm it, I have a strong feeling that last year the activity around CMUSphinx definitely increased and its usage is going to grow.

I was thinking a little about what the direction of sphinx4 development should be; I think we should consider several factors here. I would be happy to see it as a widely used enterprise-level speech recognition engine with a great list of features, but I completely understand that, given the lack of resources, it's naive to think we'll be able to do it all. We definitely need to find a market sector for the sphinx project and grow from it. There are already well-established projects like HTK that are widely used, with their own sets of strong and weak features. Julius is widely used as a large vocabulary speech recognition engine with HTK models. It's hard for us to compete with HTK, just because it would take years to add flexibility we probably don't even need. Consider a variable or adjustable number of states per phone: something that is only proven to be useful for small vocabulary tasks, something we aren't really interested in and, I hope, will not be interested in in the near future. What could be different is our practical orientation.

Many projects in the speech domain and related areas have grown out of research projects and, though sometimes flexible, are often really unusable in applications since they aren't designed for that. Usually a research project isn't well documented and has a lot of ways to implement the same thing, some of them obsolete. Bugs are rarely fixed and documentation is almost missing. Releases are not stable. It's definitely a large field for a commercial support company.

There is a different side: many projects are created to solve user needs, are more or less well documented and have stable interfaces and a large open community, but they do things so wrong internally that I always wonder how they are used at all. Take Espeak, with its amazingly bad speech synthesis quality and even more amazing popularity. Its out-of-date synthesis method doesn't let it become good with any possible modifications. Another striking example of this is Lucene. Although the lucidimagination blog states that the Lucene community is thriving, that's definitely not true. Research articles like "Lucene and Juru at Trec 2007: 1-Million Queries Track" definitely show there is something wrong with Lucene. Basically it lists several trivial changes, well known in the research community, that make Lucene perform two times better on a standard test. I can't understand why this wasn't integrated into stock Lucene three years after the article was published.

Let's hope CMUSphinx will find its place somewhere in the middle. Also, let's hope this year will bring more useful posts, decreasing the information overload that is certainly going to be a problem in the near future.
