nsh - Speech Recognition With CMU Sphinx

A blog about speech technologies: recognition, synthesis, identification. It is mostly about the scientific side (the core design of the engines, new methods, machine learning) and about the technical side, such as the architecture of the recognizer and the design decisions behind it.

Monday, March 1, 2010

Sphinx4 1.0 beta4 Is Released. What's next?



So, almost according to schedule, sphinx4 was released yesterday. Check the notes at

http://cmusphinx.sourceforge.net/2010/03/sphinx4-1-0-beta-4-released/

The most notable improvements were already discussed here, so let me try to plan the next release. I want to be realistic and not promise everything at once. Here is an attempt to forecast the next release notes.

The biggest issue with sphinx4 is actually documentation. The current poll on the CMUSphinx website clearly shows that. Personally, I sometimes think that perfect documentation will not help if the system doesn't work, but at least it makes the product attractive and easy to use. My idea is that we need more developer-level documentation: tutorials, examples, task-oriented howtos. It's unlikely we'll be able to write something good enough to serve as a textbook on speech technologies. But we need to prove the point that it's possible to build an ASR system without knowing who Welch is.

On the code side, we face the biggest challenge since sphinx4 was designed. We need to move to a multipass system. It's not just about rescoring; it's about plugging in the diarization framework from LIUM, and it's also about making sphinx4 suitable for both batch and live applications. That's a serious issue.

The reason is that the current sphinx4 architecture is flow-oriented. It's built as a single pipe of components, each passing audio to the next. This is good for live applications, but not so good for batch ones. You get into trouble when you need to split the pipe or merge it later. A batch application could benefit hugely from looking at the recording as a whole and returning to it multiple times. For example, you could estimate the noise level properly on the first pass and clean up the audio on the second. Such multipass decoding doesn't fit well into the pipe paradigm. On the other hand, changing the design to purely batch would create issues for live applications.

So we are in trouble. We probably have to invent a combined scheme, a hybrid of the pipe and batch approaches. I was thinking about a knowledge base scheme, where information about the stream is stored in some database as processing goes on. Database cleanup policies could emulate both the pipe approach (when the database is cleaned immediately) and the batch approach (when the database is kept, even across sessions). Festival utterances remind me of such a data processing scheme, by the way. Anyway, this idea is not finalized yet.
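To make the cleanup-policy idea a bit more concrete, here is a minimal sketch in Python. It is not sphinx4 code: `StreamStore`, `max_frames` and `estimate_noise_level` are names made up for illustration, and a real design would store much more than one energy value per frame.

```python
from collections import OrderedDict


class StreamStore:
    """Hypothetical per-stream store; not part of sphinx4."""

    def __init__(self, max_frames=None):
        # max_frames=1 emulates a pipe: only the newest frame survives.
        # max_frames=None emulates batch: nothing is ever cleaned up.
        self.max_frames = max_frames
        self.frames = OrderedDict()

    def put(self, index, energy):
        self.frames[index] = energy
        if self.max_frames is not None:
            # Cleanup policy: drop the oldest frames first.
            while len(self.frames) > self.max_frames:
                self.frames.popitem(last=False)

    def get(self, index):
        # Returns None when the frame was never stored or already cleaned up.
        return self.frames.get(index)


def estimate_noise_level(store):
    """Second-pass helper: take the quietest stored frame as the noise level."""
    return min(store.frames.values())


pipe = StreamStore(max_frames=1)
batch = StreamStore()
for i, energy in enumerate([0.2, 5.0, 4.0, 0.1]):
    pipe.put(i, energy)
    batch.put(i, energy)
```

With the pipe policy only the last frame is still available, so `pipe.get(0)` returns `None`. The batch store keeps the whole recording, so a second pass such as `estimate_noise_level(batch)` can look at every frame.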

We also expect to see a lot of movement from the CMUSphinx Workshop in Dallas and from Google Summer of Code participation. I hope the issues described above, and some more interesting ones, will be resolved by the next release in August. Let's discuss the rest then!

Saturday, February 27, 2010

Speech Recognition in GSoC Done Right

From year to year, many end-user projects try to push ASR with the help of Google and the students of the Summer of Code program. If the CMUSphinx team knows all about ASR, why should we stay away from that?

I have had varied experiences with Google Summer of Code before, but I still like the process and enjoy communicating with new people. I think we have a good chance to succeed here. So I started an application proposal and an initial list of ideas:

http://cmusphinx.sourceforge.net/wiki/summerofcode2010

I will submit this proposal on March 8, after the program starts. We need more ideas now, as many as you can generate:

http://cmusphinx.sourceforge.net/wiki/summerofcodeideas

We need a more or less representative list. If you want to be a mentor, don't hesitate to write down your IRC nick as well.

Friday, February 12, 2010

Noise reduction filtering in sphinx4

There is a huge gap between stock sphinx4 and a real ASR system, since critical parts like noise filtering, speaker diarization and postprocessing are missing, not to mention online adaptation. The default frontend is less than optimal for several reasons. For example, it doesn't handle DC offset at all, and it uses an energy-based endpointer in the time domain, which is not very robust to additive noise.

As of today, sphinx4 includes an implementation of a Wiener filter that reduces noise and helps the voice activity detector as well. To try it, check out the latest trunk and change the frontend pipeline as follows:

<item>audioFileDataSource </item>
<item>dataBlocker </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>wiener </item>
<item>speechClassifier </item>
<item>speechMarker </item>
<item>nonSpeechDataFilter </item>
<item>melFilterBank </item>
<item>dct </item>
<item>liveCMN </item>
<item>featureExtraction </item>

Then define the wiener component:

<component name="wiener"
type="edu.cmu.sphinx.frontend.endpoint.WienerFilter">
<property name="classifier" value="speechClassifier"/>
</component>

This frontend is robust to DC offset and also handles noise better. To try noisy input, you could mix in white noise with sox:

 sox 10001-90210-01803.wav noise.wav synth white
 sox noise.wav smallnoise.wav vol -45d
 sox -m 10001-90210-01803.wav smallnoise.wav 10001-90210-01803-noisy.wav

It would be nice to try it with the Aurora database as well.

This filter is very simple and has a number of disadvantages. For example, it sometimes corrupts the spectrum with harmonic noise and thus makes recognition even worse. But it definitely helps in the presence of noise. Let's hope that one day more sophisticated implementations, like the Ephraim-Malah filter or even noise reduction with vector Taylor series, will be made available in the default configurations.
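For readers who wonder what such a filter does on each frame, here is a rough sketch of the textbook Wiener gain in Python. It is not the sphinx4 implementation: the function names, the `floor` and `alpha` parameters, and the idea of updating the noise estimate only on frames the classifier marks as non-speech are assumptions made for illustration.

```python
def wiener_gains(power, noise_power, floor=0.05):
    """Per-bin gain G = S / (S + N), with S estimated as max(P - N, 0)."""
    gains = []
    for p, n in zip(power, noise_power):
        clean = max(p - n, 0.0)        # estimated clean-speech power
        denom = clean + n
        gain = clean / denom if denom > 0 else 1.0
        gains.append(max(gain, floor))  # the floor limits over-suppression
    return gains


def update_noise(noise_power, power, is_speech, alpha=0.9):
    """Smooth the noise estimate, but only on non-speech frames (assumption)."""
    if is_speech:
        return list(noise_power)
    return [alpha * n + (1.0 - alpha) * p for n, p in zip(noise_power, power)]


def apply_gains(power, gains):
    """Scale each bin of the power spectrum by its gain."""
    return [p * g for p, g in zip(power, gains)]


# One frame with two frequency bins: the first is well above the noise,
# the second is at the noise level and gets pushed down to the floor.
frame_power = [4.0, 1.0]
noise = [1.0, 1.0]
gains = wiener_gains(frame_power, noise)
filtered = apply_gains(frame_power, gains)
```

The gain floor is one simple way to reduce the artifacts mentioned above, since bins are never zeroed completely.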

Sunday, January 31, 2010

All ideas are already generated

After seeing Flash websites take an enormous amount of my CPU, I got a cool idea today about using Flash for distributed computing. Basically, everything is already in place. You set up a web server and share content with Flash; it runs on the client computer and does calculations, uploading the results from time to time. Certainly I wasn't the first to invent that; see for example

http://www.vershun.com/computers/hidden-flash-applications-as-distributed-computing-clients.html

and

http://www.csc.villanova.edu/~tway/courses/csc3990/f2009/csrs2009/Kevin_Berry_Grid_Computing_CSRS_2009.pdf

Such ideas are rather recent, though, and the question is how to make this framework widely used. Looking at the current load of the computer at sourceforge, it's most likely already used by some websites :)

Training process

What I really like about Sphinxtrain is that it provides a straightforward way to train an acoustic model. It remains unclear to me why everyone bothers with the HTKBook when there is a clean and easy way to train the model. One just has to define the dictionary and the transcription and put the files in the proper folder. Anyway, I keep thinking about how the sphinxtrain process could be improved. Currently it indeed lacks a lot of critical information on training, and that makes it look incomplete.

Basically here is what I would like to put into the next versions of sphinxtrain and sphinxtrain tutorial:

  1. Description of how to prepare the data
  2. Building the database transcription. By the way, what has bothered me for the last month is the requirement to have fileids. I really think the file with fileids could be silently dropped. What's the problem with getting the id of the file from the transcription labels?
  3. Automatic splitting into training data, testing data and development data. I see the presence of development data as a hard requirement for the training process. Unfortunately, the current documentation lacks it. There could be code to do that, but for most databases it's automatic, of course.
  4. Bootstrapping from hand-labelled data. I think of this as an important part of training, and HTK results confirm that. In general it repeats human language learning, so I think it's natural as well.
  5. Training
  6. Optimizing the number of senones and mixtures on a development set
  7. Optimizing the most important parameters, like the language weight, on the development set. This part is complicated as I see it. First of all, the reasoning behind proper language weight scaling is still unclear to me; I could one day write a separate post on it. Basically it depends on everything, even on the decoder.
  8. Testing on the test set
If it is possible to keep this as straightforward as it is now, that would be just perfect. Probably, if I start writing the chapter in a week, it could be ready by summer.
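Steps 2 and 3 above can be sketched in a few lines of Python. This is only an illustration, not sphinxtrain code: the function names are made up, and I assume each transcription line ends with the utterance id in parentheses, as in `<s> hello world </s> (utt_001)`.

```python
import random
import re


def fileids_from_transcription(lines):
    """Collect the id found in trailing parentheses on each transcription line."""
    ids = []
    for line in lines:
        match = re.search(r"\(([^()]+)\)\s*$", line)
        if match:
            ids.append(match.group(1))
    return ids


def split_ids(ids, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle deterministically, then cut into train, dev and test lists."""
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


transcription = [
    "<s> hello world </s> (utt_%03d)" % i for i in range(10)
]
ids = fileids_from_transcription(transcription)
train, dev, test = split_ids(ids)
```

A fixed seed keeps the split reproducible between runs, which matters if the development set is used to tune senones, mixtures and the language weight.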

Thursday, January 21, 2010

Moving Beyond the `Beads-On-A-String'

Recently I got interested in quite a large domain of speech recognition research where old-school linguistics meets modern speech recognition. Basically, the idea is that in spontaneous speech the variability is so huge that the phonetic transcription from the dictionary doesn't apply well. In a plain CMUSphinx setup, linguistic information about phones is almost lost: we don't care whether a phone is labial or dental. It is used in decision tree building, but it's not clear whether such usage helps. It's definitely not good to drop such a huge amount of information that could help with classification. So this idea is being actively developed, and you can probably find there everything you have been missing: distinctive phone features, landmarks, spectrogram recognition.

I went through the following articles; the number of methods, approaches and implementations described there is really huge. In other articles it's going to be even bigger:

S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007.

Moving Beyond the `Beads-On-A-String' Model of Speech by M. Ostendorf

Speaking In Shorthand - A Syllable-Centric Perspective For Understanding Pronunciation Variation by Steven Greenberg

To be honest, the only idea from the articles that has grown in my mind is that reductions in fast speech are the root of the problem. I also noticed this in the early days and experimented with skip states. Skips gave no improvement; they only slowed decoding down. It will probably help to automatically increase lexicon variability and use forced alignment to get the proper pronunciation, at least at the training stage. As I understand it, I just need to take a dictionary with syllabification and create a dictionary with a lot of reduced variants, where onsets are kept as is and codas are reduced in some form. Then we force align, then train. The acoustic model will probably be better then.
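The variant generation described above could look roughly like this. It is a sketch under assumptions: the input is already syllabified into (onset, nucleus, coda) phone lists, "reduced" is simplified to dropping the coda entirely, and the `WORD(2)` numbering for alternative pronunciations follows the usual CMUSphinx dictionary convention. The function names are made up.

```python
from itertools import product


def reduced_variants(syllables):
    """Return every pronunciation where each coda is either kept or dropped."""
    options = []
    for onset, nucleus, coda in syllables:
        full = list(onset) + list(nucleus) + list(coda)
        reduced = list(onset) + list(nucleus)  # onset is always kept as is
        options.append([full, reduced] if coda else [full])
    variants = []
    for choice in product(*options):
        variants.append([phone for syllable in choice for phone in syllable])
    return variants


def dictionary_lines(word, syllables):
    """Format variants as dictionary entries: WORD, WORD(2), WORD(3), ..."""
    lines = []
    for number, phones in enumerate(reduced_variants(syllables), start=1):
        label = word if number == 1 else "%s(%d)" % (word, number)
        lines.append("%s %s" % (label, " ".join(phones)))
    return lines


# W AA N . T AH D, with codas N and D
wanted = [(["W"], ["AA"], ["N"]), (["T"], ["AH"], ["D"])]
for line in dictionary_lines("WANTED", wanted):
    print(line)
```

The number of variants doubles with every coda, so for long words some pruning would be needed before forced alignment picks the pronunciation that was actually spoken.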

Another striking point is that I haven't found any significant accuracy improvement reported in the articles I read. An improvement like the 20% seen with discriminative training could make any method widely adopted, but nothing like that is mentioned. Probably this research is still at a very early stage.

Saturday, January 16, 2010

Three Generations of IVR Systems

Recently I invented a nice new concept for marketing people. Basically, there are three generations of IVR systems right now:
  • Generation 1.0 - Static systems based on VoiceXML. It was surprising to me that they are in wide use now and that a lot of products are dedicated to their optimization and development. There are IDEs, a lot of testing tools, and recommendations on how to build proper VoiceXML. Come on, it's impossible to do that. It's something like the static HTML websites that were popular in 1995. I don't believe any changes, like JavaScript inside VXML 3.0, will stop its slow death.
  • Generation 2.0 - Dynamic systems like Tropo from Voxeo. Much easier, much better. More control over content, more integration with the business logic. I really believe it's the next generation because it gives the developer much more control over the dialog. At least with the power of a real scripting language like Python, you'll be able to implement something non-trivial in just several lines of code. That's the AJAX or RoR of the speech world.
  • Generation 3.0 - Semantic-based IVR. This consists of three components: a large vocabulary recognizer, a semantic recognizer on top of it, and event-based actions on top of that. Probably also emotion recognition and more intelligent dialog tracking. As I see it, the developer has to define the structure of the dialog and provide handlers. Such a system was described and developed at CMU a long time ago, and it's also described in all ASR textbooks. But I'm not aware of any widely known platform that allows this kind of IVR. Once again, it shows how big the gap is between academia and software developers.
If you are planning to create an IVR application with CMUSphinx, please consider IVR generation 3 as your base technology ;) And don't forget to share the code.
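To illustrate what "define the structure of the dialog and provide handlers" could mean for a generation 3 system, here is a toy sketch in Python. Everything in it is hypothetical: `toy_semantic_parse` is a keyword spotter standing in for a real semantic recognizer, and the text it receives stands in for the output of a large vocabulary recognizer.

```python
def toy_semantic_parse(text):
    """Stand-in for a semantic recognizer: keyword spotting only."""
    words = text.lower().split()
    if "balance" in words:
        return "check_balance", {}
    if "transfer" in words:
        amounts = [word for word in words if word.isdigit()]
        return "transfer", {"amount": int(amounts[0]) if amounts else None}
    return "unknown", {}


class Dialog:
    """Maps recognized intents to developer-provided handlers."""

    def __init__(self):
        self.handlers = {}

    def on(self, intent):
        # Decorator that registers a handler for one intent.
        def register(func):
            self.handlers[intent] = func
            return func
        return register

    def handle(self, text):
        intent, slots = toy_semantic_parse(text)
        handler = self.handlers.get(intent) or self.handlers.get("unknown")
        return handler(slots) if handler else None


dialog = Dialog()


@dialog.on("check_balance")
def say_balance(slots):
    return "Looking up your balance"


@dialog.on("transfer")
def do_transfer(slots):
    return "Transferring %s" % slots["amount"]


@dialog.on("unknown")
def fallback(slots):
    return "Sorry, could you rephrase that?"


reply = dialog.handle("please transfer 50 dollars")
```

The point of the design is that the developer writes only the handlers; free-form recognition and intent detection sit below them instead of a fixed grammar per prompt.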

Update:

Very much on the same topic from a wonderful Nu Echo blog:

http://blog.nuecho.com/2010/01/25/voice-apis-back-to-basics/

About Me

nshmyrev
Moscow, Russia
Nowadays I mostly work on open source projects in speech recognition and synthesis like Festival, CMU Sphinx and Voxforge. I also support the Russian parts of those projects, providing the leading product in ASR and TTS in Russian. In the past I used to participate in GNOME, work on embedded Linux devices and on software development technologies related to automatic software verification and modelling. If you have any questions feel free to contact me by mail nshmyrev at nexiwave dot com or find me in jabber/irc.