Magic Words of Interspeech 2011

Interspeech 2011 is coming, and I suppose it is going to be an amazing event. If you are interested in what is going on there, let's figure it out.

To keep things simple we will use Unix command-line tools. Text processing can be fun even with simple commands. Text is still the most convenient form of presenting information, way better than HTML or databases. Of course, more advanced things like stopword filtering or named entity recognition are missing. Let's hope the Unix command line will have them one day.

1. Download the full printable programs of Interspeech 2010 and Interspeech 2011 with wget, dump them to text with lynx, and clean up punctuation with sed.

2. Dump word counts with the SRILM tool ngram-count and cut the 2011 list to the 1000 most frequent words with sort and head. Keep all the words in the 2010 list.

3. Figure out with sort and uniq which words in the 2011 list are new and do not appear in the 2010 list. A sketch of the whole pipeline is shown below.
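
Here is a rough sketch of the whole pipeline. The program URLs are placeholders (grab the actual printable program pages from the conference sites), and the exact sed cleanup depends on how those pages are formatted:

    # Placeholder URLs - point these at the real printable program pages
    wget -O is2010.html 'http://www.example.org/interspeech2010-program.html'
    wget -O is2011.html 'http://www.example.org/interspeech2011-program.html'

    # Dump to plain text, strip punctuation, lowercase everything
    for year in 2010 2011; do
        lynx -dump -nolist is$year.html | \
            sed 's/[[:punct:]]/ /g' | tr 'A-Z' 'a-z' > is$year.txt
    done

    # Word counts with SRILM; keep the 1000 most frequent 2011 words
    # and the full 2010 word list
    ngram-count -order 1 -text is2011.txt -write is2011.counts
    ngram-count -order 1 -text is2010.txt -write is2010.counts
    sort -k2,2nr is2011.counts | head -1000 | awk '{print $1}' | sort > top2011.list
    awk '{print $1}' is2010.counts | sort > all2010.list

    # Words in the 2011 list that never appear in 2010: listing the 2010
    # words twice means anything shared occurs more than once, so
    # "uniq -u" keeps only the 2011-only words
    sort top2011.list all2010.list all2010.list | uniq -u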

Surprisingly, there are only two new words: i-vector and crowdsourcing.

When Language Models Fail

Language modeling still has many challenging problems.


Comic by Jim Benton

Decoders And Features

CMUSphinx decoders at a glance, so one can compare them. The table is incomplete and imprecise, of course.




Feature                          sphinx2  sphinx3  sphinx4  pocketsphinx
Acoustic Lookahead                  -        -        +        -
Alignment                           +        +        +        -
Flat Forward Search                 +        +        -        +
Finite Grammar Confidence           +        -        -        -
Full n-gram History Tree Search     -        -        +        -
HTK Features                        -        +        +        +
Phonetic Loop Decoder               +        +        -        -
Phonetic Lookahead                  +        +        -        +
PLP Features                        -        -        +        -
PTM Models                          -        -        -        +
Score Quantization                  +        -        -        +
Semi-Continuous Models              +        +        -        +
Single Tree Search                  +        -        -        +
Subvector Quantization              +        +        -        +
Time-Switching Tree Search          -        +        -        -
Tree Search Smear                   -        +        +        -
Word-Switching Tree Search          -        +        -        -
Thread Safety                       -        -        +        +
Keyword Spotting                    -        -        +        -

And here are the descriptions of the entries.

Specific Applications

Phonetic Loop Decoder. Phonetic loop decoding requires a specialized search algorithm. It's not implemented in Sphinx4, for example.

Alignment. Given the audio and its transcription, get the word timings.

Keyword Spotting. Searching for a keyword requires a separate search space and a different search approach.

Finite Grammar Confidence. Get a confidence estimate for a finite state grammar. This is a complex problem which requires additional operations during the search, for example a phone loop pass.

Effective Pruning

Acoustic Lookahead. Using the acoustic score for the current frame we can predict the score for the next frame and thus prune tokens early.

Phonetic Lookahead. Using a phonetic loop decoder we can predict the possible phones and thus restrict the large-vocabulary search.

Features

HTK Features. CMUSphinx feature extraction differs from HTK's (different filterbank and transform). To support HTK models one needs specific HTK feature extraction.

PLP Features. A type of features different from traditional MFCC, and more popular these days.

Search Space

Flat Forward Search. A search space in which word paths aren't joined into a lextree. Separate paths let us apply the language model probability earlier, so the search is more accurate. But because the search space is bigger, it's also slower. Usually flat search is applied as a second pass after tree search.

Full n-gram History Tree Search. Tokens with different n-gram histories are tracked separately. For example, the token for "how are UW..." and the token for "hello are UW..." are tracked separately. In pocketsphinx such tokens are simply merged and only the best one survives. Full-history search is more accurate but slower and more complex to implement.

Word-Switching Tree Search. Separate lextrees are kept for each unigram history. This search is a middle ground between keeping the full history and dropping the history altogether.

Single Tree Search. Lextree tokens don't track word history at all. This is a faster but less accurate approach.

Time-Switching Tree Search. Lextree states don't track word history, but several lextrees (3-5) are kept in memory. In this time-switching approach the lextrees are switched every frame, so there is a higher chance of tracking several histories.

Tree Search Smear. The lextree nodes carry unigram probabilities, so it's possible to prune tokens earlier based on the language score.

Acoustic Scoring

PTM Models. Models in which Gaussians are shared across senones with the same central phone. So we don't need to compute Gaussian values for every senone, just a few values for each central phone; then, using different mixture weights, we get the senone scores. This approach reduces the required computation but keeps accuracy at a reasonable level. It's similar to semi-continuous models, where Gaussians are shared across all senones, not just across senones with the same central phone.

Score Quantization. In some cases (semi-continuous models and a specific feature set) acoustic scores can be represented by just 2 bytes. Scores are usually kept in the log domain and shifted by 10 bits. This reduces the memory required for the acoustic model and for scoring, and speeds up computation, in particular on CPUs without an FPU.

Semi-Continuous Models. Gaussians are shared across all senones; only the mixture weights differ. Such models are fast and usually quite accurate. They are usually multistream (s2_4x, or 1s_c_d_dd with subvectors 0-12/13-25/26-38), since separate streams can be quantized better.

Subvector Quantization. A Gaussian selection approach to reduce acoustic scoring. Basically, after training, the continuous model is decomposed into several subvector Gaussians which are shared across senones and thus scored efficiently.


Cars Controlled By Speech

Being a speech recognition guy, I'm looking for a car with speech recognition included. It sounds strange to select a car just because of that, but I'm only half joking. So far the list is:

  • Honda Accord
  • Any Ford 2011
  • Mazda 6
I'm not listing expensive ones like BMW or Mercedes. Hm, it looks like almost everyone is doing it. Any others? Which is the most advanced one?

Some details on particular implementations

Ford SYNC 2011

Quite an advanced system. Command-based. Supports many types of commands, from controlling the DVD player to getting baseball scores. Supports user profiles, but it doesn't seem to have a specific training procedure. With current speaker recognition capabilities it could in theory adapt to users automatically, without profiles.

Mazda 6 2011

Pretty interesting system, but limited compared to the previous one. According to the owner's manual it supports a very limited list of commands to manage calls, get incoming messages, and so on. Among the interesting capabilities, it supports training and voice entry for contacts. Three languages: English, French, Spanish. It looks like it uses a single microphone, and the voice navigation system appears to have a separate speech recognition subsystem.
Honda Fit 2009

Many commands, mostly related to navigation, but no user adaptation and no profiles. Alphanumeric entry as a backup to the vocabulary search. This one is very simple.


Mitsubishi/Hyundai 2011

I didn't manage to find the manuals for these. The feature name "Bluetooth hands-free phone system with voice recognition and phonebook download" makes me think it's the same system as in the Mazda.


Talkmatic

It doesn't seem to be deployed yet, but the presentation looks impressive.

KIA

According to SpeechTechMag, Microsoft and Kia codeveloped the UVO multimedia and infotainment system, which the Korean automaker rolled out in its new Sportage, Sorento, and Optima models late last year. UVO lets users access media content and connect with people through quick voice commands without having to navigate hierarchical menus.


ICASSP 2011 Part 1 - Thoughts

It seems ICASSP was a great event this year; it is a pity I missed it. Just comparing the keynote lists, ICASSP beats Interspeech 4:0. ICASSP is very technical, Interspeech is for linguists. Compare the two:

Making Sense of a Zettabyte World vs Neural Representations of Word Meanings


New session formats like technical tracks and trend discussions are interesting, though I am not sure how they felt in practice.

So this was a reason to spend a few days reading. 1000 papers on speech technology! Huh. Thanks to all the authors for their hard work! Well, I did find several duplicates in the end.

The main thing I noted is that the research topics are very sparse, for example:
• Everyone does speaker recognition. An appealing problem statement here is detecting a synthetic speaker. The paper "DETECTION OF SYNTHETIC SPEECH FOR THE PROBLEM OF IMPOSTURE" by De Leon et al. hints that there is no solution for that yet.
• I got tired of skipping pursuits, bandits and compressive sensing.
• On the other hand, the increased share of papers on non-speech signals, the cocktail party problem and signal recovery is very interesting to read.
• Things like DBN features or the SCARF decoder are widely represented. You can read about applications of CRFs to everything from g2p algorithms to dialogs. But traditional things like search algorithms and adaptation are almost not covered.
• It was surprising to find a session dedicated to multimedia security, which must be a gold mine of ideas, in particular if you need a topic for a paper. Is there a company selling such products?
Overall I found several original problem statements as well as inspiring ideas covering very important technology issues. For example, it would be nice to implement a meeting transcription application with several iPhones, combining the streams and later transcribing them with multichannel environment compensation. Several meeting transcription setups and channel separation methods are described in the conference proceedings.

After reading a fair number of papers I found that conference papers are too short. When you see a nice title and an abstract, you expect a detailed insight into the problem, with historical discourse, everything explained in detail, a deep investigation. But you get just a description of the technology and a few figures from experiments. On the other hand, I would not be able to read 100 papers of 20 pages each.

It is very interesting that this year's awards are not related to speech technology. That will be the content of Part 2. I just need to go through the last 50 papers.


Chicken-And-Egg in Sphinxbase

Recently Shea Levy pointed me to an issue with verbose output during pocketsphinx initialization. Basically, every time you start pocketsphinx you get something like


    INFO: cmd_ln.c(691): Parsing command line:
    pocketsphinx_continuous 
    Current configuration:
    [NAME] [DEFLT] [VALUE]
    -adcdev
    -agc none none
    -agcthresh 2.0 2.000000e+00
    -alpha 0.97 9.700000e-01
    -argfile

It's OK for a tool, but not a nice thing for a library, which should be a small horse in the application's rig. Not every user is happy to see all this stuff dumped on the screen. And the worst thing is that there is no way to turn it off, because "-logfn /dev/null" only affects output after initialization. So we are looking to make pocketsphinx completely silent.
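
For illustration, this is the kind of invocation people try; it doesn't help, because the dump is printed while the command line is still being parsed:

    # Attempted workaround: -logfn only takes effect after the configuration
    # has been parsed, so the dump above still shows up on the screen.
    pocketsphinx_continuous -logfn /dev/null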

It appeared to be a more complex issue than I thought. It's a classical chicken-and-egg issue: you use the configuration framework to configure logging, but the configuration framework needs to log itself. We just hardcoded the initialization, but thinking about it afterwards I found a more complex yet more rigid approach in the log4j description at http://articles.qos.ch/internalLogging.html:

    Since log4j never sets up a configuration without explicit input from the user, log4j internal logging may occur before the log4j environment is set up. In particular, internal logging may occur while a configurator is processing a configuration file.

    We could have simplified things by ignoring logging events generated during the configuration phase. However, the events generated during the configuration phase contain information useful in debugging the log4j configuration file. Under many circumstances, this information is considered more useful than all the subsequent logging events put together.

    In order to capture the logs generated during configuration phase, log4j simply collects logging events in a temporary appender. At the end of the configuration phase, these recorded events are replayed within the context of the new log4j environment, (the one which was just configured). The temporary appender is then closed and detached from the log4j environment.

Oh well, I will never muster enough passion to implement this properly ;) Let it be as is for now.

Sphinxbase command-line options are still not good. I pretty much lack a proper --help, --version and many more nifty getopt things. One day someone should do this.

Voicemail transcription with Pocketsphinx and Asterisk (Part 2)

This is the second part describing voicemail transcription for Asterisk administrators. See the previous part, which describes how to set up Pocketsphinx, here.

So you have configured the recognizer to transcribe voicemails and now want to improve the recognition accuracy. Honestly, I can tell you that you will not get perfect transcription results for free unless you send your voicemails to some human-assisted transcription company. You will not get them from Google either, though there are several commercial services to try, like Yap or Phonetag, which specialize in voicemails. Our proprietary Nexiwave technology, for example, uses way more advanced algorithms and way bigger speech databases than those distributed with Pocketsphinx, and the difference is really visible.

However, even the results you can get with Pocketsphinx can be very usable for you. I estimate you can easily get 80-90% accuracy with little effort, provided the language of your voicemails is simple.


CMUSphinx accepted at Google Summer Of Code 2011

So we are in. Great to know that.

For more information see

http://cmusphinx.sourceforge.net/2011/03/cmusphinx-at-gsoc-2011/

I think it's a big responsibility and a big opportunity as well. Of course we don't consider this a way to improve CMUSphinx itself or something that will allow us to get features coded for free. Instead, we are looking for new people to join CMUSphinx and become part of it. Maybe it's a great opportunity for Nexiwave as well.

For now, the task is to prepare the list of project ideas. I know they need to be drafted carefully. If you want to help, please jump in; I definitely need some help.

Fillers in WFST

Another practical question is: how do you integrate fillers? A silence class is introduced in

A GENERALIZED CONSTRUCTION OF INTEGRATED SPEECH RECOGNITION TRANSDUCERS by Cyril Allauzen, Mehryar Mohri, Michael Riley and Brian Roark

and implemented in transducersaurus.

But you know, every practical model has more than just silence. Fillers like noise, silence, breath and laughter all go to specific senones in the model. I usually try to minimize them during training, for example by joining all the ums, hmms and mhms into a single phone, but I still think they are needed. How do you integrate them when you build a WFST recognizer?

So I tried a few approaches. For example, instead of adding just a <sil> class in the T transducer, I tried to create a branch for each filler. As a result the final cascade expands into a huge monster: if the cascade was 50 MB, after combination with one silence class it is 100 MB, but after 3-4 classes it's 300 MB. Not a nice thing to do.

So I ended up with dynamic expansion of silence transitions, like this:

    if edge is silence:
        for filler in fillers:
            from_node.add_edge(filler)
    

This seems to work well.

Word position context dependency of Sphinxtrain and WFST

An interesting thing about Sphinxtrain models is that they use word position as context when looking up the senone for a particular triphone. That means that in theory the senones for word-initial phones can differ from the senones for word-internal and word-final phones. It's actually sometimes the case:

    ZH  UW  ER b    n/a   48   4141   4143   4146 N
    ZH  UW  ER e    n/a   48   4141   4143   4146 N
    ZH  UW  ER i    n/a   48   4141   4143   4146 N
    

but

    AA  AE   F b    n/a    9    156    184    221 N
    AA  AE   F s    n/a    9    149    184    221 N
    

Here, in the WSJ model definition from sphinx4, the symbol in the fourth column means "beginning", "end", "internal" or "single", and the numbers that follow are the transition matrix id and the senone ids.

However, if you want to build a WFST cascade from the model, it's kind of an issue how to embed the word position into the context-dependent part of the cascade. My solution was to ignore the position. You can ignore the position in an already-built model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.

As of today you can do this easily: the mk_mdef_gen tool supports an -ignorewpos option which you can set in the scripts. Basically everything is counted as an internal triphone. My tests show that such a model is not worse than the original one, at least for conversational speech. Enjoy. A rough way to check the result is sketched below.
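
As a quick sanity check you can count the word-position markers in the fourth column of the resulting model definition; the file name in this sketch is just a placeholder for your trained mdef:

    # Skip the comment header and count the fourth (word position) column:
    # '-' marks context-independent phones; with -ignorewpos all triphones
    # should come out as 'i' (internal).
    grep -v '^#' your_model.mdef | awk 'NF > 5 { print $4 }' | sort | uniq -c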

P.S. Want to learn more about WFSTs? Read Paul Dixon's blog http://edobashira.com and Josef Novak's blog http://probablekettle.wordpress.com

OpenFst troubleshooting

A bit of OpenFst troubleshooting when you try to build a WFST with Juicer. Say you are running


    fstcompose ${OUTLEXBFSM} ${OUTGRAMBFSM} | \
    fstepsnormalize | \
    fstdeterminize | \
    fstencode --encode_labels - $CODEX | \
    fstminimize - | \
    fstencode --decode - $CODEX | \
    fstpush --push_weights | \
    fstarcsort

and get this


    FATAL: StringWeight::Plus: unequal arguments (non-functional FST?)


Huh? Which arguments are not equal? What caused this? How do you fix it? It definitely should be more self-explanatory. That's basically quite a common issue: you get just a short message that nobody, including the author, can understand. Go figure out how to fix it.



Looking on the waves

Here is the question: a perfectly fine-looking sound file which is transcribed with 10% accuracy. Sounds crazy, doesn't it? No noise, no accent.



Because of that I'm looking at the state of the art in channel normalization, especially for non-linear channel distortions. No good solution yet; I've only found a description of the problem in a very old paper.



There is CDCN normalization, a few CMN improvements, RASTA and even the recently invented HN normalization. CDCN is surprisingly available in Sphinxtrain, but nobody uses it. Well, it gives no improvement, but it's an interesting approach worth documenting one day. The idea of collecting statistics from the speech and applying them later sounds nice.

There are model-level approaches, various feature transforms and adaptations. They do not really look that attractive. Most papers now deal with channel compensation for speaker recognition, not speech recognition. I must admit the topic is too large to survey in a few weeks.

Luckily, I can also spend time looking at waves like the one on the right. Somewhat more pleasant, I would say.


Some more optimization

In addition to the previous post, here are two more tricks for log_diag_eval.

Floats instead of doubles

If the accumulator is a float, SSE can be used more effectively.

Hardcode the vector length

The most common optimization is loop unrolling. It helps to optimize memory access and eliminates jump instructions. But the issue here is that the number of iterations in log_diag_eval can differ between stages. GCC has an interesting profile-based optimization for this case, see the -fprofile-generate option: it runs the program and then derives a few specific optimizations from the runtime profile. The good thing is that we can be almost sure about the usage pattern of our target loop, so we can optimize without profiling. So, turn


    for (i=0;i<veclen;i++) {
       do work
    }


    to

    if (veclen == 40) { // Commonly used value, 40 floats in each frame
        for (i=0;i<40;i++) {
            do work // This will be unrolled
        }
    } else {
        for (i=0;i<veclen;i++) {
            do work
        }
    }



GCC does the same trick with the profiler, but since our feature frame size is fixed, we can hardcode it. As a result GCC will unroll the first loop and it will be fast as the wind.

Optimization in SphinxTrain

I spend quite a significant amount of time training various models. It feels like alchemy: you add this, tune that, and you get nice results. And while training you can read Twitter ;) I have also spent 10 years in a group which creates optimizing compilers, so in theory I should know a lot about them. I rarely apply that in practice, though. But being bored by several weeks of training, you can apply some of that knowledge here.