I recently attended the ICASSP 2012 conference in Kyoto, Japan. As expected, it was an amazing experience. Many thanks to the organizers, the technical program committee and the reviewers for their hard work.
The conference gathered more than a thousand experts in signal processing and speech recognition. More than 2000 papers were submitted and more than 1300 of them were presented. That is an enormous amount of information to process, and it was really helpful to be there and see everything for yourself. Most importantly, of course, it's an opportunity to meet the people you work with remotely and talk about speech recognition in person. We talked quite a lot about the Google Summer of Code project we will run soon. You can expect very interesting features to be implemented there. It's so helpful to map virtual characters to real people.
Given the number of papers, I think it's critically important to summarize the material, or at least to provide some overview of the results presented. I hope that future organizers will fill that gap. For now, here is a short list of papers and topics I found interesting this year.
Plenary talks

First of all, I very much liked the two plenary sessions I attended. The talk by Dr. Chin-Hui Lee was about better acoustic modeling tools. Though neural networks don't seem to provide good accuracy, the main idea was that without a good acoustic model you cannot get good accuracy. Unfortunately, the only problem with all approaches like this is that they are evaluated on the carefully prepared TIMIT database, that is, on perfectly clean speech. Everything becomes completely different when you move to the spontaneous, noisy speech we usually work with in practical tasks.
The second talk, by Dr. Stephane Mallat, was about mathematical ideas in machine learning and recognition tasks. Though not directly related to speech, it covered wavelets and mathematical invariants. If properly developed, such a theory could build a very good foundation for accurate and, most importantly, provably optimal speech recognition.
Discriminative language models

One new thing for me was several papers on discriminatively trained language models. It seems that most of this work is done within a neural network framework, but I think it could be generalized to training an arbitrary G-level WFST.
DISTRIBUTED DISCRIMINATIVE LANGUAGE MODELS FOR GOOGLE VOICE-SEARCH
Preethi Jyothi, The Ohio State University, United States; Leif Johnson, The University of Texas at Austin, United States; Ciprian Chelba, Brian Strope, Google, Inc., United States
Big Data

This last paper also belongs to the recently emerging big data trend, which was quite well represented at the conference. It is an analog of nuclear physics in our field, requiring big investments and huge teams. It still seems to be in a very early state, but it must become a very hot topic in the coming years, mostly led by the Google team. Well, you can't expect anything else from Google. Another paper from them is also about things that are hard to imagine.
DISTRIBUTED ACOUSTIC MODELING WITH BACK-OFF N-GRAMS
Ciprian Chelba, Peng Xu, Fernando Pereira, Google, Inc., United States; Thomas Richardson, University of Washington, United States
So far Google trains on 87 thousand hours. Imagine that. It hasn't helped them much yet: they reduced the word error rate from 11% to around 9.5%.
A big data paper from CMU is interesting too, describing a speedup method for discriminative training on large amounts of speech data:
TOWARDS SINGLE PASS DISCRIMINATIVE TRAINING FOR SPEECH RECOGNITION
Roger Hsiao, Tanja Schultz
Importantly, the big data idea turns into the idea that the acoustic and language models are equivalent and should be trained together. A paper about that:
OPTIMIZATION IN SPEECH-CENTRIC INFORMATION PROCESSING: CRITERIA AND TECHNIQUES
Xiaodong He, Li Deng, Microsoft Research, United States
We could even go further and state that noise parameters and accurate transcriptions are also part of the training model, so we need to train them jointly. Some papers on that subject:
SEMI-SUPERVISED LEARNING HELPS IN SOUND EVENT CLASSIFICATION
Zixing Zhang, Björn Schuller, Technische Universität München, Germany
N-BEST ENTROPY BASED DATA SELECTION FOR ACOUSTIC MODELING
Nobuyasu Itoh, IBM Research - Tokyo, Japan; Tara N. Sainath, IBM T.J. Watson Research Center, United States; Dan Ning Jiang, Jie Zhou, IBM Research - China, China; Bhuvana Ramabhadran, IBM T.J. Watson Research Center, United States
Efficient decoders

If you are interested in efficient decoders, the session on LVCSR was very interesting. I'd note the following papers:
EXTENDED SEARCH SPACE PRUNING IN LVCSR
David Nolden, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany
USING A* FOR THE PARALLELIZATION OF SPEECH RECOGNITION SYSTEMS
Patrick Cardinal, Gilles Boulianne, CRIM, Canada; Pierre Dumouchel, ETS, Canada
The idea is to use a fast WFST pass to estimate heuristic scores for A*.
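Just to illustrate the general idea (this is my own toy sketch, not the system from the paper): a cheap backward pass over the search graph yields a cost-to-go for every state, which then serves as the admissible heuristic for a forward A* search. The lattice, weights and function names below are all invented for the example.

```python
import heapq

def backward_costs(graph, goal):
    """Cheapest remaining cost from every node to the goal,
    computed with Dijkstra on the reversed graph (the 'fast pass')."""
    rev = {}
    for u, edges in graph.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    dist = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in rev.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def astar(graph, start, goal, h):
    """A* search: expand nodes ordered by path cost so far + heuristic h."""
    heap = [(h.get(start, float("inf")), 0.0, start, [start])]
    seen = set()
    while heap:
        f, g, u, path = heapq.heappop(heap)
        if u == goal:
            return g, path
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph.get(u, []):
            if v not in seen:
                heapq.heappush(heap,
                               (g + w + h.get(v, float("inf")), g + w, v, path + [v]))
    return None

# Toy lattice: nodes are decoder states, weights are negative log scores.
lattice = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("c", 2.0), ("e", 6.0)],
    "b": [("c", 1.0)],
    "c": [("e", 3.0)],
    "e": [],
}
h = backward_costs(lattice, "e")    # heuristic from the cheap backward pass
print(astar(lattice, "s", "e", h))  # -> (6.0, ['s', 'a', 'c', 'e'])
```

Because the heuristic here is the exact cost-to-go, A* expands only nodes on the best path; with an approximate but admissible heuristic from a simplified WFST it would expand a few more, which is the trade-off the paper exploits for parallelization.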
JOINING ADVANTAGES OF WORD-CONDITIONED AND TOKEN-PASSING DECODING
David Nolden, David Rybach, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany
DBN

There were quite a lot of DBN papers, but I'm not very interested in them. Microsoft trains DBNs on the RT03 task and gets pretty good results: 19% WER compared to a 25-27% baseline:
EXPLOITING SPARSENESS IN DEEP NEURAL NETWORKS FOR LARGE VOCABULARY SPEECH RECOGNITION
Dong Yu, Microsoft Research, United States; Frank Seide, Gang Li, Microsoft Research Asia, China; Li Deng, Microsoft Research, United States
Recurrent neural networks are also good:
REVISITING RECURRENT NEURAL NETWORKS FOR ROBUST ASR
Oriol Vinyals, Suman Ravuri, University of California at Berkeley, United States; Daniel Povey, Microsoft Research, United States
Weighted Finite State Transducers

The whole WFST session was great. In particular, I very much liked the papers on fillers in WFSTs, as well as the last AT&T paper on uniform discriminative training with WFSTs, which gives some insight into the internals of the AT&T recognizer.
SILENCE IS GOLDEN: MODELING NON-SPEECH EVENTS IN WFST-BASED DYNAMIC NETWORK DECODERS
David Rybach, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany
A GENERAL DISCRIMINATIVE TRAINING ALGORITHM FOR SPEECH RECOGNITION USING WEIGHTED FINITE-STATE TRANSDUCERS
Yong Zhao, Georgia Institute of Technology, United States; Andrej Ljolje, Diamantino Caseiro, AT&T Labs-Research, United States; Biing-Hwang (Fred) Juang, Georgia Institute of Technology, United States
Robust ASR

The robust ASR session was striking. PNCC features seem to perform better than everything else. During their talks, all the other authors were plainly saying that their method is good, but PNCC is better. Congratulations to Rich Stern, Chanwoo Kim and the others involved.
POWER-NORMALIZED CEPSTRAL COEFFICIENTS (PNCC) FOR ROBUST SPEECH RECOGNITION
Chanwoo Kim, Microsoft Corporation, United States; Richard Stern, Carnegie Mellon University, United States
Corporations at ICASSP

It's great to see more involvement of industry in the research process. I think it's great that major industry players contribute their knowledge to the open shared pool. Honestly, academic activities need more influence from industry too.
Check out a paper from a recently emerged speech company, Apple Inc.:
LATENT PERCEPTUAL MAPPING WITH DATA-DRIVEN VARIABLE-LENGTH ACOUSTIC UNITS FOR TEMPLATE-BASED SPEECH RECOGNITION
Shiva Sundaram, Deutsche Telekom Laboratories, Germany; Jerome Bellegarda, Apple Inc., United States
And another one from a great small company, EnglishCentral.com:
DISCRIMINATIVE TRAINING FOR SPEECH RECOGNITION IS COMPENSATING FOR STATISTICAL DEPENDENCE IN THE HMM FRAMEWORK
Dan Gillick, Steven Wegmann, International Computer Science Institute, United States; Larry Gillick, EnglishCentral, Inc., United States
CRF for Confidence Estimation

Half of the confidence papers dealt with CRFs. It's actually a nice idea to exploit the fact that a low-confidence region usually spans multiple words.
CRF-BASED CONFIDENCE MEASURES OF RECOGNIZED CANDIDATES FOR LATTICE-BASED AUDIO INDEXING
Zhijian Ou, Huaqing Luo, Tsinghua University, China
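To illustrate why sequence modeling helps here, below is a toy sketch of my own (not the model from the paper). Instead of a full CRF it runs Viterbi over a two-state chain (0 = correct, 1 = error) with a sticky transition prior, so an isolated dip in per-word confidence gets absorbed into the surrounding confident run, while a sustained low-confidence stretch is flagged as a single error region. The scores and the `stay` parameter are invented for the example.

```python
import math

def smooth_confidence(scores, stay=0.8):
    """Viterbi over a 2-state chain (0 = correct, 1 = error).
    `scores` are per-word probabilities of being correct; a sticky
    transition (stay > 0.5) encourages runs of the same label."""
    switch = 1.0 - stay

    def emit(p, state):
        # Emission log-probability of a word with confidence p in each state.
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        return math.log(p if state == 0 else 1.0 - p)

    n = len(scores)
    v = [[emit(scores[0], s) for s in (0, 1)]]  # Viterbi scores
    back = []                                    # backpointers
    for t in range(1, n):
        row, ptr = [], []
        for s in (0, 1):
            best, arg = max(
                (v[-1][ps] + math.log(stay if ps == s else switch), ps)
                for ps in (0, 1))
            row.append(best + emit(scores[t], s))
            ptr.append(arg)
        v.append(row)
        back.append(ptr)
    state = 0 if v[-1][0] >= v[-1][1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# One isolated dip (0.4) is absorbed into the confident run, while the
# sustained low-confidence stretch (0.2, 0.3, 0.25) becomes one error region.
print(smooth_confidence([0.9, 0.4, 0.9, 0.2, 0.3, 0.25, 0.9]))
# -> [0, 0, 0, 1, 1, 1, 0]
```

A real CRF would add word-level features (duration, acoustic score, lattice density) on top of exactly this kind of chain structure, which is what makes it a natural fit for confidence estimation.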
Generic models for ASR

This is a paper I liked about the long-dreamed-of inclusion of syllables into the speech recognition model:
SYLLABLE: A SELF-CONTAINED UNIT TO MODEL PRONUNCIATION VARIATION
Raymond W. M. Ng, Keikichi Hirose, The University of Tokyo, Japan
I hope this research will go mainstream one day.