nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies: recognition, synthesis, identification. Mostly it's about the scientific part of it, the core design of the engines, the new methods, machine learning, and about the technical part, like the architecture of the recognizer and the design decisions behind it.

Do you trust speech transcription in the cloud

Selected Papers Interspeech 2019 Wednesday

A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
Cool merge graphs

Detection and Recovery of OOVs for Improved English Broadcast News Captioning
Samuel Thomas (IBM Research AI), Kartik Audhkhasi (IBM Research AI), Zoltan Tuske (IBM Research AI), Yinghui Huang (IBM Research AI), Michael Picheny (IBM Research AI)
Nothing new but still important

Disfluencies and Human Speech Transcription Errors
Vicky Zayats (University of Washington), Trang Tran (University of Washington), Courtney Mansfield (University of Washington), Richard Wright (University of Washington), Mari Ostendorf (University of Washington)

Robust Sound Recognition: A Neuromorphic Approach
Jibin Wu (National University of Singapore), Zihan Pan, Malu Zhang, Rohan Kumar Das, Yansong Chua, Haizhou Li
Spiking neural networks

Neural Named Entity Recognition from Subword Units
Abdalghani Abujabal (Max Planck Institute for Informatics), Judith Gaspers (Amazon)
Name recognition is still important

Unsupervised Acoustic Segmentation and Clustering using Siamese Network Embeddings
Saurabhchand Bhati (The Johns Hopkins University), Shekhar Nayak (Indian Institute of Technology Hyderabad), Sri Rama Murty Kodukula (IIT Hyderabad), Najim Dehak (Johns Hopkins University)

Acoustic Model Bootstrapping Using Semi-Supervised Learning
Langzhou Chen (Amazon Cambridge office), Volker Leutnant (Amazon Aachen office)

Bandwidth Embeddings for Mixed-bandwidth Speech Recognition
Gautam Mantena (Apple Inc.), Ozlem Kalinli (Apple Inc), Ossama Abdel-Hamid (Apple Inc), Don McAllaster (Apple Inc)

Towards Debugging Deep Neural Networks by Generating Speech Utterances
Bilal Soomro (University of Eastern Finland), Anssi Kanervisto (University of Eastern Finland), Trung Ngo Trong (University of Eastern Finland), Ville Hautamaki (University of Eastern Finland)
Debugging is a very nice idea

A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction
Che-Wei Huang (Amazon), Roland Maas (Amazon.com), Sri Harish Mallidi (Amazon, USA), Bjorn Hoffmeister (Amazon.com)
Nice idea, we covered that before

Deep Learning for Orca Call Type Identification — A Fully Unsupervised Approach
Christian Bergler, Manuel Schmitt, Rachael Xi Cheng, Andreas Maier, Volker Barth, Elmar Nöth
Kinda cool

The STC ASR System for the VOiCES from a Distance Challenge 2019
Ivan Medennikov (STC-innovations Ltd), Yuri Khokhlov (STC-innovations Ltd), Aleksei Romanenko (ITMO University), Ivan Sorokin (STC), Anton Mitrofanov (STC-innovations Ltd), Vladimir Bataev (Speech Technology Center Ltd), Andrei Andrusenko (STC-innovations Ltd), Tatiana Prisyach (STC-innovations Ltd), Mariya Korenevskaya (STC-innovations Ltd), Oleg Petrov (ITMO University), Alexander Zatvornitskiy (Speech Technology Center)
Kaggle type and cool tricks (char based LM), congrats to STC

Continuous Emotion Recognition in Speech – Do We Need Recurrence?
Maximilian Schmitt (ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg), Nicholas Cummins (University of Augsburg), Björn Schuller (University of Augsburg / Imperial College London)

Self-supervised speaker embeddings
Themos Stafylakis (Omilia - Conversational Intelligence), Johan Rohdin (Brno University of Technology), Oldrich Plchot (Brno University of Technology), Petr Mizera (Czech Technical University in Prague), Lukas Burget (Brno University of Technology)
the word of the year

Better morphology prediction for better speech systems
Dravyansh Sharma (Carnegie Mellon University), Melissa Wilson (Google LLC), Antoine Bruguier (Google LLC)

Connecting and Comparing Language Model Interpolation Techniques
Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin
Worth a reminder

Articulation rate as a metric in spoken language assessment
Calbert Graham (University of Cambridge), Francis Nolan (University of Cambridge)

Selected Papers Interspeech 2019 Tuesday

Spatial and Spectral Fingerprint in The Brain: Speaker Identification from Single Trial MEG Signals 
Oral; 1000–1020
Debadatta Dash (The University of Texas at Dallas), Paul Ferrari (University of Texas at Austin), Jun Wang (University of Texas at Dallas)

Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data 
Jason Fong (University of Edinburgh), Pilar Oplustil (University of Edinburgh), Zack Hodari (University of Edinburgh), Simon King (University of Edinburgh)

Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise 
Poster; 1000–1200
Avashna Govender (The Centre for Speech Technology Research, University of Edinburgh), Anita E Wagner (Graduate School of Medical Sciences, School of Behavioural and Cognitive Neurosciences, University of Groningen), Simon King (University of Edinburgh)

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice 
Poster; 1000–1200
Vikramjit Mitra (Apple Inc.), Sue Booker (Apple Inc.), Erik Marchi (Apple Inc), David Scott Farrar (Apple Inc.), Ute Dorothea Peitz (Apple Inc.), Bridget Cheng (Apple Inc.), Ermine Teves (Apple Inc.), Anuj Mehta (Apple Inc.), Devang Naik (Apple)

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment 
Chitralekha Gupta (National University of Singapore), Emre Yilmaz (National University of Singapore), Haizhou Li (National University of Singapore)

STC Antispoofing Systems for the ASVspoof2019 Challenge 
Galina Lavrentyeva (ITMO University, Speech Technology Center), Sergey Novoselov (ITMO University, Speech Technology Center), Tseren Andzhukaev (Speech Technology Center), Marina Volkova (Speech Technology Center), Artem Gorlanov (Speech Technology Center), Alexandr Kozlov (Speech Technology Center Ltd.)

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages 
Harry Bleyan (Google), Sandy Ritchie (Google), Jonas Fromseier Mortensen (Google), Daan van Esch (Google)

Multilingual Speech Recognition with Corpus Relatedness Sampling
Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze 

On the Use/Misuse of the Term 'Phoneme' 
Roger Moore (University of Sheffield), Lucy Skidmore (University of Sheffield)

Selected Papers Interspeech 2019 Monday

Overall, it is going pretty well. Many very good papers, diarization is merging with decoding, everything is going in the right direction.

RadioTalk: a large-scale corpus of talk radio transcripts 
Doug Beeferman (MIT Media Lab), William Brannon (MIT Media Lab), Deb Roy (MIT Media Lab)

Automatic lyric transcription from Karaoke vocal tracks: Resources and a Baseline System 
Gerardo Roa (University of Sheffield), Jon Barker (University of Sheffield)

Speaker Diarization with Lexical Information
Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, Shrikanth Narayanan

Full-Sentence Correlation: a Method to Handle Unpredictable Noise for Robust Speech Recognition 
Ming Ji (Queen's University Belfast), Danny Crookes (Queen's University Belfast)

Untranscribed Web Audio for Low Resource Speech Recognition
Andrea Carmantini, Peter Bell, Steve Renals 

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data
Manasa Prasad, Daan van Esch, Sandy Ritchie, Jonas Fromseier Mortensen 

How to annotate 100 hours in 45 minutes 
Per Fallgren (KTH Royal Institute of Technology), Zofia Malisz (KTH, Stockholm), Jens Edlund (KTH Speech, Music and Hearing)

Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition 

High quality, lightweight and adaptable TTS using LPCNet
Zvi Kons (IBM Haifa research lab), Slava Shechtman (Speech Technologies, IBM Research AI), Alexander Sorin (IBM Research - Haifa), Carmel Rabinovitz (IBM Research - Haifa), Ron Hoory (IBM Haifa Research Lab)
Very nice quality

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn W. Schuller

Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition 
Khoi-Nguyen Mac (University of Illinois at Urbana-Champaign), Xiaodong Cui (IBM T. J. Watson Research Center), Wei Zhang (IBM T. J. Watson Research Center), Michael Picheny (IBM T. J. Watson Research Center)

An Investigation into On-Device Personalization of End-to-End Automatic Speech Recognition Models
Khe Chai Sim, Petr Zadrazil, Françoise Beaufays

Information flows of the future

It is interesting how similar ideas arise here and there in seemingly unrelated contexts. A recent quote from Actionable Book Summary: The Inevitable by Kevin Kelly:
And what’s next probably looks like this: Imagine zillion streams of information interacting with each other, communicating, pulsating. A new type of computer, tracking and recording everything we do. The future will be less about owning stuff and more about being part of flowing information that will supposedly make our lives easier.
Compare it to this quote from the Secushare draft:
secushare employs GNUnet for end-to-end encryption and anonymizing mesh routing (because it has a more suitable architecture than Tor or I2P) and applies PSYC on top (because it performs better than XMPP, JSON or OStatus) to create a distributed social graph.

The masking problem - capsules, specaug, bert

An important issue with modern neural networks is their vulnerability to masked corruption, that is, the random corruption of a small number of samples in an image or a sound. It is well known that humans are very robust to such noise: a person can ignore slightly but randomly corrupted pictures, sounds, and sentences. MP3 compression uses masking to drop unimportant bits of sound. Random impulse background noise usually has little effect on human speech recognition. On the other hand, it is very easy to demonstrate that modern ASR is extremely vulnerable to random spontaneous noise, and that really makes a difference: even a slight change in some frequencies can harm the accuracy a lot.

Hinton understood this problem, and that is why he proposed capsule networks as a solution. The idea is that by using agreement among a set of experts you can get a more reliable prediction that ignores the unreliable parts. Capsules are not very popular yet, but they were designed precisely to solve the masking problem.

On the other hand, Google/Facebook/OpenAI tried to solve the same problem with more traditional networks. They still use deep and connected architectures, but they decided to corrupt the dataset with masks during training and teach the model to recognize the corrupted input. And it works well too: remember the success of SpecAugment in speech recognition; BERT/RoBERTa/XLM in NLP are very good examples as well.

On the path to reproducing this idea it is important to understand one thing. Since a neural network effectively memorizes its input, to properly recognize masked images the trainer has to see all possible masks and store their vectors in the network. That means the training process has to be much, much longer and the network has to be much, much bigger. We see that in BERT. Kaldi people also saw it when they tried to reproduce SpecAugment.

Given that, some future ideas:

1. SpecAugment is not really random masking; it drops either columns or rows. I predict that a more effective masking would be to randomly drop 15% of the values over the whole 2-D spectrum, something BERT-style. I think in the near future we shall see that idea implemented.

2. The idea of masking can be applied to other sequence modeling problems in speech, for example in TTS. We shall soon see it in vocoders and in transformer/tacotron models.

3. The waste of resources in training and decoding with masking is obvious; a more intelligent architecture for recognizing masked inputs might change things significantly.
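
To make the difference in the first idea concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the function names are hypothetical, the spectrogram is a list of rows (frequency bins) by columns (time frames), masked values are simply set to zero, and the band masking is a simplified version of what SpecAugment does (one time band and one frequency band, no time warping).

```python
import random


def band_mask(spec, max_time_width, max_freq_width, rng):
    """SpecAugment-like masking (simplified): zero one band of consecutive
    time frames (columns) and one band of consecutive frequency bins (rows)."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy, the input is left untouched

    # Time mask: a block of consecutive columns.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for f in range(n_freq):
        for t in range(t_start, t_start + t_width):
            out[f][t] = 0.0

    # Frequency mask: a block of consecutive rows.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        for t in range(n_time):
            out[f][t] = 0.0
    return out


def pointwise_mask(spec, drop_fraction, rng):
    """BERT-style masking: zero a fixed fraction of individual
    time-frequency cells chosen uniformly over the whole 2-D spectrum."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    n_drop = round(drop_fraction * n_freq * n_time)
    # Sample distinct flat indices, then map each back to (row, column).
    for idx in rng.sample(range(n_freq * n_time), n_drop):
        out[idx // n_time][idx % n_time] = 0.0
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Dummy "spectrogram": 40 frequency bins by 100 frames, all ones.
    spec = [[1.0] * 100 for _ in range(40)]
    banded = band_mask(spec, max_time_width=20, max_freq_width=8, rng=rng)
    dotted = pointwise_mask(spec, drop_fraction=0.15, rng=rng)
    print(sum(v == 0.0 for row in banded for v in row), "cells zeroed by bands")
    print(sum(v == 0.0 for row in dotted for v in row), "cells zeroed pointwise")
```

Band masking removes whole contiguous regions, so the model always sees complete neighbouring rows and columns. Pointwise masking scatters the holes, so every local patch is partly damaged. Whether that actually trains a better recognizer is exactly the open question.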

Thanks to Robit Mann on @cmusphinx for the initial idea.

The theory of possibilities

I've become quite interested in predicting the future these days. One nice idea by the Russian writer Sergey Borisovich Pereslegin is that we should build the future based on the theory of possibilities rather than the theory of probabilities. This is actually a very deep idea.

Probability theory is very common these days, and everyone is applying Bayesian things here and there. But the problem with probability theory is that it can only predict probable things which are already known or have been observed before.

The theory of possibilities can discover new unknown things.

Surprisingly, this is quite a well-researched subject. For example, one can check
Possibility Theory and its Applications: Where Do we Stand? by Didier Dubois and Henri Prade

Goodbye Google+

Dear friends, as you know, Google+ is shutting down. I considered several alternatives: Facebook, Quora, LinkedIn, my old blog, Reddit, Twitter, Telegram. Unfortunately there are things I dislike in all of them.

In the age of big data we can certainly confirm that big data is just big trash. The real ideas are always hidden from public discussion under layers of disinformation. Incognito technologies are the ones that are going to rule in the next technological age. The platform to support hidden data is not going to be public anyway.

Finally, my own stupid content is nothing but ramblings; I just translate the ideas I see around the web. It is way more interesting to read about my friends and colleagues. Not just technology news, but tiny personal things, even opinionated ones, are of great importance and interest. It is very sad to lose such a place as Google+ was. Ideally it would be nice to meet and discuss in person, but it's not always easy.

So it would certainly be great if you joined a few of the sites below and shared your ideas and knowledge. I hope it will be beneficial for all of us.

https://t.me/speech_recognition - the new Telegram channel about speech recognition. I like Telegram for the UI simplicity and speed, for the elegance of its technical solutions and its extension capabilities. I'm also quite excited about the Russian origins of Telegram, supposedly the product of Russian intelligence agencies.

https://www.quora.com/q/usejrrgnezvhiyup - the space on Quora. I find some content on Quora quite offensive, but I also find many extremely interesting answers there from really nice people. I also find that Quora is very helpful for establishing new connections and promoting ideas.

https://www.linkedin.com/groups/8614109/ - over the years I have found LinkedIn extremely useful for business and establishing connections. They screwed up group discussions, they screwed up the UI, they lost many opportunities, but they still remain the top business network. Hopefully they will catch up with the issues they have. I hope the group is going to be a useful channel to get in touch and learn more about recent developments.

I also hope to continue with the blog, update our company website and continue with development. More on that in the future.

Please let me know what your opinion is.