Sphinx4 Powers Contemporary Art

Did you think that sphinx4 could only be used to build yet another voice keyboard, help you track a sales manager blaming the product, or transcribe medical dictation? Working with computers on a daily basis, one starts to see them as mere tools. I used to think this way too, overlooking the fact that the act of speech itself, powered by computers, probably has a sacral meaning. Communication is what created our minds, and keyboards hardly matter when we build communication systems.

What pushed me toward this view was Heather Dewey-Hagborg's blog, in particular the Listening Post, an art installation at the CEPA gallery. If you are interested, also check Heather's interview on BTR Radio, as well as the gallery's site.

An important point here is that we should not treat this as some kind of futurism - talking computers, HAL and all that stuff. Instead, such works help us change ourselves and change our vision of the world around us. Perhaps next time you will look at the sphinx4 sources from a different point of view.

Great Overview Article

Today Dr. Tony Robinson gave me a present by mentioning this great article on comp.speech.research:

Janet M. Baker, Li Deng, James Glass, Sanjeev Khudanpur, Chin-Hui Lee, Nelson Morgan, and Douglas O'Shaughnessy

Research Developments and Directions in Speech Recognition and Understanding, Part 1
Research Developments and Directions in Speech Recognition and Understanding, Part 2

This article was the "MINDS 2006–2007 Report of the Speech Understanding Working Group," one of five reports emanating from two workshops titled "Meeting of the MINDS: Future Directions for Human Language Technology," sponsored by the U.S. Disruptive Technology Office (DTO). For me it was striking that spontaneous speech events are so important; I had never thought about them from this point of view.

The whole state of the field is also nicely described in Mark Gales's talk "Acoustic Modelling for Speech Recognition: Hidden Markov Models and Beyond?" The picture on the left is taken from it.

Blizzard Challenge 2010

Since I worked in TTS for a long time and am still interested in it, I have been waiting a long time for this - the Blizzard Challenge team is ready to accept speech experts and volunteer listeners for the Blizzard Challenge 2010.

The challenge was devised in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

After the evaluation, participants submit papers describing the methods they used and the problems they solved. You can find more information on the webpage http://festvox.org/blizzard/

KISS Principle

Still think you can take the sphinx4 engine and make a state-of-the-art recognizer? Check what the AMI RT-09 entry does for meeting transcription, as described in the RT'09 workshop presentation "The AMI RT'09 STT and SASTT Systems":

  1. Segmentation
  2. Initial decoding of the full meeting with:

    • 4g LM based on 50K vocabulary and weak acoustic model (ML) M1
    • 7g LM based on 6K vocabulary and strong acoustic model (MPE) M2
  3. Intersect output and adapt (CMLLR)
  4. Decode using M2 models and a 4g LM on the 50k vocabulary
  5. Compute VTLN/SBN/fMPE
  6. Adapt SBN/fMPE/MPE models M3 using CMLLR
  7. Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and output of previous stage
  8. Generate 4g lattices with adapted M4 models
  9. Rescore using M1 models and CMLLR + MLLR adaptation
  10. Compute Confusion networks
Click on the image to check the details of the process.
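To make the shape of such a multi-pass system concrete, here is a minimal orchestration sketch in Python. Every function in it (segment, decode, intersect, adapt, rescore) is a hypothetical stub standing in for a real AMI component; only the control flow between passes mirrors the numbered stages above, and the final confusion-network step is left out.

```python
# Toy sketch of an AMI-style multi-pass transcription flow.
# All functions are illustrative stubs, not real decoder APIs.

def segment(audio):
    """Stage 1: split meeting audio into speech segments (toy marker split)."""
    return audio.split("|")

def decode(segments, model, lm):
    """Decode each segment with a given acoustic model tag and LM tag (stub)."""
    return [f"{model}/{lm}:{s}" for s in segments]

def intersect(hyp_a, hyp_b):
    """Stage 3: keep regions where two first-pass outputs agree on the words."""
    return [a for a, b in zip(hyp_a, hyp_b)
            if a.split(":")[1] == b.split(":")[1]]

def adapt(model, supervision):
    """CMLLR-style adaptation: returns an 'adapted' model tag (stub)."""
    return f"{model}+cmllr[{len(supervision)}]"

def rescore(lattices, model):
    """Stage 9: rescore lattices with a second model set (stub)."""
    return [f"{model}<<{lat}" for lat in lattices]

def transcribe_meeting(audio):
    segs = segment(audio)                       # 1. segmentation
    h1 = decode(segs, "M1", "4g-50k")           # 2. weak AM, big LM
    h2 = decode(segs, "M2", "7g-6k")            #    strong AM, small LM
    agreed = intersect(h1, h2)                  # 3. intersect outputs,
    m2a = adapt("M2", agreed)                   #    adapt on agreed regions
    h3 = decode(segs, m2a, "4g-50k")            # 4. second decoding pass
    m3 = adapt("M3-sbn-fmpe", h3)               # 5-6. SBN/fMPE models, CMLLR
    h4 = decode(segs, m3, "4g-50k")
    m4 = adapt("M4-lcrcbn-fmpe", h4)            # 7. adapt M4 on previous output
    lattices = decode(segs, m4, "4g-50k")       # 8. generate 4g lattices
    return rescore(lattices, adapt("M1", lattices))  # 9. M1 rescoring pass
    # 10. confusion networks would be built from the rescored lattices

print(transcribe_meeting("hello world|how are you"))
```

The point of the sketch is simply that each later pass is supervised by the output of an earlier one, which is why a single-pass setup cannot match such systems.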