nsh - Speech Recognition With CMU Sphinx

Saturday, November 7, 2009

Rhythm of British English in Festival

It's interesting how ideas surface from time to time in seemingly unrelated places. I recently read a nice post on John Wells's blog about proper RP English rhythm, and now the issue has come up again on the gnuspeech mailing list, where Dr. Hill cited his work:

JASSEM, W., HILL, D.R. & WITTEN, I.H. (1984) Isochrony in English speech: its statistical validity and linguistic relevance. Pattern, Process and Function in Discourse Phonology (collection ed. Davydd Gibbon), Berlin: de Gruyter, 203-225 (J)

I spent some time thinking about how this rhythm is handled in Festival and came to the conclusion that there is no such entity there. It's probably handled implicitly by the CART trees for duration and intonation prediction, but not as a separate entity. Though many voices are supposed to be US English, I still think they could benefit from proper rhythm prediction. Try the example from the movie, "This is the house that Jack built", with the arctic voices. Check whether Jack gets enough stress.

Sunday, October 18, 2009

Blizzard 2009 results available

It was pleasant to find out that the results of the Blizzard Challenge 2009 are now available. Thanks a lot to the organizers and participants!

Reading the articles took me half a day, spent solving the usual Einstein-type puzzle of figuring out who got the best results and what had changed. Unfortunately, it takes too much time to read everything in detail. There is no summary of the methods and systems used this year, the achievements since last year, or explanations of the results. I could only start with the following:

  1. iFlytek Speech Lab and IVO Software are still the best. Unit selection systems win.
  2. DFKI, which I was a fan of, unfortunately can't reach a commercial level even with unit selection. That probably means unit selection is not the only key factor.
  3. I like the progress muXac and Mike have been making over the years.
  4. The ES3 task of building a voice from a small amount of speech is kind of senseless. Don't we want to use voice adaptation in this case?
  5. It's interesting that machine learning for join and target cost optimization is popular nowadays.
  6. Though there was a telephone TTS task, it seems to me that nobody did anything specific to TTS over telephone lines. The differences shouldn't be large; only the 8 kHz bandwidth is an issue, or even an advantage. But even this point is not covered in any of the articles, or at least I didn't notice it.
Short summary on systems:
  • Aholab - unit selection, spent one day on building the voice so nothing good to expect
  • WISTON - Mandarin prosody is a key feature, but the article doesn't describe the challenge
  • Cereproc - experiment with combining HTS and unit selection, bad results for an unknown reason, 4 man-days spent
  • CMU - article is not available, but you can try clustergen yourself in stock festival
  • CSTR - CSTR has started investigations on HTS methods. Good start, no results yet.
  • DFKI - spent a year on adding Turkish TTS and Mary 4.0 implementation
  • Edinburgh/Idiap - interesting unsupervised entry, results are obviously lower
  • I2R - good TTS, unit selection
  • Ivona - unit selection with pitch modifications by interestingly named algorithm, best English one together with iFlytek
  • CircumReality - unit selection with pitch modification by TD-PSOLA, best progress over years
  • NICT - HTS, GV, MGE and a lot of math
  • NIT - HTS with STRAIGHT, best HTS here, best Mandarin as well
  • NTUT - Mandarin HTS, not so interesting
  • PKU - Another Mandarin HTS with STRAIGHT
  • Toshiba - Good unit selection system, interesting method for fuzzy combination of units.
  • iFlytek - HMM-driven unit selection, best English one together with Ivona.
  • VUB - unit selection with WPSOLA, average, though interesting link on SPRAAK open source recognition toolkit, which is not completely open but has interesting description.
Still, the challenge itself is very interesting and I'm looking forward to the results of the next one.

Friday, October 16, 2009

Another cool bit of hardware for database training.

It's sometimes hard to quickly adopt the new opportunities the world provides. I'm now reading The Innovator's Dilemma by Clayton M. Christensen. Thanks to Ellias for the advice, it really seems like a good book.

The interesting thing is that the author starts with a description of the hard drive industry as the fastest-moving one, with innovations outpacing customer needs. And what do you think? The hard drive industry strikes back with SSD drives. Well, I had read that they exist but didn't understand their value for acoustic model training. Even without profiling it's clear they will be extremely useful.

Say you have a medium-size acoustic database of 60 hours, a few gigabytes in size. If you want to process it fast, you need an 8-core machine. Here comes the bottleneck: imagine 8 processes reading feature vectors from disk in an almost random order. No need to guess that the hard drive will be very busy trying to fetch all the data required. An SSD could definitely help here; I really need to try it soon.
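To put a rough number on "a few gigabytes", here is a back-of-the-envelope sketch in Java. The frame rate (100 frames per second), the vector size (39 floats) and the 4 bytes per value are my assumptions for illustration, not measurements of any particular database.

```java
// Back-of-the-envelope size of a feature database.
// All numbers passed in main() are assumptions for illustration.
class FeatureSizeEstimate {

    // Bytes needed to store features for the given amount of audio.
    static long featureBytes(double hours, int framesPerSecond,
                             int coefficients, int bytesPerValue) {
        long frames = Math.round(hours * 3600.0 * framesPerSecond);
        return frames * coefficients * bytesPerValue;
    }

    public static void main(String[] args) {
        // 60 hours, 100 frames/s, 39 coefficients, 4-byte floats (assumed).
        long bytes = featureBytes(60, 100, 39, 4);
        System.out.printf("frames: %d%n", bytes / (39 * 4));
        System.out.printf("size: %.2f GB%n", bytes / 1e9);
    }
}
```

With these assumptions, 60 hours comes to 21.6 million frames and roughly 3.4 GB of features, all of which 8 training processes would be pulling from the same disk in small scattered reads.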

Tuesday, September 29, 2009

CMU Sphinx Users and Developers Workshop 2010

I'm happy to announce

The First CMU Sphinx Workshop

20 March 2010, Dallas, TX, USA

Event URL: http://www.cs.cmu.edu/~sphinx/Sphinx2010

Papers are solicited for the CMU Sphinx Workshop for Users and Developers (CMU-SPUD 2010), to be held in Dallas, Texas as a satellite to ICASSP 2010.

CMU Sphinx is one of the most popular open source speech recognition systems. It is currently used by researchers and developers in many locations world-wide, including universities, research institutions and in industry. CMU Sphinx's liberal license terms have made it a significant member of the open source community and have provided a low-cost way for companies to build businesses around speech recognition.

The first SPUD workshop aims at bringing together CMU Sphinx users, to report on applications, developments and experiments conducted using the system. This workshop is intended to be an open forum that will allow different user communities to become better acquainted with each other and to share ideas. It is also an opportunity for the community to help define the future evolution of CMU Sphinx.

We are planning a one-day workshop with a limited number of oral presentations, chosen for breadth and stimulation, held in an informal atmosphere that promotes discussion. We hope this workshop will expose participants to different perspectives and that this in turn will help foster new directions in research, suggest interesting variations on current approaches and lead to new applications.

Papers describing relevant research and new concepts are solicited on, but not limited to, the following topics. Papers must describe work performed with CMU Sphinx:
  • Decoders: PocketSphinx, Sphinx-2, Sphinx-3, Sphinx-4
  • Tools: SphinxTrain, CMU/Cambridge SLM toolkit
  • Innovations / additions / modifications of the system
  • Speech recognition in various languages
  • Innovative uses, not limited to speech recognition
  • Commercial applications
  • Open source projects that incorporate Sphinx
  • Novel demonstrations
Manuscripts must be between 4 and 6 pages long, in standard ICASSP double-column format. Accepted papers will be published in the workshop proceedings.

Important Dates

Paper submission: 30 November 2009
Notification of paper acceptance: 15 January 2010
Workshop: 20 March 2010

Organizers

Bhiksha Raj - Carnegie Mellon University
Evandro Gouvêa - Mitsubishi Electric Research Labs
Richard Stern - Carnegie Mellon University
Alex Rudnicky - Carnegie Mellon University
Rita Singh - Carnegie Mellon University
David Huggins-Daines - Carnegie Mellon University
Nickolay Shmyrev - Nexiwave
Yannick Estève - Laboratoire d'Informatique de l'Université du Maine

Contact

To email the organizers, please send email to sphinx+workshop@cs.cmu.edu

Saturday, September 26, 2009

Using HTK models in sphinx4

As of yesterday, the long-awaited cool patch by Christophe Cerisara, with the help of the super fast Yaniv Kunda, has landed in svn trunk. Now you can use HTK models directly from sphinx4. It's not that easy though, and I spent a few hours today figuring out what is required, so here is a little step-by-step howto:

1. Update to sphinx4 trunk

2. Download a small model, because binary loading is unfortunately not supported yet and it takes a lot of resources to load a model from a huge text file. Get a model from Keith Vertanen

http://www.inference.phy.cam.ac.uk/kv227/htk/htk_wsj_si84_2750_8.zip

3. Convert model to text format with HTK HHEd


mkdir out
touch empty
HHEd -H hmmdefs -H macros -M out empty tiedlist


4. Replace the model in the Lattice demo configuration file:


<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="wsjLoader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.HTKLoader">
<property name="logMath" value="logMath"/>
<property name="modelDefinition" value="/home/shmyrev/sphinx4/wsj/out/hmmdefs"/>
<property name="unitManager" value="unitManager"/>
</component>


Please note that the modelDefinition property points to the location of the newly created hmmdefs file.

5. Replace the frontend configuration to load HTK features from a file. Unfortunately it's impossible to create HTK features with the sphinx4 frontend right now, but I hope this will be implemented soon. Some bits are already present, like the DCT-II transform in frontend.transform.DiscreteCosineTransform2; some are easy to set up, like proper filter coefficients; some are missing. So for now we'll recognize an MFC file instead.


<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item> streamHTKSource </item>
</propertylist>
</component>
<component name="streamHTKSource" type="edu.cmu.sphinx.frontend.util.StreamHTKCepstrum">
<property name="cepstrumLength" value="39"/>
</component>


and let's change the Java file so that the source reads from input.mfc:


StreamHTKCepstrum source = (StreamHTKCepstrum) cm.lookup ("streamHTKSource");
InputStream stream = new FileInputStream(new File ("input.mfc"));
source.setInputStream(stream);


6. Now let's extract the MFC features. Create a config file for HCopy:


SOURCEFORMAT = WAV
TARGETKIND = MFCC_D_A_Z_0
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
ZMEANSOURCE = T
USEPOWER = T


and run it


HCopy -C config 10001-90210-01803.wav input.mfc


Make sure input.mfc is located in the top-level sphinx4 folder, since that is where the demo will look for it.
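Before running the demo it can be worth checking that the file really contains 39-dimensional vectors, since that is what cepstrumLength is set to in the configuration above. Here is a minimal sketch that reads the 12-byte HTK header (number of frames, sample period in 100 ns units, bytes per frame, parameter kind). It assumes the file is uncompressed and written in HTK's default big-endian byte order; the class name HtkHeaderCheck is just something I made up for illustration and is not part of sphinx4 or HTK.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: reads the 12-byte header of an HTK parameter file.
class HtkHeaderCheck {
    final int numFrames;      // number of feature vectors in the file
    final int samplePeriod;   // frame shift in 100 ns units
    final int bytesPerFrame;  // size of one feature vector in bytes
    final int parmKind;       // base kind plus qualifier bits

    HtkHeaderCheck(InputStream in) throws IOException {
        // DataInputStream reads big-endian values, HTK's default byte order.
        DataInputStream data = new DataInputStream(in);
        numFrames = data.readInt();
        samplePeriod = data.readInt();
        bytesPerFrame = data.readUnsignedShort();
        parmKind = data.readUnsignedShort();
    }

    // Convenience for parsing a header that is already in memory.
    static HtkHeaderCheck fromBytes(byte[] header) {
        try {
            return new HtkHeaderCheck(new ByteArrayInputStream(header));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumes uncompressed data: each coefficient is a 4-byte float.
    int vectorLength() {
        return bytesPerFrame / 4;
    }

    // 10000 units of 100 ns make one millisecond.
    double frameShiftMs() {
        return samplePeriod / 10000.0;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "input.mfc";
        try (InputStream in = new FileInputStream(path)) {
            HtkHeaderCheck h = new HtkHeaderCheck(in);
            System.out.printf("frames=%d shift=%.1f ms vector=%d parmKind=0%o%n",
                    h.numFrames, h.frameShiftMs(), h.vectorLength(), h.parmKind);
        }
    }
}
```

With the HCopy config above I would expect a vector length of 39 (12 cepstra plus C0, with deltas and accelerations) and a frame shift of 10 ms, which corresponds to TARGETRATE = 100000.0.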

7. Now everything is ready


ant && java -jar bin/LatticeDemo.jar


Check the result


I heard: once or a zero zero one nine oh to one oh say or oil days or a jury


It's not very precise, but still ok for such a small model and limited language model.

This is still a work in progress and a lot of things are still pending. The most important are reading binary HTK files, frontend adaptation, cleanup and unification. But I really look forward to the results, since it's a promising approach. There are not so many BSD-licensed HTK decoders out there.

Wednesday, September 16, 2009

Speech Recognition As Experimental Science

It's well known that there are two types of physics: theoretical and experimental. At school I always liked doing the latter: measuring the speed of a ball or a voltage, plotting graphs and so on. Unfortunately, later on I was mostly doing math or programming. Only recently, when I started to spend a lot of time on speech recognition, did I find out why I like it so much - it's also an experimental science.

When you build a speech recognition system, your time is mostly spent on all these beautiful things: setting up the database training, running the learning process, tracking the results. You are trying to understand nature and find its laws; you want to find the best feature set and phoneset, tune the beams, and more and more. You have experimental material, and sometimes it turns out there are things you forgot to take into account. It's an activity that's really encouraging.

Of course there are important drawbacks; issues like proper design of experiments arise. Unfortunately this is not widely described in the literature, but speech recognition experiments are just one kind of experiment, so all the usual issues apply to them. To list a few:
  • Reproducibility
  • Connection of the theory and the practice
  • Estimation of the results and their validity
For example, the last point is very important. Currently, when we run the database test, we just get a number. We try to rely on it without even estimating the deviation and other very important attributes of every scientific measurement. As a result we make unreliable decisions, like I did with the MLLT transform. I now think that we should be more careful about that.
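As a rough illustration of what estimating the deviation could look like, here is a small Java sketch that puts a 95% interval around a word error rate using the normal approximation to the binomial. It treats every word as an independent trial, which is a simplification (errors cluster within utterances, and insertions can even push WER above 100%), so the real uncertainty is probably larger. The numbers in main() are hypothetical.

```java
// Normal-approximation 95% interval for a word error rate.
// Assumes independent word errors, which is only a rough model.
class WerInterval {

    // Half-width of the 95% interval for an error rate measured on numWords words.
    static double halfWidth95(double wer, int numWords) {
        return 1.96 * Math.sqrt(wer * (1.0 - wer) / numWords);
    }

    public static void main(String[] args) {
        double wer = 0.20;     // hypothetical: 20% WER
        int numWords = 5000;   // hypothetical: 5000 words in the test set
        double hw = halfWidth95(wer, numWords);
        System.out.printf("WER = %.1f%% +/- %.1f%%%n", wer * 100, hw * 100);
    }
}
```

For these hypothetical numbers the interval is about plus or minus 1.1 points, so a change from 20.0% to 19.5% on such a test set would not tell us much by itself.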

So that's why I started with the aforementioned Wikipedia page, trying to find a good book on experiment design. Of course, it would also be nice to find appropriate software for managing the experiment workflow.

Thursday, September 10, 2009

The First Glance On The Interspeech 2009 Papers

Interspeech 2009 in Brighton is over today. Unfortunately I wasn't able to participate for various reasons. Still, it was very interesting to review the list of sessions and abstracts and to read some of the articles available. The current activity in speech research is amazing; the number of articles and groups is enormous. In total I counted 459 abstracts with grep. It was enjoyable to process them all. So far I have reduced the list to 50% of its original size, so I still need a few more passes to find the most interesting ones. A few random thoughts I've got:

Sphinx is mentioned 2 times and HTK only once :), that's a win. Of course many researchers use HTK for experiments. So it's more a win in being more open.

A lot of machine learning research. Quite a significant amount of it is dedicated to yet another target space representation, classifier or cost function adjustment. The first glance didn't show anything interesting here, unfortunately. Discriminative training is probably the most recent advance in ASR.

Still an enormous amount of old-style phonetic research. Is vowel length a feature? How do Zulu people click? Sometimes it's interesting to read though.

Almost all TTS work is about HMMs for speech synthesis. The audio quality of TTS is a problem. I've recently read the good and very detailed review by Dr. Zen; even adepts of the approach know that a hybrid of HMM and unit selection is better.

Surprisingly short section on new methods and paradigms, unfortunately.

New trends include emotions, machine speech-to-speech translation and language acquisition. Combination of visual and speech recognition is surprisingly common.

No Russians at all. Well, not strange, Russian speech technology doesn't exist in fact.

The RWTH Aachen University Open Source Speech Recognition System is terrific news. The source is available, downloaded and ready for investigation.

"Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate", no link available yet unfortunately. It should be a very interesting read. The only problem that arises here is that someone has to do the merge. The source is available, but changes made in a research-oriented system are really very hard to integrate.

I'm also waiting for the Blizzard 2009 results, which should have been presented but are still not available.

A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game - we wanted that for a long time for Voxforge.

In the next few posts I'll probably cover some interesting topics in more detail. If you were at the conference or saw something interesting, comments are appreciated.
