Training a language model with fragments

Sequitur G2P by M. Bisani and H. Ney is a cool package for letter-to-phone translation: quite accurate and, most importantly, open. But there are also some hidden gems in this package :)

One of them is the phone-oriented segmenter that splits words into chunks called graphones. A graphone is a joint unit consisting of a few letters and the corresponding phones; words are built from sequences of graphones. Graphones are used internally in g2p, but they are also very useful for building open-vocabulary models. The system as a whole is described here:

Open-Vocabulary Spoken Term Detection Using Graphone-Based Hybrid Recognition Systems by M. Akbacak, D. Vergyri and A. Stolcke

and the details of the language model in the original article:

Open Vocabulary Speech Recognition with Flat Hybrid Models by Maximilian Bisani and Hermann Ney
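
To make the notion of a graphone concrete, here is a minimal Python sketch. The `Graphone` class, the helper functions and the segmentation of "mixing" are all hypothetical and hand-made for illustration; this is not Sequitur's internal representation or its actual output.

```python
from typing import List, NamedTuple, Tuple


class Graphone(NamedTuple):
    """Hypothetical joint unit: a few letters plus the phones they map to."""
    letters: str
    phones: Tuple[str, ...]


def spell(graphones: List[Graphone]) -> str:
    """Concatenate the letter sides to recover the spelling."""
    return "".join(g.letters for g in graphones)


def pronounce(graphones: List[Graphone]) -> List[str]:
    """Concatenate the phone sides to recover the pronunciation."""
    return [p for g in graphones for p in g.phones]


# Hand-made segmentation of "mixing" (an assumption, not Sequitur output).
mixing = [
    Graphone("m", ("M",)),
    Graphone("i", ("IH",)),
    Graphone("x", ("K", "S")),
    Graphone("i", ("IH",)),
    Graphone("ng", ("NG",)),
]

print(spell(mixing), pronounce(mixing))
```

A language model trained over such units can, in principle, assign probability to a word it has never seen, because the word can be spelled and pronounced from known graphones.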

The interesting thing is that all the required components are already available; the only issue is finding the right options and building the system. So the quick recipe is:

1. Get Sequitur G2P.
2. Patch it to support Python 2.5 (replace elementtree with xml.etree, since the standalone elementtree package is deprecated now).
3. Convert the cmudict lexicon to the XML-based Bliss format (I'm not sure what exactly it is; I failed to find information about it on the web):

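import sys
from xml.sax.saxutils import escape

# Convert a cmudict-style file ("WORD PH1 PH2 ..."), given as the first
# argument, to XML on standard output.
print("<lexicon>")
lexicon = open(sys.argv[1], "r")

for line in lexicon:
    # Skip cmudict comment lines, which start with ";;;".
    if line.startswith(";;;"):
        continue
    toks = line.strip().split()
    # Skip blank or malformed lines that have no pronunciation.
    if len(toks) < 2:
        continue
    word = toks[0]
    phones = " ".join(toks[1:])
    # Escape &, < and > so that punctuation entries stay valid XML.
    print("<orth>")
    print(escape(word))
    print("</orth>")
    print("<pron>")
    print(escape(phones))
    print("</pron>")

lexicon.close()
print("</lexicon>")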

4. Train the segmenter model. The most complicated part is figuring out the option that trains a multigram model with several phones per graphone. The default model used in g2p has graphones of 1 letter and 1 phone, which is not suitable for an OOV language model. The options are:

--model model-1 --ramp-up --train cmudict.0.7a.train --devel 5% --write-model model-2 -s 0,2,0,2

5. Ramp up the model to make it more precise.
6. Build the language model; here you need the dictionary in XML format. As the article above describes, the original lexicon should be around 10k words and the subliminal training lexicon should be 50k or so. The options are:

--order=4 -l cmudict.xml --subliminal-lexicon=cmudict.xml.test -g model-2 --write-lexicon=res.lexicon --write-tokens=res.tokens
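
A note on the -s 0,2,0,2 option from step 4. The sketch below assumes that the four numbers mean "minimum letters, maximum letters, minimum phones, maximum phones" per graphone; that reading is an assumption here, so check the tool's help output. Under that assumption, the hypothetical helper `count_segmentations` counts how many ways a word and its pronunciation can be jointly split into graphones, which shows how much larger the search space becomes once graphones may hold up to two symbols on each side.

```python
from functools import lru_cache


def count_segmentations(letters, phones, lmin=0, lmax=2, pmin=0, pmax=2):
    """Count joint splits of (letters, phones) into chunks whose letter side
    has lmin..lmax symbols and whose phone side has pmin..pmax symbols.
    Chunks that are empty on both sides are not allowed."""
    n_letters, n_phones = len(letters), len(phones)

    @lru_cache(maxsize=None)
    def count(i, j):
        # i letters and j phones have been consumed so far.
        if i == n_letters and j == n_phones:
            return 1
        total = 0
        for dl in range(lmin, lmax + 1):
            for dp in range(pmin, pmax + 1):
                if dl == 0 and dp == 0:
                    continue  # an empty chunk would loop forever
                if i + dl <= n_letters and j + dp <= n_phones:
                    total += count(i + dl, j + dp)
        return total

    return count(0, 0)


word = "mixing"
pron = ("M", "IH", "K", "S", "IH", "NG")
print(count_segmentations(word, pron, 1, 1, 1, 1))  # one letter, one phone
print(count_segmentations(word, pron, 0, 2, 0, 2))  # assumed meaning of -s 0,2,0,2
```

With strictly one letter and one phone per chunk there is exactly one split when both sides have the same length, and none when the lengths differ; the looser constraints admit many more candidates for training to choose from.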

After that you get tokens for the LM and, with additional options, even counts for a language model that you can train with SRILM. I haven't finished the previous step yet, so this post should have a follow-up.
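
As a rough illustration of what "counts for the language model" means, here is a small Python sketch that counts n-grams up to order 4 over token sequences. The token strings and the helper names are made up for the example, and the layout of the real tokens file is not shown here, so the input format is an assumption. The output lines use the "words, tab, count" layout that SRILM reads for count files, as far as I know.

```python
from collections import Counter


def ngram_counts(sentences, order=4):
    """Count all n-grams of length 1..order, padding each sentence with
    <s> and </s> markers."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


def format_counts(counts):
    """Render counts as 'w1 w2 ... wN<TAB>count' lines, sorted for stability."""
    return ["%s\t%d" % (" ".join(ngram), c) for ngram, c in sorted(counts.items())]


# Hypothetical mix of a whole word and graphone-like fragments.
example = [["hello", "m:M", "i:IH", "x:K_S"]]
for line in format_counts(ngram_counts(example, order=2)):
    print(line)
```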

I'm going to ClueCon

This August I'm going to the US again, to Chicago, for ClueCon, where I'll give a talk titled "The use of open source speech recognition". Here is a short outline:

The most complicated thing in modern ASR is making user expectations agree with the actual capabilities of the technology. Although the technology itself can provide a number of potentially very useful features, they are not exactly what the average user expects.

Many specialized tasks require a huge amount of customization; for example, speaker adaptation needs to be carefully integrated into the accounting system so that the recognizer can improve its accuracy.

Open source solutions could help here because of their much greater flexibility. But although many companies provide speech recognition services, only a few such projects exist, and most of them are purely academic. They often require a lot of tuning for the end user, and many parts of a complete system are simply missing.

Luckily, the situation has been improving in recent years: the core components are getting more or less stable release schedules and active support, including commercial support.

The purpose of this talk is to cover the trends in the development of open source speech recognition in conjunction with telephony systems and to suggest ways it can reach the enterprise level.

I'll also visit Boston for two days.

Update: Here is the presentation

Gran Canaria

I'm on Gran Canaria!