One of them is the phone-oriented segmenter that splits the words on chunks - graphones. Graphone is a joint object consisting of letters and corresponding phones that combine words. Graphones are used in g2p internally, but for example they are very useful in construction of the open vocabulary models. The system as a whole is described here:
Open Vocabulary Spoken Term Detection Using Graphone-Based Hybrid recognition System by M. Acbacak, D. Virgyri and A. Stolke
and the details of the language model in the original article:
Open Vocabulary Speech Recognition with Flat Hybrid Models by Maximilian Bisani and Hermann Ney
The interesting thing is that all required components are already available, the issue is to find correct option and build the system. So the quick reciept is:
1. Get Sequitur G2p
2. Patch it to support Python 2.5 (replace elementtree with xml.etree, since elementtree is deprecated now)
3. Convert cmudict lexicon to xml-based Bliss format (I'm not sure what's it, I failed to find information about it on the web)
import sys import string print "<lexicon>" file = open(sys.argv, "r") for line in file: toks = line.strip().split() if len(toks) < 2: continue word = toks phones = string.join(toks[1:]," ") print "<orth>" print word print "</orth>" print "<pron>" print phones print "</pron>" print "</lexicon>"
4. Train the segmenter model. The most complicated thing is to figure option to train multigram model with several phones. Default one used in g2p consist of 1 phone and 1 letter, it's not suitable for OOV language model.
g2p.py --model model-1 --ramp-up --train cmudict.0.7a.train --devel 5% --write-model model-2 -s 0,2,0,2
5. Ramp up the model to make it more precise
6. Build the language model, here you need the dictionary in XML format. As the article above describes, the original lexicon should be around 10k, the subliminal training lexicon should be 50k or so.
makeOvModel.py --order=4 -l cmudict.xml --subliminal-lexicon=cmudict.xml.test -g model-2 --write-lexicon=res.lexicon --write-tokens=res.tokens
After that you can get a tokens for lm and with additional options even a counts for the language model you could train with SRILM. I haven't finished the previous step yet, so this post should have follow up.