Fillers in WFST

Another practical question is - how do you integrate fillers? There is silence class introduced in

A GENERALIZED CONSTRUCTION OF INTEGRATED SPEECH RECOGNITION TRANSDUCERS by Cyril Allauzen, Mehryar Mohri, Michael Riley and Brian Roark

and implemented in transducersaurus.

But you know each practical model has more than just a silence. Fillers like noise, silence, breath, laugh they all go to specific senones in the model. I usually try to minimize them during the training for example joining all them ums, hmms, and mhms into a single phone but I still think they are needed. How to integrate them when you build WFST recognizer?

So I tried few approaches. For example instead of adding just a <sil> class in T transducer I tried to create many branches for each filler. As a result final cascade expands to a huge moster. Like if cascade was 50mb after combination with 1 silence class it is 100mb but after 3-4 classes it's 300mb. Not a nice thing to do.

So I ended in dynamic expansion of silence transitions like this:

if edge is silence:
   for filler in fillers:
      from node.add_edge(filler)

This seems to work well.

4 comments:

Post a Comment