Mary TTS 4.3.0 released

With a Russian voice from the Voxforge DB. Yay! Try it on the web:

http://mary.dfki.de:59125/

Mary is definitely a superior system compared to Festival. The graphical UI, modern design, and support for various things like automatic dictionary creation make it really easy to build language support. And thanks to the modular and stable codebase, one can easily add support for a new feature, integrate with an external package as is done with OpenNLP, or just fix a bug. And your fix will be accepted!

By the way, there are two more Voxforge TTS datasets pending - German and Dutch - and there is also a Polish voice. If anyone wants to try, it should be really easy to add another language to Mary.

Phoneset with stress

So I finally finished testing the stress-aware model. It took me a few months, and in the end I can say that lexical stress is definitely better. It provides better accuracy and, more importantly, more robustness than a model with an unstressed phoneset.

I hope we'll retrain all the other models we have with the stressed phoneset. It's great that CMUdict provides enough information to do that. The story of testing this was quite interesting. I believed in stress for a long time but wasn't able to prove it. In theory it's clear why it helps: when the speech rate changes, stressed syllables remain less corrupted than unstressed ones, so we get better control over the data. Additional information like lexical stress is important. Of course, the issue is the increased number of parameters to train in the model; that's probably why early investigations concluded that a phoneset without stress is better. A discussion on cmusphinx-devel this summer also confirmed that Nuance moved to a stressed model in their automotive decoder.
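To make the comparison concrete, here is a minimal Python sketch (my own illustration, not code from any trainer) of how both phonesets can be derived from the same CMUdict entry: the stressed phoneset keeps AH0/AH1/AH2 as distinct units, while the unstressed one collapses them all to AH.

    import re

    # CMUdict marks lexical stress with a digit on each vowel:
    # 0 = no stress, 1 = primary, 2 = secondary.
    def phones(pron, keep_stress=True):
        """Split a CMUdict pronunciation into phones, optionally
        stripping the stress digits to get the unstressed phoneset."""
        result = pron.split()
        if not keep_stress:
            result = [re.sub(r'\d$', '', p) for p in result]
        return result

    # Entry "ABANDON  AH0 B AE1 N D AH0 N" from cmudict:
    pron = "AH0 B AE1 N D AH0 N"
    print(phones(pron))                     # ['AH0', 'B', 'AE1', 'N', 'D', 'AH0', 'N']
    print(phones(pron, keep_stress=False))  # ['AH', 'B', 'AE', 'N', 'D', 'AH', 'N']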

It's interesting how long the testing took. I made numerous attempts, and each one had bugs:

  • The first attempt used bad features (adapted for 3gp) and didn't show any improvement.
  • The number of senones in the second training was too small, since I didn't know the reason for the first failure.
  • The third attempt had an issue with automatically generated questions, which were accidentally used instead of the manual ones I wrote, and it went unnoticed.
  • The fourth attempt was rejected because there were issues with the dictionary format in Sphinx4. By the way, never use FastDictionary; use FullDictionary instead. FastDictionary expects a specific dictionary format where pronunciation variants are numbered consecutively like (2) (3) (4), not (1), or (2) followed by (4) (see the sketch after this list).
  • Only the fifth attempt was good, but it showed improvement only on the big test set and not on the small one.
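For illustration, here is a hedged sketch of a checker for that numbering pitfall. It is not FastDictionary's actual parser, just the convention described above encoded as a script: the base entry carries no number, and alternatives must run (2), (3), (4) with no gaps.

    import re
    from collections import defaultdict

    def check_variant_numbering(dict_lines):
        """Flag dictionary entries whose pronunciation variants are not
        numbered consecutively starting from (2)."""
        variants = defaultdict(set)      # word -> variant numbers seen
        for line in dict_lines:
            if not line.strip() or line.startswith(';;'):
                continue                 # skip blanks and comments
            head = line.split()[0]
            m = re.match(r'^(.*)\((\d+)\)$', head)
            if m:
                variants[m.group(1)].add(int(m.group(2)))
            else:
                variants[head].add(1)    # the base entry counts as variant 1
        for word, nums in variants.items():
            expected = set(range(1, len(nums) + 1))
            if nums != expected:
                print(f"{word}: got {sorted(nums)}, expected {sorted(expected)}")

    # An explicit (1) or a gap before (4) would both be flagged:
    check_variant_numbering([
        'read  R EH D',
        'read(2)  R IY D',
        'read(4)  R IY D',   # gap: (3) is missing
    ])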

So basically, to check any fact you need to be very careful and double- or triple-check everything. Bugs are everywhere: in language model training, the decoder, the trainer, the configuration. From run to run, bugs can lead to different results, and even a small change can break everything. I think the optimal way to do research would be to check the same proposition in independent teams using independent decoders and probably different data. I'm not sure if that's doable in the short term.