Building a Generic Language Model

I recently spent some time building a language model from the open Gutenberg texts; it has been released today:

Unfortunately, it turned out to be very hard to build a model that is relatively "generic". Language models are very domain-dependent; it is almost impossible to build a single language model that works well for every possible text. Books are almost useless for conversational transcription, no matter how much book text you have. And you need terabytes of data to reduce the error rate by just 1%.

Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more or less widely available, so it will be possible to extend and improve the model in the future.

I found this problem of domain dependence quite interesting to work on. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:


Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model
trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...
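To make the perplexity gap concrete, here is a minimal toy sketch (illustrative only, not the pipeline used for the released model): a trigram model with add-one smoothing is estimated on a small "bookish" corpus and then evaluated both in-domain and on a conversational-style phrase. Even at this tiny scale, the out-of-domain perplexity is clearly worse.

```python
from collections import defaultdict
import math

def train_trigrams(tokens):
    """Count trigrams and their bigram contexts."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for i in range(len(tokens) - 2):
        ctx = (tokens[i], tokens[i + 1])
        tri[ctx + (tokens[i + 2],)] += 1
        bi[ctx] += 1
    return tri, bi, set(tokens)

def perplexity(tokens, tri, bi, vocab):
    """Perplexity under an add-one-smoothed trigram model."""
    V = len(vocab)
    log_prob, n = 0.0, 0
    for i in range(len(tokens) - 2):
        ctx = (tokens[i], tokens[i + 1])
        w = tokens[i + 2]
        p = (tri[ctx + (w,)] + 1) / (bi[ctx] + V)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy "domains": bookish prose vs. conversational speech.
books = ("the old man walked slowly to the village and "
         "the old man spoke softly to the crowd").split()
speech = ("yeah i mean you know i was like "
          "yeah you know what i mean").split()

tri, bi, vocab = train_trigrams(books)
in_domain = perplexity(books, tri, bi, vocab)
out_domain = perplexity(speech, tri, bi, vocab)
# out_domain comes out much higher than in_domain: almost every
# conversational trigram is unseen in the book training data.
```

This is exactly the effect the survey describes, just in miniature: the smoothing mass is all the model has for the mismatched domain.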
Recent advances in language models for speech recognition include discriminative language models, which are impractical to build unless you have unlimited processing power.

And recurrent neural-network language models, as implemented by the RNNLM toolkit:

Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, JP.

RNNLMs are becoming very popular, and they can provide significant gains in perplexity and decoding WER, but I don't believe they can solve the domain-dependency issue and enable us to build a truly generic model.

So, research on the subject is still needed. Perhaps, once the domain-dependency variance can be properly captured, a far more accurate language model could be built.
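One classic way to capture at least part of that variance (not something the released model does; just a sketch of the idea) is linear interpolation of domain-specific models: each domain gets its own model, and a mixture weight tuned on held-out text decides how much each contributes. The probabilities below are made-up numbers standing in for two domain models scoring a conversational test phrase.

```python
import math

def interpolate(p_book, p_conv, lam):
    """Linear interpolation of two domain models' word probabilities."""
    return lam * p_book + (1 - lam) * p_conv

def ppl(probs):
    """Perplexity from a list of per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-word probabilities assigned by a book-trained model
# and a conversation-trained model to a short conversational phrase.
p_book = [0.0001, 0.001, 0.0005]
p_conv = [0.02, 0.05, 0.01]

# Weight 1.0 means "book model only"; 0.3 leans toward the
# conversational model, which suits the test phrase better.
ppl_book_only = ppl([interpolate(pb, pc, 1.0) for pb, pc in zip(p_book, p_conv)])
ppl_mixed = ppl([interpolate(pb, pc, 0.3) for pb, pc in zip(p_book, p_conv)])
```

The interesting research question is how to choose (or adapt) that weight automatically per utterance, which is where a truly generic model would have to go beyond a fixed mixture.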