A very simple but very important point about properly modeling language

If I were a scientific advisor, I would give my student the following problem:

Take a text, take an LM, compute perplexity:

file test.txt: 107247 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.06198e+06 ppl= 158.32 ppl1= 216.345
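The ppl and ppl1 numbers above follow directly from the reported logprob. SRILM's ppl divides the (base-10) logprob by the number of scored tokens, which counts one sentence-end token per sentence and excludes OOVs and zeroprobs; ppl1 uses words only. A minimal sketch, plugging in the numbers from the line above:

```python
# SRILM ngram -ppl statistics reported for test.txt
sentences = 107247
words = 1.7608e6
oovs = 21302
zeroprobs = 0
logprob = -4.06198e6  # total log10 probability

# ppl: words plus one </s> per sentence, minus OOVs and zeroprobs
denom = words - oovs - zeroprobs + sentences
ppl = 10 ** (-logprob / denom)

# ppl1: the same, but without the sentence-end tokens
ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))

print(ppl, ppl1)  # close to the reported 158.32 and 216.345
```

The reconstruction agrees with the reported values to rounding, which confirms the two files are scored consistently.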

Join every two lines in text:
awk 'NR%2{printf "%s ",$0;next}{print;}' test.txt > testjoin.txt
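For readers who prefer it spelled out, a rough Python equivalent of the awk one-liner (the function name is my own; an odd trailing line is kept as-is, matching the awk script's behavior of emitting the leftover line):

```python
def join_pairs(lines):
    """Join every two consecutive lines with a single space."""
    out = []
    for i in range(0, len(lines), 2):
        out.append(" ".join(lines[i:i + 2]))
    return out

print(join_pairs(["a b", "c d", "e f"]))  # -> ['a b c d', 'e f']
```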

Test again:
file testjoin.txt: 53624 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.05859e+06 ppl= 183.409 ppl1= 215.376
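The same arithmetic reproduces the joined-file numbers. Note where the change comes from: the total logprob and word count are nearly identical, but joining lines halves the sentence count, so ppl's denominator shrinks while ppl1 (computed over words only) barely moves:

```python
# SRILM statistics reported for testjoin.txt
sentences = 53624          # half as many sentences after joining
words = 1.7608e6           # same word count as before
oovs = 21302
logprob = -4.05859e6       # total log10 probability, almost unchanged

ppl = 10 ** (-logprob / (words - oovs + sentences))
ppl1 = 10 ** (-logprob / (words - oovs))
print(ppl, ppl1)  # close to the reported 183.409 and 215.376
```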

This is a really serious issue for decoding conversational speech: the perplexity rose from 158 to 183, and in real-life cases it gets even worse. WER degrades accordingly. Utterances very often contain several sentences, and it's really crazy that our models can't handle that properly.
