Magic Words of Interspeech 2011

Interspeech 2011 is coming, and I suppose it is going to be an amazing event. If you are interested in what is going on there, let's figure it out.

To keep things simple, we will use Unix command-line tools. Text processing can be fun even with simple commands. Text is still the most convenient form of presenting information, way better than HTML or databases. Of course, more advanced things like stopword filtering or named entity recognition are lacking. Let's hope that one day the Unix command line will have them.

1. Download the full printable programs of Interspeech 2010 and Interspeech 2011 with wget, dump them to text with lynx, and clean up the punctuation with sed.

2. Dump word counts with the SRILM tool ngram-count and cut the 1000 most frequent words from the 2011 list with sort and head. Keep all words in the 2010 list.

3. Figure out which words in the 2011 list are new and do not appear in the 2010 list with sort and uniq.
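
The three steps above can be sketched roughly as follows. This is not the exact set of commands I ran, just a minimal version under a few assumptions: the program URLs are placeholders, the function names (clean_words, top_words, new_words) are made up for illustration, and plain "sort | uniq -c" stands in for ngram-count so that nothing beyond standard tools is needed.

```shell
#!/bin/sh
# Step 1 (not run here): fetch each program and dump it to plain text.
# The URL is a placeholder, not the real program address.
#   wget -O 2011.html "http://example.org/interspeech2011/program.html"
#   lynx -dump 2011.html > 2011.txt

# Step 1, continued: lowercase the text, turn punctuation into spaces and
# print one word per line. Hyphens are kept so that terms like "i-vector"
# survive; lines that are empty or consist only of hyphens are dropped.
clean_words() {
    tr '[:upper:]' '[:lower:]' |
        sed 's/[^a-z0-9-]/ /g' |
        tr -s ' ' '\n' |
        sed '/^-*$/d'
}

# Step 2: print the N most frequent words, most frequent first.
# Assumption: "sort | uniq -c" replaces SRILM's ngram-count here.
top_words() {
    sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$1" | awk '{print $2}'
}

# Step 3: print the words of list $1 that do not appear in list $2.
# The second list is fed in twice, so none of its words can have a count
# of one, and "uniq -u" keeps only the words unique to the first list.
new_words() {
    { sort -u "$1"; cat "$2" "$2"; } | sort | uniq -u
}
```

With both programs dumped to 2010.txt and 2011.txt, something like "clean_words < 2011.txt | top_words 1000 > 2011.top" and "clean_words < 2010.txt | sort -u > 2010.all" followed by "new_words 2011.top 2010.all" would print the new words.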

Surprisingly, there are only 2 new words: i-vector and crowdsourcing.
