Some more optimization

In addition to the previous post, here are two more tricks for log_diag_eval.

Floats instead of doubles

If the accumulator is a float, SSE can be used more effectively: a 128-bit SSE register holds four floats but only two doubles.
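To make the difference concrete, here is a minimal sketch of the two variants. The function names, the signature and the loop body (a weighted squared difference subtracted from a norm) are my assumptions for illustration. This is not the actual SphinxTrain source.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the inner loop, not the SphinxTrain source. */

/* Double accumulator: the difference is widened to double and the
 * rest of the arithmetic runs in double. */
double eval_acc_double(const float *obs, const float *mean,
                       const float *var_fact, float norm, size_t veclen)
{
    double d = norm;
    for (size_t i = 0; i < veclen; i++) {
        double diff = obs[i] - mean[i];
        d -= var_fact[i] * diff * diff;
    }
    return d;
}

/* Float accumulator: everything stays in single precision, so no
 * float-to-double conversions are needed inside the loop. */
float eval_acc_float(const float *obs, const float *mean,
                     const float *var_fact, float norm, size_t veclen)
{
    float d = norm;
    for (size_t i = 0; i < veclen; i++) {
        float diff = obs[i] - mean[i];
        d -= var_fact[i] * diff * diff;
    }
    return d;
}
```

Two caveats. GCC usually needs something like -ffast-math before it will vectorize the sum itself, because reordering floating-point additions can change the result. And a float accumulator loses some precision, so it is worth checking that the scores stay close enough to the double version.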

Hardcode vector length

The most common optimization is loop unrolling. It helps to optimize memory access and eliminates jump instructions. The issue here is that the number of iterations in log_diag_eval can differ between stages. GCC has an interesting profile-based optimization for this case, see the -fprofile-generate option. You build an instrumented program and run it, and GCC can then derive a few specific optimizations from the collected profile when you rebuild with -fprofile-use. The good point is that we can be almost sure of the usage patterns of our target loop, so we can optimize without profiling. So, turn

for (i = 0; i < veclen; i++) {
    do work
}

into

if (veclen == 40) { // Commonly used value, 40 floats in each frame
    for (i = 0; i < 40; i++) {
        do work // This will be unrolled
    }
} else {
    for (i = 0; i < veclen; i++) {
        do work
    }
}

GCC does the same trick with the profiler, but since our feature frame size is fixed, we can hardcode it. As a result, GCC will unroll the first loop and it will be as fast as the wind.
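Here is a compilable sketch of the hardcoded-length trick. The function name, the FRAME_LEN constant and the loop body are assumptions for illustration only. The real log_diag_eval may differ in its details.

```c
#include <stddef.h>

/* Assumed common frame size: 40 floats per feature frame. */
#define FRAME_LEN 40

/* Hypothetical stand-in for log_diag_eval with a specialized fast path. */
float eval_hardcoded(const float *obs, const float *mean,
                     const float *var_fact, float norm, size_t veclen)
{
    float d = norm;

    if (veclen == FRAME_LEN) {
        /* Trip count is a compile-time constant here, so the
         * compiler is free to unroll this loop. */
        for (size_t i = 0; i < FRAME_LEN; i++) {
            float diff = obs[i] - mean[i];
            d -= var_fact[i] * diff * diff;
        }
    } else {
        /* Generic fallback for any other vector length. */
        for (size_t i = 0; i < veclen; i++) {
            float diff = obs[i] - mean[i];
            d -= var_fact[i] * diff * diff;
        }
    }
    return d;
}
```

Both branches compute the same thing, so the result does not depend on which one runs. Whether GCC really unrolls the first loop depends on the optimization level and flags such as -funroll-loops, so it is worth looking at the generated assembly to confirm.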

Optimization in SphinxTrain

I spend quite a significant amount of time training various models. It feels like alchemy: you add this and tune that, and you get nice results. And while training you can read Twitter ;) I have also spent 10 years in a group that builds optimizing compilers, so in theory I should know a lot about them. I rarely apply that in practice, though. But when you are bored by several weeks of training, you can apply some of that knowledge here.