So the algorithm:

1) Train a model for a month and become bored

2) Get an idea that SphinxTrain is compiled without optimization

3) Go to SphinxTrain/config and change the compilation option from -O2 to -O3

4) Measure the run time of a simple bw run with the time command

5) See that the run time doesn't really change

6) Add the -pg option to CFLAGS and LDFLAGS to collect a profile

7) See that most of the time is spent in the log_diag_eval function, which is a simple weighted dot-product computation

8) Look at the assembly code of log_diag_eval:

```
0x42c3b0 log_diag_eval:      unpcklps %xmm0,%xmm0
0x42c3b3 log_diag_eval+3:    test     %ecx,%ecx
0x42c3b5 log_diag_eval+5:    cvtps2pd %xmm0,%xmm0
0x42c3b8 log_diag_eval+8:    je       0x42c3fd log_diag_eval+77
0x42c3ba log_diag_eval+10:   sub      $0x1,%ecx
0x42c3bd log_diag_eval+13:   xor      %eax,%eax
0x42c3bf log_diag_eval+15:   lea      0x4(,%rcx,4),%rcx
0x42c3c7 log_diag_eval+23:   nopw     0x0(%rax,%rax,1)
0x42c3d0 log_diag_eval+32:   movss    (%rdi,%rax,1),%xmm1
0x42c3d5 log_diag_eval+37:   subss    (%rsi,%rax,1),%xmm1
0x42c3da log_diag_eval+42:   unpcklps %xmm1,%xmm1
0x42c3dd log_diag_eval+45:   cvtps2pd %xmm1,%xmm2
0x42c3e0 log_diag_eval+48:   movss    (%rdx,%rax,1),%xmm1
0x42c3e5 log_diag_eval+53:   add      $0x4,%rax
0x42c3e9 log_diag_eval+57:   cmp      %rcx,%rax
0x42c3ec log_diag_eval+60:   cvtps2pd %xmm1,%xmm1
0x42c3ef log_diag_eval+63:   mulsd    %xmm2,%xmm1
0x42c3f3 log_diag_eval+67:   mulsd    %xmm2,%xmm1
0x42c3f7 log_diag_eval+71:   subsd    %xmm1,%xmm0
0x42c3fb log_diag_eval+75:   jne      0x42c3d0 log_diag_eval+32
0x42c3fd log_diag_eval+77:   repz retq
```

9) Understand that the code isn't as good as it could be: the loop processes one scalar element per iteration

10) Run

```
gcc -DPACKAGE_NAME=\"SphinxTrain\" -DPACKAGE_TARNAME=\"sphinxtrain\" \
    -DPACKAGE_VERSION=\"1.0.99\" -DPACKAGE_STRING=\"SphinxTrain\ 1.0.99\" \
    -DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \
    -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 \
    -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_LIBM=1 \
    -I/home/nshmyrev/SphinxTrain/../sphinxbase/include \
    -I/home/nshmyrev/SphinxTrain/../sphinxbase/include -I../../../include -O3 \
    -g -Wall -fPIC -DPIC -c gauden.c -o obj.x86_64-unknown-linux-gnu/gauden.o \
    -ftree-vectorizer-verbose=2
```

to see that the log_diag_eval loop isn't vectorized

11) Add -ffast-math and see that it doesn't help

12) Rewrite the function from

```c
float64
log_diag_eval(vector_t obs, float32 norm, vector_t mean,
              vector_t var_fact, uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = norm; /* log (1 / 2 pi |sigma^2|) */
    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        /* compute -1 / (2 sigma ^ 2) * (x - m) ^ 2 terms */
        d -= var_fact[l] * diff * diff;
    }
    return d;
}
```

to

```c
float64
log_diag_eval(vector_t obs, float32 norm, vector_t mean,
              vector_t var_fact, uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = 0.0;
    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        /* compute -1 / (2 sigma ^ 2) * (x - m) ^ 2 terms */
        d += var_fact[l] * diff * diff;
    }
    return norm - d; /* log (1 / 2 pi |sigma^2|) */
}
```

to turn the in-loop subtraction, which GCC's vectorizer fails to handle as a reduction, into a plain sum accumulation.

13) See that the loop is now vectorized. Enjoy the speed!
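The rewrite can be sanity-checked in isolation. A minimal self-contained sketch, with local typedefs standing in for SphinxTrain's headers and the invented names log_diag_eval_sub/log_diag_eval_acc distinguishing the two variants:

```c
#include <math.h>

typedef float        float32;   /* stand-ins for SphinxTrain's prim_type.h */
typedef double       float64;
typedef unsigned int uint32;
typedef float32     *vector_t;

/* Original variant: subtracts from the accumulator inside the loop. */
static float64
log_diag_eval_sub(vector_t obs, float32 norm, vector_t mean,
                  vector_t var_fact, uint32 veclen)
{
    float64 d = norm;
    uint32 l;
    for (l = 0; l < veclen; l++) {
        float64 diff = obs[l] - mean[l];
        d -= var_fact[l] * diff * diff;
    }
    return d;
}

/* Rewritten variant: accumulates a plain sum, subtracts once at the end. */
static float64
log_diag_eval_acc(vector_t obs, float32 norm, vector_t mean,
                  vector_t var_fact, uint32 veclen)
{
    float64 d = 0.0;
    uint32 l;
    for (l = 0; l < veclen; l++) {
        float64 diff = obs[l] - mean[l];
        d += var_fact[l] * diff * diff;
    }
    return norm - d;
}

/* The two variants must agree; on small exact inputs they match bit-for-bit. */
static int
variants_agree(void)
{
    float32 obs[4]      = {1.0f, 2.0f, 3.0f, 4.0f};
    float32 mean[4]     = {0.0f, 1.0f, 1.0f, 2.0f};
    float32 var_fact[4] = {1.0f, 0.5f, 2.0f, 1.0f};
    float64 a = log_diag_eval_sub(obs, 10.0f, mean, var_fact, 4);
    float64 b = log_diag_eval_acc(obs, 10.0f, mean, var_fact, 4);
    return fabs(a - b) < 1e-9;
}
```

With general floating-point data the two variants can differ in the last bits, since the order of rounding changes; for a likelihood computation that difference is harmless, which is what makes the rewrite safe.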

The key thing to understand here is that programming is rather flexible and compilers are rather dumb, but you have to cooperate: use very simple constructs to let the compiler do its work. Moreover, this idea of using simple constructs has other benefits, since it helps keep the code style clean and enables automated static analysis with tools like splint.

Maybe the same applies to speech recognition. We need to help computers in their efforts to understand us. Speak slowly and articulate clearly, and both we and the computers will enjoy the result.

If you are interested in loop vectorization in GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html
