Slow Instructions in Simple Loop on x86

I have a simple loop which I've written in C++ as I wanted to profile the performance of a multiply instruction on my CPU. I found some interesting nuances in the assembly code that was generated when I profiled it.

Here is the C++ program:

#define TESTS 10000000
#define BUFSIZE 1000
uint32_t buf_in1[BUFSIZE];
uint32_t buf_in2[BUFSIZE];
uint32_t volatile buf_out[BUFSIZE];

unsigned int i, j;

for (i = 0; i < BUFSIZE; i++) {
    buf_in1[i] = i;
    buf_in2[i] = i;
}

for (j = 0; j < TESTS; j++) {
    for (i = 0; i < BUFSIZE; i++) {
        buf_out[i] = buf_in1[i] * buf_in2[i];
    }
}

I compiled with the following flags:

Optimization: 优化

Code Generation:

代码生成

It's compiled in visual studio 2012 under Win32 although I am running it on a 64 bit machine.

Note the volatile qualifier on buf_out. It's just to stop the compiler from optimising the loop away.

I ran this code through a profiler (AMD's CodeXL) and I see that the multiplication instruction doesn't take up the majority of the CPU time. About 30% is taken up by the imul instruction, but around 60% is also spent on two other instructions:

探查

Note that the Timer column shows the number of timer ticks during which the profiler found the code on this instruction. The timer tick is 1ms so 2609 ticks is approximately 2609ms spent on that instruction.

The two instructions other than the multiply instruction which are taking up a lot of time are a mov instruction and the jb (jump if condition is met) instruction.

The mov instruction,

mov [esp+eax+00001f40h],ecx

is moving the result of the multiply (ecx) back into the buffer buf_out buffer at eax (which is the register representing i). This makes sense, but why does it take so much longer to do this than the other mov instruction? Ie this one:

mov ecx,[esp+eax+00000fa0h]

They both read from similar locations in memory, the arrays are 1000 uint32_t's long or 4000 bytes long. That's 4000*3 = 12kB. My L1 cache is 64kB so it should all easily fit in L1 as far as I can see...

Here are results showing my cache sizes etc. from Coreinfo:

Coreinfo

As for the jump instruction:

jb $-1ah (0x903732)

I can't tell why it's taking up 33% of the program execution time either. My processor line size is 64 bytes and the jump only jumps backwards 0x1A bytes or 26 bytes. Could it be because this jump crosses a 64-byte boundry? (0x903740 is a 64 byte boundary)

So can anyone explain these behaviours?

Thanks.


As mentioned by Mystical, the timings you are looking at are not one to one the responsibility of the instructions it is shown against.

Modern processors run many instructions in parallel (the imul and the add 4 to eax can both run in parallel, also the math involved in the mov addressing uses the ALU too and can be computed before the imul completes).

The way most profilers compute their timing is by using timed interrupts and what you see are the instructions that happened to be the ones executed at the time of the interrupts.

To properly use a profiler, you want to run against large programs and see whether the program spends a lot of time. On a per instruction basis, it does not have much value.

If you really want to do speed tests, you want to use the CPU timer before and after your loops and see how you can ameliorate it one way or another to get it to run faster.


I wouldn't assume that it all fits in your L1, because the code you're debugging isn't the only thing using the CPU (unless you've booted your machine to run that code, which in fact will be your operating system).

Also note that there's a pattern there: The slowest operations all are requiring main memory access. Since this access time isn't controlled by the CPU, it's difficult to point out why isn't it faster. That will require hardware analysis.

Hope this helps.


Unfortunately, you have not given the amount of time needed for a single pass through your loop, but I assume that it's three CPU cycles. If that is true, the three instructions that happen to get time attributed to them are the three instructions which the processor is officially on when the clock ticks. The other three instructions are executed in parallel to the three officially time consuming instructions, hiding behind them.

链接地址: http://www.djcxy.com/p/28782.html

上一篇: 如何将AVX512寄存器zmm26中的QuadWord写入rax寄存器?

下一篇: x86中简单循环中的慢指令