Optimize this assembly code

I'm taking a Computer Architecture course right now, and we're going over basic R-type and I-type instructions (also, this is a RISC architecture), etc. I can't seem to figure out how to optimize this code.

Explanation: This code adds words in an array (pointed to by $s1) of numbers until a zero is reached. The result is stored in $t1. $t0 holds the current word.

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        beq   $t1, $t1, again        # Loop again
done:
        nop                          # Do nothing

I'm having a difficult time optimizing the code. I feel the beq $t1, $t1, again (since it's always true) is unnecessary, but I'm not sure how to remove it. Here's my attempt, but I now realize that my code would not terminate.

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        bne   $t1, $zero, again      # If result is not zero, loop
done:
        nop                          # Do nothing

I'm never checking for a terminating zero and jumping to done. But if I add another check, then wouldn't the code be the same as before?


Typically you want to convert a test at the top loop into a test at the bottom loop. To do that, you frequently have to jump into (more or less) the middle of the loop body for the first iteration. In pseudo code, what you have right now is basically:

    initialize sum
beginning:
    load a word
    if (done) goto end
    add to sum
    increment pointer
    goto beginning
end:

to optimize that, we want to change the structure to something like this:

    initialize sum
    goto start_loop

beginning:
    add to sum
    increment pointer
start_loop:
    load a word
    if (!done) goto beginning

This way there's only one jump per loop instead of two (and it's a short backward jump, so the target will nearly always be in the cache).

That said, I should add that this optimization is really mostly obsolete -- with decent branch prediction, an unconditional jump is usually essentially free.

Edit: since loop unrolling has been mentioned, I'll add my two cents about it. Branch prediction generally renders loop unrolling obsolete, unless you can use it to execute more instructions in parallel. That's not an issue here, but often useful in real life. For example, if we assume a CPU with two separate adders available, we can unroll two iterations of the loop, and add the results from those iterations into two separate target registers, so we take advantage of both adders by adding two values at the same time. Then, when the loop terminates, we add together those two registers to get the final value. In C-like psuedo-code, that would come out something like this:

odds = 0
evens = 0

do {   
    evens += pointer[0];
    odds += pointer[1];
    pointer += 2;
while (pointer[0] && pointer[1]);
total = odds + evens;

As written, this adds the minor extra requirements for two consecutive zeros to terminate the sequence instead of just one, and a minimum of two items in the array to be added. Note, however, that it's not really the loop unrolling that gives the major advantage here, but the parallel execution.

Absent that, we really only see a benefit from loop unrolling if a branch that's not taken is cheaper than a branch that is taken (even if both are predicted correctly). On older processors (eg, older Intels) taking a branch did carry a penalty relative to a non-taken branch. At the same time, the unrolled loop will use more cache space. Thus, on a modern processor, it's frequently an overall loss (unless, as I said before, we can use the unrolling to gain parallelism).


展开你的循环:


        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        # and so on to a reasonable amount, 4-8 times are common.

        b     again        # Loop again
done:
        nop        
链接地址: http://www.djcxy.com/p/10356.html

上一篇: 时间轴上的视频文章不能内嵌播放

下一篇: 优化此汇编代码