Optimize and rewrite the following C code

2018-06-03 05:46:56

This is a textbook question that involves rewriting some C code to make it perform best on a given processor architecture.

Given: targeting a single superscalar processor with 4 adders and 2 multiplier units.

Input structure (initialized elsewhere):

struct s {
    short a;
    unsigned v;
    short b;
} input[100];

Here is the routine to operate on this data. Obviously correctness must be ensured, but the goal is to optimize the crap out of it.

int compute(int x, int *r, int *q, int *p) {

    int i;
    for(i = 0; i < 100; i++) {

        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }

    return i;
}

So this method has 3 arithmetic instructions to update the integers r, q, p.

Here's my attempt with comments explaining what I'm thinking:

//Use temp variables so we don't keep using loads and stores for mem accesses; 
//hopefully the temps will just be kept in the register file
int r_temp = *r;
int q_temp = *q;

for (i=0;i<99;i = i+2) {
    int data1 = input[i];
    int data2 = input[i+1]; //going to try partially unrolling loop
    int a1 = data1.a;
    int a2 = data2.a;
    int b1 = data1.b;
    int b2 = data2.b;
    int v1 = data1.v;
    int v2 = data2.v;

    //I will use brackets to make my intention clear the order of operations I was planning
    //with respect to the functional (adder, mul) units available

    //This is calculating the next iteration's new q value 
    //from q += v1 + a1 + b1, or q(new)=q(old)+v1+a1+b1

    q_temp = ((v1+q1)+(a1+b1)) + ((a2+b2)+v2);
    //For the first step I am trying to use a max of 3 adders in parallel, 
    //saving one to start the next computation

    //This is calculating next iter's new r value 
    //from r *= v1 + x, or r(new) = r(old)*(v1+x)

    r_temp = ((r_temp*v1) + (r_temp*x)) + (v2+x);
}
//Because i will end on i=98 and I only unrolled by 2, I don't need to 
//worry about final few values because there will be none

*p = input[99].v; //Why it's in the loop I don't understand, this should be correct
*r = r_temp;
*q = q_temp;

Ok so what did my solution give me? Looking at the old code, each loop iteration of i will take the minimum latency of max((1A + 1M),(3A)) where the former value is for calculating the new r while the latency of 3 Adds is for q.

In my solution, I am unrolling by 2 and trying to calculate the 2nd new value of r and q. Assuming the the latency of adders/multipliers is such that M = c*A (c is some integer > 1), the multiply operations for r are definitely the rate-limiting step, so I focus on that. I tried to use the multipliers in parallel as much as I could.

In my code, 2 multipliers are used at first in parallel to help calculate intermediate steps, then an add must combine those, then a final multiply is used for obtaining the last result. So for 2 new values of r (even though I only keep/care about the last one), it takes me (1M // 1M // 1A) + 1A + 1M, for a total latency of 2M + 1M sequentially. Dividing by 2, my latency per loop value is 1M + 0.5A . I calculate the original latency/value (for r) to be 1A + 1M. So if my code is correct (I did this all by hand, haven't tested yet!) then I have a small performance boost.

Also, hopefully by not accessing and updating pointers directly in the loop as much (thanks to temp variables r_temp and q_temp mainly), I save on some load/store latency.

That was my stab at it. Definitely interested in seeing what others come up with that's better!

Yes, it is possible to leverage the two shorts. Rearrange your struct as so

struct s {
    unsigned v;
    short a;
    short b;
} input[100];

and you might be able to get better alignment of the memory fields on your architecture, which might allow more of these structs to lie in the same memory page, which might allow you to encounter fewer memory page faults.

It's all speculative, that's why it is so important to profile.

If you have the right architecture, a rearrangement will give you better data structure alignment, which results in higher data density within the memory as fewer bits are lost to necessary padding to ensure type alignment with the data boundaries imposed by common memory architectures.

链接地址: http://www.djcxy.com/p/11156.html

上一篇: python3超级不适用于PyQt类

下一篇: 优化并重写下面的C代码