Optimizing C loops to get diagonal of array

2018-06-18 03:31:52

The Great God Google has not been forthcoming to me with an explanation for some loop optimization issues. So, in sadness that I have insufficient Google-fu, I turn to you StackOverflow.

I'm optimizing a C program for solving a specific system of differential equations. During the course of finding the numerical solution I call a function that sets up a linear system of equations and then a function that solves it.

The solution function originally had a bottleneck during the access of the elements on the diagonal of the array that defines the linear system. So I included a 1-D array that is set during the initialization of the system that holds the values along the diagonal of the array.

For fun I continued to play with the code that initialized the diagonal elements, measuring the time it took and trying to continually improve the code. The versions I tried led to a few questions:

Note: I put all the versions I tried into one function and profiled this function to see where the time was being spent. I'll report execution time for a version in percent of total time in function. The function was evaluated several million times. Smaller number is better.

Relevant declarations of data used in code:

/* quick definitions of the relevant variables, data is a struct */

static const int sp_diag_ind[98] = {2,12,23,76,120,129,137,142,.../* long list */};

double *spJ = &(data->spJ[0]);
/* data has double spJ[908] that represents a sparse matrix stored in triplet
*  form, I grab the pointer because I've found it to be more 
*  efficient than referencing data->spJ[x] each time I need it
*/

int iter,jter;
double *diag_data = NV_DATA_S(data->J_diag);
/* data->J_diag has a content field that has an array double diag_data[150]
*  NV_DATA_S is a macro to return the pointer to the relevant data
*/

My original loop for initializing diag_data. Timing was 16.1% of evaluation (see note).

/* try 1 */
for (iter = 0; iter<3; iter++) {
    diag_data[iter] = 0; 
}
jter = 0;
for (iter = 3; iter<101; iter++) { // unaligned loop start
    diag_data[iter] = spJ[sp_diag_ind[jter]];
    jter++; // heavy line for loop
}

for (iter = 101; iter<150; iter++) {
    diag_data[iter] = 0; 
}

To summarize, we grab the pointer to the diagonal, set the some components to zero (this is not optional based on the algorithm I'm using), then grab the values that reside on the diagonal of the "array" represented in sparse form by spJ. Since spJ is a 1-D array of the 908 non-zeros of a (mostly zero) 150x150 array we must use a lookup to find the positions of the diagonal elements in spJ. This lookup is defined by the 98 element array sp_diag_ind.

I tried to remove the use of jter because it showed up as not free to increment. The middle loop of my second try:

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This improved things a bit. Timing was 15.6% for this version. But when I look at the Shark analysis of this code (tool that comes with XCode on a Mac), it warns me that this is an unaligned loop.

The third attempt to improve was by removing the "zeroing" loops and using memset to zero diag_data:

memset(diag_data, '', sizeof(diag_data));

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This was timed at 14.9%. Not being sure what an unaligned loop is, I continued to fiddle. I found an improved, fourth implementation , doing the alignment offset between diag_data and spJ[crazy index] with a pointer:

realtype * diag_mask = &diag_data[3];
for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_mask[iter] = spJ[sp_diag_ind[iter]];
}

Using diag_mask allowed for a small improvement in speed. It came in at 13.1%.

Edit: Turns out this section was sillier than I had originally thought. The use of iter is undefined. Props to @caf and @rlibby for catching it.

Lastly, I then I tried something I thought was silly:

memset(diag_data, '', sizeof(diag_data));

for (iter = 0; iter<98;) {
    diag_mask[iter] = spJ[sp_diag_ind[iter++]];
}

This was timed at 10.9%. Also, Shark doesn't issue the unaligned loop warning when I look at the annotated source code. End silly section

So, my questions:

What's an unaligned loop?

Why is the fifth implementation aligned and the fourth not?

Is having an aligned loop responsible for the improvement in execution speed between my fourth and fifth implementations, or is embedding the increment step in the lookup of the value of sp_diag_ind responsible?

Do you see any other improvements I can make?

Thanks for the help.

--Andrew

An unaligned loop is one where the first instruction doesn't start on a particular boundary (multiple of 16 or 32). There should be a compiler flag to align loops; it may or may not help performance. Whether a loop is aligned or not without the flag is based on exactly which instructions come before it, so it's not predictable. One other optimization you could try is to mark diag_mask , spJ , and sp_diag_ind as restrict (a C99 feature). That indicates that they do not alias and might help the compiler to optimize the loop better. A count of 98 will probably be too small to see any effect, though.

Your fifth version is incorrect - it has undefined behaviour, because it both modifies iter and references its value, for a purpose other than calculating the new value, without an intervening sequence point.

Have you tried storing the actual values of the diagonals, rather than their indexes within spJ , at the point that you calculate sp_diag_ind[] ? Then you could just copy them directly into diag_data (or, even better, use the vector of diagonals directly).

The relevant part of the C standard is §6.5 Expressions:

'2. Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.

This applies to the object iter in your expression. Violation of a "shall" constraint is undefined behaviour.

gcc (tested with version 4.4.5) even warns about your expression:

x.c:16: warning: operation on ‘iter’ may be undefined

Do you see any other improvements I can make?

You're tuning the daylights out of something that uses about 11% of the time. Is there nothing in the other 89% that could be optimized?

链接地址: http://www.djcxy.com/p/51204.html

上一篇: 为Python扩展设置C / C ++编译器

下一篇: 优化C循环以获得数组的对角线