How do I write a loop in inline assembly with Xcode LLVM?
I'm studying inline assembly and want to write a simple routine for the iPhone using Xcode 4 with the LLVM 3.0 compiler. I have managed to write basic inline assembly code.
Example:
int sub(int a, int b)
{
    int c;
    asm ("sub %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
    return c;
}
I found this on stackoverflow.com and it works very well. However, I don't know how to write a loop.
I need assembly code equivalent to:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity)
{
    for(int i=0; i<numPixels; i++)
    {
        dst[i] = src[i] + intensity;
    }
}
Take a look here at the loop section - http://en.wikipedia.org/wiki/ARM_architecture
Basically you'll want something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
    asm volatile (
        "\t mov r3, #0\n"            // r3 = i = 0
        "Lloop:\n"
        "\t cmp r3, %2\n"            // i >= numPixels?
        "\t bge Lend\n"              // yes: leave the loop
        "\t ldrb r4, [%0, r3]\n"     // r4 = src[i]
        "\t add r4, r4, %3\n"        // r4 += intensity
        "\t strb r4, [%1, r3]\n"     // dst[i] = low byte of r4
        "\t add r3, r3, #1\n"        // i++
        "\t b Lloop\n"
        "Lend:\n"
        : "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
        : "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
        : "cc", "memory", "r3", "r4");   // "memory": the asm reads src and writes dst
}
Update:
And here's that NEON version:
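void brighten_neon(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
    asm volatile (
        "\t mov r4, #0\n"               // r4 = i = 0
        "\t vdup.8 d1, %3\n"            // d1 = intensity copied into 8 byte lanes
        "Lloop2:\n"
        "\t cmp r4, %2\n"               // i >= numPixels?
        "\t bge Lend2\n"                // yes: leave the loop
        "\t vld1.8 {d0}, [%0]!\n"       // load 8 pixels, src += 8
        "\t vqadd.s8 d0, d0, d1\n"      // saturating add on 8 lanes (s8 = signed)
        "\t vst1.8 {d0}, [%1]!\n"       // store 8 pixels, dst += 8
        "\t add r4, r4, #8\n"           // i += 8
        "\t b Lloop2\n"
        "Lend2:\n"
        : "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
        : "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
        : "cc", "memory", "r4", "d1", "d0");   // "memory": the asm reads src and writes dst
}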
This NEON version processes 8 pixels at a time. It does not, however, check that numPixels is divisible by 8, so you'd definitely want to handle that yourself; otherwise the last iteration will read and write up to 7 bytes past the end of the buffers. Anyway, it's just a start at showing you what can be done. Notice that the loop has the same number of instructions as before, but acts on eight pixels of data at once. Oh, and it's got saturation in there as well, which I assume you would want.
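One way to deal with a numPixels that is not a multiple of 8, sketched in plain C: run the 8-at-a-time routine over the largest multiple of 8 and finish the remaining 0 to 7 pixels with a scalar loop. The names brighten_any, brighten_blocks8 and add_sat_u8 are made up for this sketch. brighten_blocks8 is only a portable stand-in with the same contract as the NEON routine (the count must be a multiple of 8), and clamping to 0..255 is an assumption about the saturation you want.

```c
#include <assert.h>

/* Saturating add of one pixel: clamps the result to 0..255 (assumed behaviour). */
static unsigned char add_sat_u8(unsigned char p, int intensity)
{
    int v = p + intensity;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (unsigned char)v;
}

/* Hypothetical stand-in for an 8-at-a-time kernel.
   Contract: numPixels must be a multiple of 8. */
static void brighten_blocks8(const unsigned char *src, unsigned char *dst,
                             int numPixels, int intensity)
{
    assert(numPixels % 8 == 0);
    for (int i = 0; i < numPixels; i += 8)
        for (int j = 0; j < 8; j++)
            dst[i + j] = add_sat_u8(src[i + j], intensity);
}

/* Hypothetical wrapper: accepts any non-negative numPixels. */
void brighten_any(const unsigned char *src, unsigned char *dst,
                  int numPixels, int intensity)
{
    int bulk = numPixels & ~7;               /* largest multiple of 8 <= numPixels */

    if (bulk > 0)
        brighten_blocks8(src, dst, bulk, intensity);

    for (int i = bulk; i < numPixels; i++)   /* remaining 0..7 pixels */
        dst[i] = add_sat_u8(src[i], intensity);
}
```

On a device you would call the NEON routine where brighten_blocks8 is called here, keeping in mind that the scalar tail has to saturate the same way as the vector code for the output to be consistent.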
This is not a direct answer to your question, but rather general advice about using assembler versus relying on a modern compiler.
You will generally have a hard time beating the compiler at optimizing your C code. Of course, with clever use of knowledge about how your data behaves, you might be able to squeeze out a few percent.
One reason is that modern compilers apply a number of techniques to code like yours, e.g. loop unrolling and instruction reordering to avoid pipeline stalls and bubbles.
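To make "loop unrolling" concrete, here is roughly what the technique looks like when applied by hand to the loop from the question. It is only an illustration, and the name brighten_unrolled4 is made up; an optimizing compiler can typically apply this kind of transformation on its own.

```c
/* Same result as the loop in the question, with the body repeated 4 times
   per iteration so the loop condition is tested once per 4 pixels. */
void brighten_unrolled4(const unsigned char *src, unsigned char *dst,
                        int numPixels, int intensity)
{
    int i = 0;

    for (; numPixels - i >= 4; i += 4) {
        dst[i]     = (unsigned char)(src[i]     + intensity);
        dst[i + 1] = (unsigned char)(src[i + 1] + intensity);
        dst[i + 2] = (unsigned char)(src[i + 2] + intensity);
        dst[i + 3] = (unsigned char)(src[i + 3] + intensity);
    }

    for (; i < numPixels; i++)   /* leftover 0..3 pixels */
        dst[i] = (unsigned char)(src[i] + intensity);
}
```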
If you really want to make that algorithm scream, you should consider redesigning it in C so that you avoid the worst delays. For instance, reading from and writing to memory is expensive compared to register access.
One way of accomplishing this could be to have your code load 4 bytes at a time using an unsigned long, do the math on them in registers, and then write the 4 bytes back in a single store operation.
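A minimal sketch of the 4-bytes-at-a-time idea in portable C. It uses uint32_t rather than unsigned long (whose size varies by platform) and memcpy for the loads and stores to sidestep alignment and aliasing problems; both are assumptions of this sketch, and brighten_swar is a made-up name. The masking keeps a carry in one byte from spilling into its neighbour, so each byte wraps modulo 256 just like dst[i] = src[i] + intensity in the question.

```c
#include <stdint.h>
#include <string.h>

void brighten_swar(const unsigned char *src, unsigned char *dst,
                   int numPixels, int intensity)
{
    /* Copy the low 8 bits of intensity into all four byte lanes. */
    const uint32_t add4 = (uint32_t)(unsigned char)intensity * 0x01010101u;
    const uint32_t low7 = 0x7f7f7f7fu;   /* low 7 bits of every lane */
    const uint32_t high = 0x80808080u;   /* top bit of every lane */
    int i = 0;

    for (; numPixels - i >= 4; i += 4) {
        uint32_t px;
        memcpy(&px, src + i, 4);         /* one 4-byte load */

        /* Add the low 7 bits of each lane (this cannot carry out of a lane),
           then restore the top bit of each lane with XOR. */
        uint32_t sum = ((px & low7) + (add4 & low7)) ^ ((px ^ add4) & high);

        memcpy(dst + i, &sum, 4);        /* one 4-byte store */
    }

    for (; i < numPixels; i++)           /* leftover 0..3 pixels */
        dst[i] = (unsigned char)(src[i] + intensity);
}
```

Whether this actually beats the plain loop depends on the compiler and the target, so it is worth measuring both before keeping either.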
So to recap: make your algorithm work smarter, not harder.