How to allocate 16-byte memory-aligned data
I am trying to apply SSE vectorization to a piece of code, for which I need my 1D array to be 16-byte aligned. However, every method I have tried for allocating 16-byte-aligned memory appears to give me data that is only 4-byte aligned.
I have to work with the Intel icc compiler. This is the sample code I am testing with:
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h> /* declares memalign */
void error(char *str)
{
    printf("Error:%s\n", str);
    exit(-1);
}
int main()
{
    int i;
    //float *A=NULL;
    float *A = (float*) memalign(16, 20*sizeof(float));
    //align
    // if (posix_memalign((void **)&A, 16, 20*sizeof(void*)) != 0)
    //     error("Cannot align");
    for (i = 0; i < 20; i++)
        printf("&A[%d] = %p\n", i, (void *)&A[i]); /* %p expects a void pointer */
    free(A);
    return 0;
}
This is the output I get:
&A[0] = 0x11fe010
&A[1] = 0x11fe014
&A[2] = 0x11fe018
&A[3] = 0x11fe01c
&A[4] = 0x11fe020
&A[5] = 0x11fe024
&A[6] = 0x11fe028
&A[7] = 0x11fe02c
&A[8] = 0x11fe030
&A[9] = 0x11fe034
&A[10] = 0x11fe038
&A[11] = 0x11fe03c
&A[12] = 0x11fe040
&A[13] = 0x11fe044
&A[14] = 0x11fe048
&A[15] = 0x11fe04c
&A[16] = 0x11fe050
&A[17] = 0x11fe054
&A[18] = 0x11fe058
&A[19] = 0x11fe05c
It is 4-byte aligned every time; I have tried both memalign and posix_memalign. Since I am working on Linux, I cannot use _mm_malloc, nor can I use _aligned_malloc. I get a memory corruption error when I try to use _aligned_attribute (which I think is suitable for gcc only).
Can anyone help me reliably allocate 16-byte-aligned data with icc on Linux?
The memory you allocate is 16-byte aligned. See:
&A[0] = 0x11fe010
But in an array of float, each element is 4 bytes, so the second element is only 4-byte aligned.
You can use an array of structures, each containing a single float, with the aligned attribute:
struct x {
    float y;
} __attribute__((aligned(16)));
struct x *A = memalign(...);
The address returned by the memalign function is 0x11fe010, which is a multiple of 0x10, so the function is doing the right thing. This also means that your array is properly aligned on a 16-byte boundary. What you do afterwards is print the address of each successive element of type float in your array. Since the size of float is exactly 4 bytes in your case, each address equals the previous one plus 4. For instance, 0x11fe010 + 0x4 = 0x11fe014. Of course, address 0x11fe014 is not a multiple of 0x10. If you were to align every float on a 16-byte boundary, you would waste 16 - 4 = 12 bytes per element (room for 16 / 4 - 1 = 3 more floats). Double-check the requirements of the intrinsics that you are using.
AFAIK, both memalign and posix_memalign are doing their job.
&A[0] = 0x11fe010
This is aligned to 16 bytes.
&A[1] = 0x11fe014
When you do &A[1], you are telling the compiler to advance a float pointer by one element. This unavoidably leads to:
&A[0] + sizeof( float ) = 0x11fe010 + 4 = 0x11fe014
If you intend to have every element of your array aligned to 16 bytes, you should consider declaring an array of structures that are 16 bytes wide.
struct float_16byte
{
    float data;
    float padding[ 3 ];
};
Then you must allocate memory for ELEMENT_COUNT (20, in your example) elements:
struct float_16byte *A = ( struct float_16byte * )memalign( 16, ELEMENT_COUNT * sizeof( struct float_16byte ) );