Speed of operations on misaligned data

To my knowledge, a CPU performs best with a datum that is aligned on a boundary equal to the size of that datum. For example, if every int is 4 bytes in size, then the address of every int must be a multiple of 4 to make the CPU happy; the same goes for 2-byte short data and 8-byte double data. For this reason, the new operator and the malloc function return an address that is suitably aligned for any fundamental type (typically a multiple of 8 or 16) and is, therefore, also a multiple of 4 and 2.

In my program, some time-critical algorithms that process large byte arrays stride through the data by reading each contiguous group of 4 bytes as an unsigned int, which makes the arithmetic much faster. However, the starting address is not guaranteed to be a multiple of 4, because only part of a byte array may need to be processed.

As far as I know, Intel CPUs handle misaligned data correctly, but at the expense of speed. If operating on misaligned data is slow enough, the algorithms in my program will need to be redesigned. I have two questions about this, the first of which is illustrated by the following code:

// the address of array0 is a multiple of 4:
unsigned char* array0 = new unsigned char[4];
array0[0] = 0x00;
array0[1] = 0x11;
array0[2] = 0x22;
array0[3] = 0x33;
// the address of array1 is a multiple of 4 too:
unsigned char* array1 = new unsigned char[5];
array1[0] = 0x00;
array1[1] = 0x00;
array1[2] = 0x11;
array1[3] = 0x22;
array1[4] = 0x33;
// OP1: the address of the 1st operand is a multiple of 4,
// which is optimal for an unsigned int:
unsigned int anUInt0 = *((unsigned int*)array0) + 1234;
// OP2: the address of the 1st operand is not a multiple of 4:
unsigned int anUInt1 = *((unsigned int*)(array1 + 1)) + 1234;

So the questions are:

  • How much slower is OP2 compared to OP1 on x86, x86-64, and Itanium processors (neglecting the cost of the type cast and the address increment)?

  • When writing cross-platform portable code, which kinds of processors should I be concerned about with regard to misaligned data access? (I already know about the RISC ones.)


  • There are far too many processors on the market to give a generic answer. The only thing that can be stated with certainty is that some processors cannot perform an unaligned access at all; this may or may not matter to you if your program is intended to run in a homogeneous environment, e.g. Windows.

    On a modern high-speed processor, the speed of an unaligned access may depend more on whether it crosses a cache-line boundary than on its address alignment as such. On today's x86 processors the cache line size is 64 bytes.

    There's a Wikipedia article that might provide some general guidance: http://en.wikipedia.org/wiki/Data_structure_alignment
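
As a minimal sketch of the two points above, the C++ below shows one common way to read 4 bytes from an arbitrary address without relying on the processor tolerating a misaligned pointer dereference, and a check for whether an access straddles a cache line. The names load_u32 and crosses_cache_line are hypothetical helpers, not part of the original program, and the 64-byte default is simply the x86 figure quoted above, so it is an assumption for other processors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read 4 bytes starting at p, whatever p's alignment.
// std::memcpy is well defined for any address; optimizing compilers
// typically turn this fixed-size copy into a single load on targets
// that permit unaligned loads.
inline std::uint32_t load_u32(const unsigned char* p) {
    assert(p != nullptr);
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Hypothetical helper: true if the byte range [p, p + size) spans more
// than one cache line. line_size = 64 is an assumption taken from the
// x86 figure mentioned in the answer.
inline bool crosses_cache_line(const void* p, std::size_t size,
                               std::size_t line_size = 64) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr % line_size) + size > line_size;
}
```

With helpers like these, OP2 from the question could be written as load_u32(array1 + 1) + 1234, which stays valid on processors that fault on misaligned accesses. How much slower it runs than OP1 on a given processor is something only a measurement on that processor can tell you.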
