cmpxchg for WORD faster than for BYTE
Yesterday I posted this question on how to write a fast spinlock. Thanks to Cory Nelson I seem to have found a method that outperforms the other methods discussed in my question. I use the CMPXCHG instruction to check whether the lock is 0 and thereby free. CMPXCHG operates on BYTE, WORD and DWORD operands. I would have assumed that the instruction operates faster on a BYTE. So I wrote a lock implementing each of the data types:
inline void spin_lock_8(char* lck)
{
__asm
{
mov ebx, lck ;move lck pointer into ebx
xor cl, cl ;set CL to 0
inc cl ;increment CL to 1
pause ;
spin_loop:
xor al, al ;set AL to 0
lock cmpxchg byte ptr [ebx], cl ;compare AL with [ebx]. If equal, ZF is set and CL is stored to the address pointed to by ebx
jnz spin_loop ;jump to spin_loop if ZF is clear (the lock was already taken)
}
}
inline void spin_lock_16(short* lck)
{
__asm
{
mov ebx, lck
xor cx, cx
inc cx
pause
spin_loop:
xor ax, ax
lock cmpxchg word ptr [ebx], cx
jnz spin_loop
}
}
inline void spin_lock_32(int* lck)
{
__asm
{
mov ebx, lck
xor ecx, ecx
inc ecx
pause
spin_loop:
xor eax, eax
lock cmpxchg dword ptr [ebx], ecx
jnz spin_loop
}
}
inline void spin_unlock(<anyType>* lck)
{
__asm
{
mov ebx, lck
mov <byte/word/dword> ptr [ebx], 0
}
}
The lock was then tested using the following pseudo-code (please note that the lck pointer will always point to an address divisible by 4):
<int/short/char>* lck;
threadFunc()
{
loop 10,000,000 times
{
spin_lock_8/16/32 (lck);
spin_unlock(lck);
}
}
main()
{
lck = (char/short/int*)_aligned_malloc(4, 4);//Ensures memory alignment
start 1 thread running threadFunc and measure time;
start 2 threads running threadFunc and measure time;
start 4 threads running threadFunc and measure time;
_aligned_free(lck);
}
I've gotten the following results measured in msecs on a processor with 2 physical cores able to run 4 threads (Ivy Bridge).
        1 thread   2 threads   4 threads
8-bit      200        700         3200
16-bit     200        500         1400
32-bit     200        900         3400
The data suggests that all functions take an equal amount of time to execute in the single-threaded case. But when multiple threads have to check whether lck == 0, using a 16-bit lock can be significantly faster. Why is that? I don't suppose it has anything to do with the alignment of lck?
Thanks in advance.
From what I recall, the lock works on a word (2 bytes). It was written that way when first introduced in the 486.
If you take a lock on a different size, it actually generates the equivalent of 2 locks (lock word A and word B for a double word). For a byte it probably has to prevent the locking of the second byte, which is somewhat similar to 2 locks...
So your results are in line with the CPU's optimizations.
Imagine there are 1234 threads and 16 CPUs. One thread acquires the spinlock, then the OS does a task switch. Now you have 16 CPUs, each running one of the remaining 1233 threads, all spinning in a remarkably pointless way for however long it takes the OS to give CPU time back to the only thread that can release the spinlock. This means the entire OS can basically lock up (with all CPUs going flat out) for a few seconds. That is seriously wasteful; so how do you fix it?
You fix it by not using spinlocks in user-space. Spinlocks should only ever be used if/when task switches can be disabled; and only the kernel should be able to disable task switches.
More specifically, you need to use a mutex. A mutex may spin initially, before giving up and making the thread wait for the lock, and (for typical/low-contention cases) this does help; but it is still a mutex, not a spinlock.
Next; for sane software, what matters (for performance) is avoiding lock contention, and then making sure that the uncontended case is fast (and a good mutex won't cause a task switch if there's no contention). You are measuring the contended/irrelevant case.
Finally; your lock is bad. To avoid excessive use of the lock prefix, you should first test whether you might be able to acquire the lock without any lock prefix, and only use the lock prefix when acquisition looks possible. Intel (and probably lots of other people) call this strategy "test; then (test and set)". In addition, you've failed to understand the purpose of pause (or "rep nop" for assemblers that are so bad that they don't support a 10-year-old instruction).
A half decent spinlock might look something like:
acquire:
lock bts dword [myLock],0 ;Optimistically attempt to acquire
jnc .acquired ;It was acquired!
.retry:
pause
cmp dword [myLock],0 ;Should we attempt to acquire again?
jne .retry ; no, don't use `lock`
lock bts dword [myLock],0 ;Attempt to acquire
jc .retry ;It wasn't acquired, so go back to waiting
.acquired:
ret
release:
mov dword [myLock],0 ;No lock prefix needed here as "myLock" is aligned
ret
Also note that if you've failed to adequately minimise the chance of lock contention, then you do need to care about "fairness" and should not be using a spinlock. The problem with "unfair" spinlocks is that some tasks might be lucky and always get the lock, and some tasks may be unlucky and never get the lock because the lucky tasks have always got it. This has always been a problem for heavily contended locks, but for modern NUMA systems it's become a much more likely problem. In this case, at a minimum you should be using a ticket lock.
The basic idea of a ticket lock is to ensure that tasks acquire the lock in the order they arrive (and not some "possibly extremely bad" random order). For completeness, a ticket lock might look like this:
acquire:
mov eax,1
lock xadd [myLock],eax ;myTicket = currentTicket, currentTicket++
cmp [myLock+4],eax ;Is it my turn?
je .acquired ; yes
.retry:
pause
cmp [myLock+4],eax ;Is it my turn?
jne .retry ; no, wait
.acquired:
ret
release:
lock inc dword [myLock+4]
ret
tl;dr; You shouldn't be using the wrong tool for the job (spinlocks) to begin with; but if you insist on using the wrong tool then at least get the wrong tool implemented properly... :-)