c++

I'm writing a multithreaded application in C++ where performance is critical. I need to do a lot of locking while copying small structures between threads, and I have chosen to use spinlocks for this.

I have done some research and speed testing on this and I found that most implementations are roughly equally fast:

  • Microsoft's CRITICAL_SECTION, with SpinCount set to 1000, scores about 140 time units.
  • Implementing this algorithm with Microsoft's InterlockedCompareExchange scores about 95 time units.
  • I've also tried inline assembly with __asm {}, using something like this code; it scores about 70 time units, but I am not sure that a proper memory barrier has been created.
  • Edit: The times given here are the time it takes for 2 threads to lock and unlock the spinlock 1,000,000 times.
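For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing harness: two threads each lock and unlock a spinlock a given number of times while incrementing a shared counter. It is not the code used for the numbers above. `SimpleSpinLock`, `BenchResult`, and `run_benchmark` are hypothetical names, and the lock is a plain `std::atomic_flag` test-and-set lock chosen only so there is something to time.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical minimal test-and-set spinlock, used only as a timing subject.
class SimpleSpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value; loop while it was already set.
        while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

struct BenchResult {
    long long counter;       // total increments performed under the lock
    double    milliseconds;  // wall-clock time for both threads together
};

// Two threads each lock/unlock `iterations` times around a shared increment.
// Any type with lock() and unlock() members can be plugged in as Lock.
template <class Lock>
BenchResult run_benchmark(long long iterations) {
    Lock lock;
    long long counter = 0;
    auto worker = [&] {
        for (long long i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;  // protected by the lock
            lock.unlock();
        }
    };
    auto start = std::chrono::steady_clock::now();
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    auto stop = std::chrono::steady_clock::now();
    return { counter,
             std::chrono::duration<double, std::milli>(stop - start).count() };
}
```

Calling `run_benchmark<SimpleSpinLock>(1000000)` corresponds to the setup described in the edit. The returned counter should equal twice the iteration count if the lock is correct; the elapsed time depends heavily on the machine and on how the threads are scheduled.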

    I know this isn't a lot of difference, but since a spinlock is a heavily used object, one would think that programmers would have agreed on the fastest possible way to make one. Googling it leads to many different approaches, however. I would think the aforementioned method would be the fastest if implemented using inline assembly with the CMPXCHG8B instruction instead of comparing 32-bit registers. Memory barriers must also be taken into account; this could be done with LOCK CMPXCHG8B (I think?), which guarantees "exclusive rights" to the shared memory between cores. Finally, [some suggest] that busy waits should be accompanied by REP NOP, which would enable Hyper-Threading processors to switch to another thread, but I am not sure whether this is true.

    My performance tests of different spinlocks show that there is not much difference, but for purely academic purposes I would like to know which one is fastest. However, as I have extremely limited experience with assembly language and memory barriers, I would be happy if someone could write the assembly code for the last example I provided, with LOCK CMPXCHG8B and proper memory barriers, in the following template:

    __asm
    {
         spin_lock:
             ;locking code.
         spin_unlock:
             ;unlocking code.
    }
    

    Just look here: x86 spinlock using cmpxchg

    And thanks to Cory Nelson

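    __asm{
    spin_lock:
    xorl %ecx, %ecx                    /* ecx = 0 */
    incl %ecx                          /* ecx = 1, the "locked" value */
    spin_lock_retry:
    xorl %eax, %eax                    /* eax = 0, the expected "unlocked" value */
    lock; cmpxchgl %ecx, (lock_addr)   /* if *lock_addr == eax, store ecx and set ZF */
    jnz spin_lock_retry                /* ZF clear: the lock was held, try again */
    ret
    
    spin_unlock:
    movl $0, (lock_addr)               /* release: a plain store suffices on x86 */
    ret
    }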
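The same compare-and-swap idea can be sketched in portable C++ with `std::atomic`, which avoids hand-written barriers. This is an illustration, not Cory Nelson's code: `CasSpinLock` and `cas_lock_demo` are hypothetical names. On x86, compilers typically emit `lock cmpxchg` for the compare-exchange and a plain `mov` for the release store, but the exact instructions are up to the compiler.

```cpp
#include <atomic>
#include <thread>

// Hypothetical CAS-based spinlock mirroring the assembly above.
class CasSpinLock {
    std::atomic<int> state_{0};  // 0 = free, 1 = held
public:
    void lock() {
        int expected = 0;
        // Try to swap 0 -> 1. Acquire ordering on success keeps the critical
        // section from being reordered before the lock is taken.
        while (!state_.compare_exchange_strong(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed)) {
            expected = 0;  // a failed CAS overwrote `expected` with the current value
        }
    }
    void unlock() {
        // Release ordering publishes the critical section's writes.
        state_.store(0, std::memory_order_release);
    }
};

// Two threads increment a shared counter under the lock; returns the total.
long long cas_lock_demo(int iterations) {
    CasSpinLock lock;
    long long counter = 0;
    auto worker = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter;
}
```

If the lock works, `cas_lock_demo(n)` returns `2 * n`. Like the assembly version, this retries the atomic instruction itself in a tight loop, which is simple but generates a lot of cache-line traffic under contention.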
    

    And another source says: http://www.geoffchappell.com/studies/windows/km/cpu/cx8.htm

           lock    cmpxchg8b qword ptr [esi]
    is replaceable with the following sequence
    
    try:
            lock    bts dword ptr [edi],0
            jnb     acquired
    wait:
            test    dword ptr [edi],1
            je      try
            pause                   ; if available
            jmp     wait
    
    acquired:
            cmp     eax,[esi]
            jne     fail
            cmp     edx,[esi+4]
            je      exchange
    
    fail:
            mov     eax,[esi]
            mov     edx,[esi+4]
            jmp     done
    
    exchange:
            mov     [esi],ebx
            mov     [esi+4],ecx
    
    done:
            mov     byte ptr [edi],0
    

    And here is a discussion about lock-free vs lock implementations: http://newsgroups.derkeiler.com/Archive/Comp/comp.programming.threads/2011-10/msg00009.html


    Although there is already an accepted answer, a few things were missed that could be used to improve all the answers. They are taken from this Intel article, which is all about fast lock implementation:

  • Spin on a volatile read, not an atomic instruction; this avoids unneeded bus locking, especially on highly contended locks.
  • Use back-off for highly contended locks.
  • Inline the lock, preferably with intrinsics for compilers where inline asm is detrimental (basically MSVC).
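The three points above can be sketched together as a test-and-test-and-set lock with exponential back-off. This is an illustration under assumptions, not the Intel article's code: `BackoffSpinLock`, `cpu_relax`, `kMaxBackoff`, and `backoff_lock_demo` are hypothetical names, the back-off cap of 1024 is an arbitrary choice, and a relaxed `std::atomic` load stands in for the "volatile read". The `_mm_pause` intrinsic is used on x86 instead of inline asm, with `std::this_thread::yield` as a fallback elsewhere.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // PAUSE hint for spin-wait loops
#else
inline void cpu_relax() { std::this_thread::yield(); }  // fallback on other targets
#endif

constexpr unsigned kMaxBackoff = 1024;  // arbitrary cap, an assumption to tune

class BackoffSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        unsigned backoff = 1;
        for (;;) {
            // One atomic read-modify-write attempt to take the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Lock is held: wait on plain loads, which stay in the local cache,
            // and pause for a growing number of iterations between checks.
            while (locked_.load(std::memory_order_relaxed)) {
                for (unsigned i = 0; i < backoff; ++i) cpu_relax();
                if (backoff < kMaxBackoff) backoff *= 2;
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Several threads increment a shared counter under the lock; returns the total.
long long backoff_lock_demo(int threads, int iterations) {
    BackoffSpinLock lock;
    long long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Defining `lock()` and `unlock()` inside the class makes them implicitly inline, which covers the third point. Whether back-off helps depends on contention; with only two threads and very short critical sections it may make little measurable difference.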

  • Wikipedia has a good article on spinlocks; here is the x86 implementation:

    http://en.wikipedia.org/wiki/Spinlock#Example_implementation

    Notice that their implementation doesn't use the "lock" prefix, because it is redundant on x86 for the "xchg" instruction: with a memory operand, xchg implicitly has lock semantics, as discussed in this Stack Overflow question:

    On a multicore x86, is a LOCK necessary as a prefix to XCHG?

    REP NOP is an alias for the PAUSE instruction (both assemble to the bytes F3 90); you can learn more about that here:

    How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?

    On the issue of memory barriers, here's everything you might want to know:

    Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney

    http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf
