What does multicore assembly language look like?

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc.

With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers)? If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented? Is there a "CPU context" or "thread" concept in x86 assembler now?

How does communication/synchronization between the cores work?

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores? Is it some special privileged instruction(s)?

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

What changes have been made to x86 machine code to support multi-core functionality?


This isn't a direct answer to the question, but it's an answer to a question that appears in the comments. Essentially, the question is what support the hardware gives to multi-threaded operation.

Nicholas Flynt had it right, at least regarding x86. In a multi-threaded environment (Hyper-Threading, multi-core or multi-processor), the bootstrap thread (usually thread 0 in core 0 in processor 0) starts up fetching code from address 0xfffffff0. All the other threads start up in a special sleep state called Wait-for-SIPI (WFS). As part of its initialization, the primary thread sends a special inter-processor interrupt (IPI) over the APIC, called a SIPI (Startup IPI), to each thread that is in WFS. The SIPI contains the address from which that thread should start fetching code.

This mechanism allows each thread to execute code from a different address. All that's needed is software support for each thread to set up its own tables and messaging queues. The OS uses those to do the actual multi-threaded scheduling.

As far as the actual assembly is concerned, as Nicholas wrote, there's no difference between the assembly for a single-threaded or a multi-threaded application. Each logical thread has its own register set, so writing:

mov edx, 0

will only update EDX for the currently running thread. There's no way to modify EDX on another processor using a single assembly instruction. You need some sort of system call to ask the OS to tell another thread to run code that will update its own EDX.
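
A rough user-space analogy of the "each thread has its own register set" point, sketched in C with POSIX threads. This is not bare-metal code: the function names `count_up` and `run_two_counters` are made up for illustration, and the local variable `edx` merely stands in for a register (the compiler is free to keep it in a register or on that thread's stack).

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Each worker counts in a local variable. Locals live in that thread's own
 * registers or stack, so the two counts never interfere with each other. */
static void *count_up(void *arg) {
    uintptr_t n = (uintptr_t)arg;   /* how many increments to perform */
    uintptr_t edx = 0;              /* stand-in for this thread's EDX */
    for (uintptr_t i = 0; i < n; i++)
        edx++;
    return (void *)edx;             /* hand the private result back   */
}

/* Hypothetical demo: run two threads and collect each thread's final count.
 * Returns 0 on success, -1 if a thread could not be created or joined. */
static int run_two_counters(uintptr_t out[2]) {
    pthread_t t[2];
    const uintptr_t iterations[2] = {5, 7};
    for (int i = 0; i < 2; i++)
        if (pthread_create(&t[i], NULL, count_up, (void *)iterations[i]) != 0)
            return -1;
    for (int i = 0; i < 2; i++) {
        void *result;
        if (pthread_join(t[i], &result) != 0)
            return -1;
        out[i] = (uintptr_t)result;
    }
    return 0;
}
```

Calling `run_two_counters` from a `main` fills the array with one count per thread. Neither thread can reach the other's `edx`; the only way the values leave a thread here is by being handed back explicitly through `pthread_join`.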


As I understand it, each "core" is a complete processor, with its own register set. Basically, the BIOS starts you off with one core running, and then the operating system can "start" other cores by initializing them and pointing them at the code to run, etc.

Synchronization is done by the OS. Generally, each processor is running a different process for the OS, so the multi-threading functionality of the operating system is in charge of deciding which process gets to touch which memory, and what to do in the case of a memory collision.


Minimal runnable Intel x86 bare metal example

Runnable bare metal example with all required boilerplate. All major parts are covered below.

Tested on Ubuntu 15.10 QEMU 2.3.0 and Lenovo ThinkPad T400.

The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 covers SMP in chapters 8, 9 and 10.

Table 8-1. "Broadcast INIT-SIPI-SIPI Sequence and Choice of Timeouts" contains an example that basically just works:

MOV ESI, ICR_LOW    ; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H  ; Load ICR encoding for broadcast INIT IPI
                    ; to all APs into EAX.
MOV [ESI], EAX      ; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH  ; Load ICR encoding for broadcast SIPI IPI
                    ; to all APs into EAX, where xx is the vector computed in step 10.
MOV [ESI], EAX      ; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop
MOV [ESI], EAX      ; Broadcast second SIPI IPI to all APs
                    ; Waits for the timer interrupt until the timer expires
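
The two magic constants in that listing can be rebuilt from the ICR bit fields. The following is only a sketch in C under my reading of the ICR layout (vector in bits 0-7, delivery mode in bits 8-10, level in bit 14, destination shorthand in bits 18-19); the macro and function names are invented for illustration and do not come from the Intel manual.

```c
#include <stdint.h>

/* Bit fields in the low dword of the local APIC ICR. */
#define ICR_DELIVERY_INIT    (UINT32_C(0x5) << 8)   /* delivery mode 101b: INIT      */
#define ICR_DELIVERY_STARTUP (UINT32_C(0x6) << 8)   /* delivery mode 110b: Start-up  */
#define ICR_LEVEL_ASSERT     (UINT32_C(1)   << 14)  /* level = assert                */
#define ICR_ALL_BUT_SELF     (UINT32_C(0x3) << 18)  /* shorthand 11b: all excl. self */

/* Hypothetical helper: ICR value for a broadcast INIT IPI. */
static uint32_t icr_broadcast_init(void) {
    return ICR_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_INIT;
}

/* Hypothetical helper: ICR value for a broadcast SIPI carrying `vector`. */
static uint32_t icr_broadcast_sipi(uint8_t vector) {
    return ICR_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_STARTUP | vector;
}

/* Address at which an AP starts executing after a SIPI with `vector`:
 * CS = vector * 0x100, IP = 0, and real mode computes CS * 0x10 + IP. */
static uint32_t sipi_start_address(uint8_t vector) {
    uint32_t cs = (uint32_t)vector << 8;
    return cs << 4;
}
```

With these definitions, `icr_broadcast_init()` yields 0x000C4500 and `icr_broadcast_sipi(0x01)` yields 0x000C4601, matching the pattern 000C46XXH in the listing. `sipi_start_address(0x01)` gives 0x1000, so the vector effectively selects a 4 KiB-aligned page below 1 MiB.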

On that code:

  • Most operating systems will make most of those operations impossible from ring 3 (user programs).

    So you need to write your own kernel to play freely with it: a userland Linux program will not work.

  • At first, a single processor runs, called the bootstrap processor (BSP).

    It must wake up the other ones (called Application Processors (AP)) through special interrupts called Inter Processor Interrupts (IPI).

    Those interrupts are sent by programming the Advanced Programmable Interrupt Controller (APIC) through the Interrupt Command Register (ICR).

    The format of the ICR is documented at: 10.6 "ISSUING INTERPROCESSOR INTERRUPTS"

    The IPI happens as soon as we write to the ICR.

  • ICR_LOW is defined at 8.4.4 "MP Initialization Example" as:

    ICR_LOW EQU 0FEE00300H
    

    The magic value 0FEE00300 is the memory address of the ICR, as documented at Table 10-1 "Local APIC Register Address Map"

  • The simplest possible method is used in the example: it sets up the ICR to send broadcast IPIs which are delivered to all other processors except the current one.

    But it is also possible, and recommended by some, to get information about the processors through special data structures set up by the BIOS, like ACPI tables or Intel's MP configuration table, and only wake up the ones you need one by one.

  • XX in 000C46XXH encodes the address of the first instruction that the processor will execute as:

    CS = XX * 0x100
    IP = 0
    

    Remember that in real mode the segment value in CS is multiplied by 0x10 to form the base address, so the actual memory address of the first instruction is:

    XX * 0x1000
    

    So if for example XX == 1, the processor will start at 0x1000.

    We must then ensure that there is 16-bit real mode code to be run at that memory location, eg with:

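    /* Copy the AP startup code to 0x1000, where the SIPI vector points. */
    cld
    mov $init_len, %ecx   /* number of bytes to copy */
    mov $init, %esi       /* source: the code below */
    mov $0x1000, %edi     /* destination; note the $ for an immediate */
    rep movsb
    
    /* Code the AP runs when it wakes up in 16-bit real mode. */
    .code16
    init:
        xor %ax, %ax
        mov %ax, %ds
        /* Do stuff. */
        hlt
    .equ init_len, . - init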
    

    Using a linker script is another possibility.

  • The delay loops are an annoying part to get working: there is no super simple way to do such sleeps precisely.

    Possible methods include:

      • PIT (used in my example)
      • HPET
      • calibrate the time of a busy loop with the above, and use it instead

    Related: How to display a number on the screen and sleep for one second with DOS x86 assembly?

  • I think the initial processor needs to be in protected mode for this to work, as we write to address 0FEE00300H, which is too high to reach with 16-bit real mode addressing.

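  • To communicate between processors, we can use a spinlock on the main processor, and modify the lock from the second core.

    We should ensure that the write becomes visible to the other core, e.g. through wbinvd. Note that x86 keeps the caches of different cores coherent in hardware for normal write-back memory, so a locked instruction or a memory fence is normally sufficient and far cheaper than flushing the whole cache.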

  • Shared state between processors

    8.7.1 "State of the Logical Processors" says:

    The following features are part of the architectural state of logical processors within Intel 64 or IA-32 processors supporting Intel Hyper-Threading Technology. The features can be subdivided into three groups:

      • Duplicated for each logical processor
      • Shared by logical processors in a physical processor
      • Shared or duplicated, depending on the implementation

    The following features are duplicated for each logical processor:

      • General purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP)
      • Segment registers (CS, DS, SS, ES, FS, and GS)
      • EFLAGS and EIP registers. Note that the CS and EIP/RIP registers for each logical processor point to the instruction stream for the thread being executed by the logical processor.
      • x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
      • MMX registers (MM0 through MM7)
      • XMM registers (XMM0 through XMM7) and the MXCSR register
      • Control registers and system table pointer registers (GDTR, LDTR, IDTR, task register)
      • Debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and the debug control MSRs
      • Machine check global status (IA32_MCG_STATUS) and machine check capability (IA32_MCG_CAP) MSRs
      • Thermal clock modulation and ACPI Power management control MSRs
      • Time stamp counter MSRs
      • Most of the other MSR registers, including the page attribute table (PAT). See the exceptions below.
      • Local APIC registers.
      • Additional general purpose registers (R8-R15), XMM registers (XMM8-XMM15), control register, IA32_EFER on Intel 64 processors.

    The following features are shared by logical processors:

      • Memory type range registers (MTRRs)

    Whether the following features are shared or duplicated is implementation-specific:

      • IA32_MISC_ENABLE MSR (MSR address 1A0H)
      • Machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs)
      • Performance monitoring control and counter MSRs

  • Cache sharing is discussed at:

      • http://stackoverflow.com/questions/4802565/multiple-threads-and-cpu-cache
      • Can multiple CPU's / cores access the same RAM simutaneously?

  • Intel hyperthreads have greater cache and pipeline sharing than separate cores: https://superuser.com/questions/133082/hyper-threading-and-dual-core-whats-the-difference/995858#995858
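
The spinlock idea mentioned in the list above (the main processor spins on a flag that the second core modifies) can be approximated in user space. This is only a sketch with POSIX threads and C11 atomics, not the bare metal code itself; `second_core`, `wait_for_second_core`, `ready` and `payload` are names made up for illustration. On x86 the release store and acquire load used here typically compile to ordinary MOV instructions, because the hardware already keeps caches coherent and orders stores.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int ready;   /* 0 = not yet, 1 = payload has been published */
static int payload;        /* plain data handed from one thread to the other */

/* Plays the role of the second core: write the data, then release the waiter. */
static void *second_core(void *arg) {
    (void)arg;
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Plays the role of the main processor: spin until the flag changes.
 * Returns the payload that was observed, or -1 if the thread failed to start. */
static int wait_for_second_core(void) {
    pthread_t t;
    atomic_store(&ready, 0);
    if (pthread_create(&t, NULL, second_core, NULL) != 0)
        return -1;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* busy-wait; bare metal code often places a PAUSE in this loop */
    int seen = payload;   /* ordered after the acquire load, so the write is visible */
    pthread_join(t, NULL);
    return seen;
}
```

The release/acquire pair is what guarantees that the plain write to `payload` is visible once the flag is seen as set; no cache flush is involved.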

    Linux kernel 4.2

    The main initialization action seems to be at arch/x86/kernel/smpboot.c.
