Approximate cost to access various caches and main memory?

Can anyone give me the approximate time (in nanoseconds) to access L1, L2 and L3 caches, as well as main memory on Intel i7 processors?

While this isn't specifically a programming question, knowing these kinds of speed details is neccessary for some low-latency programming challenges.


EDIT :
The second link from Dave served the following numbers:

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

EDIT2 :
The most important is the notice under the cited table, saying:

"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS, NUMBERS OF DIMMS , ETC,ETC.. YOUR MILEAGE MAY VARY. "


Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress, this has what you need and more (for example, check page 22 for some timings & cycles for example).

Additionally, this page has some details on clock cycles etc

EDIT: I should highlight that, as well as timing/cycle information, the above intel document addresses much more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).


Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

From: Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine

一个视觉比较


Cost to access various memories in a pretty page

  • See this page presenting the memory latency decrease from 1990 to 2020.
  • Summary

  • Values having decreased but are stabilized since 2005

            1 ns        L1 cache
            3 ns        Branch mispredict
            4 ns        L2 cache
           17 ns        Mutex lock/unlock
          100 ns        Main memory (RAM)
        2 000 ns (2µs)  1KB Zippy-compress
    
  • Still some improvements, prediction for 2020

       16 000 ns (16µs) SSD random read (olibre's note: should be less)
      500 000 ns (½ms)  Round trip in datacenter
    2 000 000 ns (2ms)  HDD random read (seek)
    
  • See also other sources

  • What every programmer should know about memory from Ulrich Drepper (2007)
    Old but still an excellent deep explanation about memory hardware and software interaction.
  • Full PDF (114 pages)
  • Comments on LWN about PDF version
  • Another ones
  • Seven posts on LWN + Comments
  • Part 1 - Introduction
  • Part 2 - Cache
  • Part 3 - Virtual Memory
  • Part 4 - NUMA support
  • Part 5 - What programmers can do
  • Part 6 - More things programmers can do
  • Part 7 - Memory performance tools
  • Post The Infinite Space Between Words in codinghorror.com based on book Systems Performance: Enterprise and the Cloud
  • Click to each processor listed on http://www.7-cpu.com/ to see the L1/L2/L3/RAM/... latencies (eg Haswell i7-4770 has L1=1ns, L2=3ns, L3=10ns, RAM=67ns, BranchMisprediction=4ns)
  • http://idarkside.org/posts/numbers-you-should-know/
  • See also a training

    For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) from Gerhard Wellein, Hannes Hofmann and Dietmar Fey at University Erlangen-Nürnberg.

    链接地址: http://www.djcxy.com/p/36352.html

    上一篇: Xeon每个内存访问会将多少个字节带入缓存?

    下一篇: 访问各种缓存和主内存的近似成本?