Approximate cost to access various caches and main memory?

2018-06-12 17:18:54

Can anyone give me the approximate time (in nanoseconds) to access L1, L2 and L3 caches, as well as main memory on Intel i7 processors?

While this isn't specifically a programming question, knowing these kinds of speed details is neccessary for some low-latency programming challenges.

EDIT :
The second link from Dave served the following numbers:

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

EDIT2 :
The most important is the notice under the cited table, saying:

"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS, NUMBERS OF DIMMS , ETC,ETC.. YOUR MILEAGE MAY VARY. "

Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress, this has what you need and more (for example, check page 22 for some timings & cycles for example).

Additionally, this page has some details on clock cycles etc

EDIT: I should highlight that, as well as timing/cycle information, the above intel document addresses much more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).

Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

From: Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine

一个视觉比较

Cost to access various memories in a pretty page

See this page presenting the memory latency decrease from 1990 to 2020.

Summary

Values having decreased but are stabilized since 2005

        1 ns        L1 cache
        3 ns        Branch mispredict
        4 ns        L2 cache
       17 ns        Mutex lock/unlock
      100 ns        Main memory (RAM)
    2 000 ns (2µs)  1KB Zippy-compress

Still some improvements, prediction for 2020

   16 000 ns (16µs) SSD random read (olibre's note: should be less)
  500 000 ns (½ms)  Round trip in datacenter
2 000 000 ns (2ms)  HDD random read (seek)

Approximate cost to access various caches and main memory?

Cost to access various memories in a pretty page

Summary

See also other sources

See also a training