Approximate cost to access various caches and main memory?

2018-06-12 17:18:54

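Can anyone give me the approximate time (in nanoseconds) to access the L1, L2 and L3 caches, as well as main memory, on Intel i7 processors?

While this isn't specifically a programming question, knowing these kinds of speed details is necessary for some low-latency programming challenges.

EDIT:
Dave's second link provides the following numbers: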

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns
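
The nanosecond ranges in the table come from dividing each cycle count by the clock frequency. Here is a minimal Python sketch of that arithmetic. The two clock speeds (about 1.87 GHz and 3.33 GHz) are not stated in the table; they are inferred here from its figures, so treat them as an assumption. The function and constant names are made up for illustration.

```python
# Assumed clock range, inferred from the table's own numbers (not stated there).
SLOW_GHZ = 1.87
FAST_GHZ = 3.33

def cycles_to_ns(cycles, ghz):
    """Convert a cycle count to nanoseconds at a clock of `ghz` GHz."""
    return cycles / ghz  # at 1 GHz, one cycle takes one nanosecond

# Approximate cycle counts copied from the table above.
LATENCY_CYCLES = {
    "L1 hit": 4,
    "L2 hit": 10,
    "L3 hit, line unshared": 40,
    "L3 hit, shared line in another core": 65,
    "L3 hit, modified in another core": 75,
}

def latency_range_ns(cycles):
    """Return (ns at the slow clock, ns at the fast clock) for a cycle count."""
    return cycles_to_ns(cycles, SLOW_GHZ), cycles_to_ns(cycles, FAST_GHZ)

if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        slow, fast = latency_range_ns(cycles)
        print(f"{name:38s} ~{cycles:3d} cycles  ({slow:5.1f} - {fast:5.1f} ns)")
```

The DRAM rows are given directly in nanoseconds, so no conversion applies to them. The results may differ from the table in the last digit, since the table's exact clock values and rounding are unknown.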

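EDIT2:
Most important is the notice under the cited table, which says:

"NOTE: These values are rough approximations. They depend on core and uncore frequencies, memory speeds, BIOS settings, number of DIMMs, etc. Your mileage may vary."

Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress that it has what you need and more (for example, see page 22 for some timings and cycle counts).

Additionally, this page has some details on clock cycles.

EDIT: I should stress that, as well as timing/cycle information, the Intel document above covers many more (extremely) useful details of the i7 and Xeon range of processors from a performance point of view.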

Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed of light: a photon travels 1 ft (30.5 cm)
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

Source: originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine
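
To get a feel for the spread in the list above, it can help to express each entry as a multiple of an L1 cache reference and in its natural unit. Here is a minimal Python sketch using a few of the listed values. The helper names (`format_ns`, `relative_to_l1`) and the choice of L1 as the baseline are illustrative assumptions, not part of the original sources.

```python
# A few entries copied from the list above, in nanoseconds.
LATENCIES_NS = {
    "L1 dcache reference": 0.5,
    "L2 cache reference": 7,
    "Own DDR memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Disk seek": 10_000_000,
    "Send packet CA -> Netherlands": 150_000_000,
}

L1_NS = LATENCIES_NS["L1 dcache reference"]

def format_ns(ns):
    """Render a nanosecond value in ns, us or ms, whichever is the largest unit that fits."""
    if ns >= 1_000_000:
        return f"{ns / 1_000_000:g} ms"
    if ns >= 1_000:
        return f"{ns / 1_000:g} us"
    return f"{ns:g} ns"

def relative_to_l1(ns):
    """Return how many L1 cache references fit in `ns` nanoseconds."""
    return ns / L1_NS

if __name__ == "__main__":
    for name, ns in LATENCIES_NS.items():
        print(f"{name:36s} {format_ns(ns):>8s}  {relative_to_l1(ns):>13,.0f} x L1")
```

With these figures, a main-memory reference costs about 200 L1 references, and a disk seek about 20 million.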

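A visual comparison

Cost to access various memories, on a pretty page

See this page, which presents the decrease in memory latency from 1990 to 2020.

Summary

Values have decreased, but have stabilized since 2005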

        1 ns        L1 cache
        3 ns        Branch mispredict
        4 ns        L2 cache
       17 ns        Mutex lock/unlock
      100 ns        Main memory (RAM)
    2 000 ns (2µs)  1KB Zippy-compress

Still some improvements, predicted for 2020

   16 000 ns (16µs) SSD random read (olibre's note: should be less)
  500 000 ns (½ms)  Round trip in datacenter
2 000 000 ns (2ms)  HDD random read (seek)

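See also other sources

What Every Programmer Should Know About Memory, by Ulrich Drepper (2007).
Old, but still an excellent in-depth explanation of how memory hardware and software interact.

Full PDF (114 pages)

Comments on LWN about the PDF version

Another

Seven posts on LWN + comments

Part 1 - Introduction

Part 2 - Caches

Part 3 - Virtual memory

Part 4 - NUMA support

Part 5 - What programmers can do

Part 6 - More things programmers can do

Part 7 - Memory performance tools

The Infinite Space Between Words, on codinghorror.com, based on the book "Systems Performance: Enterprise and the Cloud"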

Click on each processor listed at http://www.7-cpu.com/ to see its L1/L2/L3/RAM/... latencies (e.g. the Haswell i7-4770 has L1 = 1 ns, L2 = 3 ns, L3 = 10 ns, RAM = 67 ns, branch misprediction = 4 ns)

http://idarkside.org/posts/numbers-you-should-know/

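See also: training

For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) by Gerhard Wellein, Hannes Hofmann and Dietmar Fey at the University of Erlangen-Nürnberg.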
