public: True
class: center, middle

# Intel Haswell Microarchitecture

.center[丁保荣 <br> <a href="mailto:Ricky.Ting@qq.com">Ricky.Ting@qq.com</a> ]

<div id="qrcode" style="margin: 0 auto; width: 286px;"></div>

.center[<a href="." id="this-slide-url">f</a>]

---

## Overview
<hr size=1>

* **Haswell Introduction**
* **Haswell ISA**
* **Haswell Front End**
* **Haswell Out-of-Order Scheduling**
* **Haswell Execution Engine**
* **Haswell Memory Hierarchy**
* **Reference**

---

## Haswell Introduction
<hr size=1>

Intel's "Tick-Tock" model[\[1\]](#refer-anchor-1)

* Tick (process): shrink the transistors while keeping the architecture unchanged, to reduce power and cost
* Tock (architecture): update the processor architecture while keeping the process unchanged, to improve performance

--

As Moore's law broke down, Intel announced in 2016 that Tick-Tock would slow to a three-year cycle with an added optimization step:

* Process
* Architecture
* Optimization: with process and architecture unchanged, fix and tune the design to minimize bugs and raise clock frequency

---

## Haswell Introduction
<hr size=1>

Key features of the Haswell[\[2\]](#refer-anchor-1) architecture:

--

* 22nm FinFET manufacturing process

--

* 14- to 19-stage pipeline (depending on whether the uop cache hits) [\[3\]](#refer-anchor-1)

--

* Each core has a 64KB L1 cache (32KB instruction + 32KB data) and a 256KB L2 cache

--

* AVX2: extends integer SIMD to 256-bit vectors and adds gather instructions

--

* FMA: computes expressions such as $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$

--

* BMI: bit-manipulation instructions, useful for cryptography, networking, and similar workloads

--

* TSX (transactional memory): useful for concurrent programming

--

* 2-way Hyper-Threading

---

## Haswell ISA
<hr size=1>

Intel introduced several new instruction set extensions with Haswell, mainly:

--

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions

--

.center[<img src="/image/Haswell/YMM.gif" width="40%">]

---

## Haswell ISA
<hr size=1>

Intel introduced several new instruction set extensions with Haswell, mainly:

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions

.center[<img src="/image/Haswell/AVX2.png" width="60%">]

---

## Haswell ISA
<hr size=1>

Intel introduced several new instruction set extensions with Haswell, mainly:

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions
* FMA[\[5\]](#refer-anchor-1): supports computations of the form $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$

--

.center[<img src="/image/Haswell/FMA.png" width="100%">]

---

## Haswell ISA
<hr size=1>

Intel introduced several new instruction set extensions with Haswell, mainly:

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions
* FMA[\[5\]](#refer-anchor-1): supports computations of the form $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$
* BMI[\[6\]](#refer-anchor-1): three main classes of bit operations: insert, shift, and extract; bit counting; arbitrary-precision integer multiply and rotation

--

.center[<img src="/image/Haswell/BMI.png" width="60%">]

---

## Haswell ISA
<hr size=1>

Intel introduced several new instruction set extensions with Haswell, mainly:

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions.
* FMA[\[5\]](#refer-anchor-1): supports computations of the form $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$.
* BMI[\[6\]](#refer-anchor-1): three main classes of bit operations: insert, shift, and extract; bit counting; arbitrary-precision integer multiply and rotation.
* TSX: designed mainly for concurrent programs.

---

## Aside: Some basic terms
<hr size=1>

* uop (micro-op): Intel translates x86 instructions into RISC-like uops. One x86 instruction may be translated into 1, 2, 3, 4, or more uops.

--

```
add eax, ebx       ; 1 uop  (add)
add eax, [mem1]    ; 2 uops (load, add)
add [mem1], eax    ; 3 uops (load, add, store)
```

--

* uop fusion: the front end may merge two uops into a single fused uop to save pipeline bandwidth.

--

```
mov [esi], eax     ; 1 fused uop, memory write
add eax, [esi]     ; 1 fused uop, read-modify
add [esi], eax     ; 2 single + 1 fused uop, read-modify-write
```

---

## Haswell Front End
<hr size=1>

<img align="right" src="/image/Haswell/haswell-front-end.png" width="45%">

* 2-way SMT

--

* 32KB shared I-cache (8-way)

--

* Partitioned ITLB for 4KB pages (4-way)

--

* 16B/cycle instruction fetch

--

* 20-entry replicated instruction queue

--

* 4 decoders (3 simple, 1 complex)

--

* 1.5K-uop cache (introduced in Sandy Bridge)

--

* Loop Stream Detector (LSD)

--

* Stack Engine (SE)

---

## Haswell Front End
<hr size=1>

**uop cache**

<img align="right" src="/image/Haswell/haswell-front-end.png" width="45%">

* 32B window

--

* Indexed by the IP of the first instruction in the window

--

* Tagged for two threads

--

* 32 sets, 8 ways, 6 uops per line (32 × 8 × 6 = 1536 uops, i.e. 1.5K)

--

* Each 32B window can span 3 of the 8 ways in a set

--

* Performs like a 6KB instruction cache and has a roughly 80% hit rate (according to Intel)

--

* A lookup must hit in full; there are no partial hits

---

## Haswell Front End
<hr size=1>

**Stack Engine**

--
```
push eax             ; 1 single and 1 fused uop
push ebx             ; 1 single and 1 fused uop
mov ebp, esp
mov eax, [esp+16]
```

--

`esp` is on the critical path: the instructions that follow depend on its value.

--

Intel introduced the Stack Engine, which keeps track of `ESP_d`, the difference between `ESP_p` and `ESP_o`.

--

`ESP_p = ESP_o + ESP_d`

--

```
push eax             ; 1 fused uop, ESP_d = -4
push ebx             ; 1 fused uop, ESP_d = -8
mov ebp, esp         ; sync uop inserted, ESP_d reset to 0
mov eax, [esp+16]    ; no sync needed, ESP_d is still 0
```

---

## Haswell Front End
<hr size=1>

**Stack Engine**

--

An example:

```
main:
    push 1
    call FuncA
    pop ecx
    push 2
    call FuncA
    pop ecx
    ...

FuncA:
    push ebp
    mov ebp, esp      ; sync
    sub esp, 100
    mov eax, [ebp+8]
    mov esp, ebp
    pop ebp
    ret
```

---

## Haswell Front End
<hr size=1>

**Design points**

<img align="right" src="/image/Haswell/haswell-front-end.png" width="45%">

--

* Make the common case fast
    * Prefer the decoders over microcode
    * 3 simple decoders
    * Loop Stream Detector
    * Stack Engine

--

* Locality
    * uop cache

--

* Decouple
    * Why front-end?

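---

## Haswell Front End
<hr size=1>

**Stack Engine: a toy model**

The `ESP_p = ESP_o + ESP_d` bookkeeping from the Stack Engine slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Intel's implementation: pushes and pops are assumed to move the stack pointer by 4 bytes, and the `StackEngine` class and its method names are hypothetical.

```python
class StackEngine:
    """Toy model of the ESP_p = ESP_o + ESP_d bookkeeping (hypothetical names)."""

    def __init__(self, esp):
        self.esp_o = esp      # ESP value held in the out-of-order core
        self.esp_d = 0        # delta accumulated by the stack engine
        self.sync_uops = 0    # number of synchronization uops inserted

    @property
    def esp_p(self):
        # ESP as the program sees it.
        return self.esp_o + self.esp_d

    def push(self):
        self.esp_d -= 4       # assumption: 4-byte push, tracked as a delta only

    def pop(self):
        self.esp_d += 4       # assumption: 4-byte pop, tracked as a delta only

    def read_esp(self):
        # Any other use of esp (e.g. `mov ebp, esp` or `[esp+16]`) needs the
        # real value: fold the delta into ESP_o, costing one sync uop,
        # but only when the delta is non-zero.
        if self.esp_d != 0:
            self.esp_o += self.esp_d
            self.esp_d = 0
            self.sync_uops += 1
        return self.esp_o


se = StackEngine(esp=0x1000)
se.push()                     # push eax          -> ESP_d = -4
se.push()                     # push ebx          -> ESP_d = -8
ebp = se.read_esp()           # mov ebp, esp      -> sync uop, ESP_d = 0
addr = se.read_esp() + 16     # mov eax, [esp+16] -> no sync needed
print(hex(ebp), hex(addr), se.sync_uops)  # prints: 0xff8 0x1008 1
```

Only the first use of `esp` after the two pushes pays for a sync uop; the second finds `ESP_d` already at 0, matching the annotated sequence on the earlier slide.
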
---

## Haswell Out-of-Order Scheduling
<hr size=1>

.center[<img src="/image/Haswell/haswell-out-of-order.png" width="100%" >]

--

* Unified Reservation Station

--

* Register renaming

--

* Move elimination

--

* Branch Order Buffer (US6799268B1)[\[9\]](#refer-anchor-1)

---

## Haswell Execution Engine
<hr size=1>

.center[<img src="/image/Haswell/haswell-exec.png" width="60%" >]

---

## Haswell Execution Engine
<hr size=1>

.center[<img src="/image/Haswell/haswell-exec2.png" width="40%" >]

--

* Up to 8 uops/cycle

--

* 4 INT ALUs

--

* 3 INT vector ALUs (256-bit integer SIMD execution)

--

* 2 FP FMA units (5-cycle latency)

--

* 2 load data

--

* 1 store data

---

## Haswell Execution Engine
<hr size=1>

Design points:

--

* Make the common case fast
    * More integer ALUs
    * Fewer store units
    * FMA

--

* Enable more DLP
    * AVX2: 256-bit integer SIMD

---

## Haswell Memory Hierarchy
<hr size=1>

.center[<img src="/image/Haswell/mem.png" width="80%">]

.center[<img src="/image/Haswell/cache.png" width="100%">]

---

## Reference
<hr size=1>

<div id="refer-anchor-1"></div>

- [1] [Wikipedia: Tick–tock model](https://en.wikipedia.org/wiki/Tick–tock_model)
- [2] [Wikipedia: Haswell (microarchitecture)][url_haswell]
- [3] [Intel's Haswell Architecture Analyzed](https://www.anandtech.com/show/6355/intels-haswell-architecture)
- [4] [Intel's AVX2](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2)
- [5] [Intel's FMA](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=FMA&expand=2541)
- [6] [Intel's BMI](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2541,3503&othertechs=BMI1)
- [7] [Intel's Haswell Microarchitecture](https://www.realworldtech.com/haswell-cpu/)
- [8] [Intel's Sandy Bridge Microarchitecture](https://www.realworldtech.com/sandy-bridge/)
- [9] [Intel's Branch Order Buffer](https://patentimages.storage.googleapis.com/31/f2/42/722d2a0eed0120/US6799268.pdf)
- [10] [The microarchitecture of Intel, AMD and VIA CPUs](https://www.agner.org/optimize/microarchitecture.pdf)

[url_haswell]: https://en.wikipedia.org/wiki/Haswell_(microarchitecture)

---

class: center, middle

# Q & A