Intel Haswell Microarchitecture

public: True
class: center, middle

# Intel Haswell Microarchitecture

.center[丁保荣 <br> <a href="mailto:Ricky.Ting@qq.com">Ricky.Ting@qq.com</a> ]

.center[<a href="." id="this-slide-url">f</a>]

---

## Overview

* **Haswell Introduction**
* **Haswell ISA**
* **Haswell Front End**
* **Haswell Out-of-Order Scheduling**
* **Haswell Execution Engine**
* **Haswell Memory Hierarchy**
* **Reference**

---
## Haswell Introduction

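Intel's "Tick-Tock" model[\[1\]](#refer-anchor-1)
* Tick (process): shrink the transistors on an unchanged architecture to reduce power and cost
* Tock (architecture): update the processor architecture on an unchanged process to improve performance

However, as Moore's Law slowed down, Intel announced in 2016 that Tick-Tock would slow to a three-year cycle with an added optimization step:
* Process
* Architecture
* Optimization: fix and refine the design on an unchanged process and architecture, minimizing bugs and raising the clock frequency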

---

## Haswell Introduction

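Haswell[\[2\]](#refer-anchor-1) architecture features:
--

* Built on a 22nm FinFET process
--

* 14- to 19-stage pipeline (depending on whether the uop cache hits) [\[3\]](#refer-anchor-1)
--

* Each core has 64KB of L1 cache (32KB instruction + 32KB data) and 256KB of L2 cache
--

* AVX2: extends integer SIMD to 256-bit vectors and adds gather instructions
--

* FMA: fused multiply-add, e.g. $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$
--

* BMI: bit-manipulation instructions, useful for cryptography, networking, and similar workloads
--

* TSX (hardware transactional memory): helps with concurrent programming
--

* 2-way Hyper-Threading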

---

## Haswell ISA

Haswell introduces several new instruction set extensions, mainly:
--

* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions

--
.center[<img src="/image/Haswell/YMM.gif" width="40%">]

---

## Haswell ISA

Haswell introduces several new instruction set extensions, mainly:
* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions

.center[<img src="/image/Haswell/AVX2.png" width="60%">]

---


## Haswell ISA

Haswell introduces several new instruction set extensions, mainly:
* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions
* FMA[\[5\]](#refer-anchor-1): supports computations of the form $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$
* BMI[\[6\]](#refer-anchor-1): three main groups of bit operations: insert, shift, and extract; bit counting; arbitrary-precision integer multiply and rotation
--
.center[<img src="/image/Haswell/BMI.png" width="60%">]

---

## Haswell ISA

Haswell introduces several new instruction set extensions, mainly:
* AVX2[\[4\]](#refer-anchor-1): supports 256-bit integer vector operations and adds 16 gather instructions.
* FMA[\[5\]](#refer-anchor-1): supports computations of the form $\textbf{a} = \textbf{a} \cdot \textbf{b} + \textbf{c}$.
* BMI[\[6\]](#refer-anchor-1): three main groups of bit operations: insert, shift, and extract; bit counting; arbitrary-precision integer multiply and rotation.
* TSX: designed mainly for concurrent programs.
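
The semantics of these extensions can be sketched in plain Python. This is only a scalar model, not how the hardware or the intrinsics are implemented: the function names mirror the instruction mnemonics, the 32-bit register width is an assumption, and `pext` belongs to BMI2, which Haswell introduced alongside BMI1.

```python
from fractions import Fraction

MASK32 = 0xFFFFFFFF  # assumption: model 32-bit general-purpose registers


def gather(base, indices):
    """AVX2-style gather: each lane loads base[index] for its own index."""
    return [base[i] for i in indices]


def fma(a, b, c):
    """Fused multiply-add: a*b + c is computed exactly and rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def blsr(x):
    """BLSR: reset (clear) the lowest set bit."""
    return x & (x - 1) & MASK32


def blsi(x):
    """BLSI: isolate the lowest set bit."""
    return x & -x & MASK32


def tzcnt(x):
    """TZCNT: count trailing zero bits (operand width when x == 0)."""
    x &= MASK32
    return 32 if x == 0 else (x & -x).bit_length() - 1


def pext(src, mask):
    """PEXT (BMI2): pack the bits of src selected by mask into the low bits."""
    out, k = 0, 0
    while mask:
        lowest = mask & -mask      # lowest remaining mask bit
        if src & lowest:
            out |= 1 << k
        k += 1
        mask &= mask - 1           # drop that mask bit
    return out
```

With `a = 1 + 2**-30` and `b = 1 - 2**-30`, the separate multiply and add `a * b - 1.0` rounds the product to `1.0` first and returns `0.0`, while `fma(a, b, -1.0)` keeps the exact product and returns `-2.0**-60`. That single rounding is what distinguishes FMA from a multiply followed by an add.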
---

## Aside: Some basic terms

* uop (micro-op): Intel translates x86 instructions into RISC-like uops. One x86 instruction may translate into 1, 2, 3, 4, or more uops.

```
add eax, ebx     // 1 uop, (add)
add eax, [mem1]  // 2 uops, (load, add)
add [mem1], eax  // 3 uops, (load, add, store)

```

* uop fusion: the front end may merge two uops into a single fused uop to save pipeline bandwidth.

```
mov [esi], eax   // 1 fused uop, memory write
add eax, [esi]   // 1 fused uop, read-modify
add [esi], eax   // 2 single + 1 fused uop, read-modify-write
```

---

## Haswell Front End

* 2-way SMT
--

* 32KB shared I-Cache (8-way)
--

* ITLB for 4KB pages (4-way, partitioned between threads)
--

* 16B/cycle instruction fetch
--

* 20-entry instruction queue, replicated per thread
--

* 4 decoders (3 simple, 1 complex)
--

* 1.5K-uop cache (introduced in Sandy Bridge)
--

* Loop Stream Detector (LSD)
--

* Stack Engine (SE)

---

## Haswell Front End

**uop cache**

* 32B window
--

* indexed by the IP of the first instruction in the window
--

* tagged per thread (two threads)
--

* 32 sets, 8 ways, 6 uops per line
--

* Each 32B window can span up to 3 of the 8 ways in a set
--

* performs like a 6KB instruction cache with a roughly 80% hit rate (according to Intel)
--

* a window must hit in full: every uop of the 32B window has to be in the cache
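
The geometry above can be checked with a little arithmetic. The sketch below uses the numbers from this slide; the set-index function is an assumption for illustration (low bits of the 32B window number), since the slide only says the cache is indexed by the IP of the window's first instruction.

```python
SETS = 32
WAYS = 8
UOPS_PER_LINE = 6
WINDOW_BYTES = 32
MAX_WAYS_PER_WINDOW = 3


def capacity_uops():
    """Total uops the cache can hold."""
    return SETS * WAYS * UOPS_PER_LINE


def max_uops_per_window():
    """Most uops a single 32B window may occupy (3 ways of 6 uops)."""
    return MAX_WAYS_PER_WINDOW * UOPS_PER_LINE


def window_base(ip):
    """Start address of the aligned 32B window that contains ip."""
    return ip & ~(WINDOW_BYTES - 1)


def set_index(ip):
    """Hypothetical index function: low bits of the window number."""
    return (ip // WINDOW_BYTES) % SETS


def window_is_cacheable(uop_count):
    """A window that decodes to more uops than 3 ways hold cannot be cached."""
    return uop_count <= max_uops_per_window()
```

`capacity_uops()` gives 1536, which is the "1.5K uop" figure, and `max_uops_per_window()` gives 18, so a dense 32B window that decodes to more than 18 uops has to go through the regular decoders.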

---

## Haswell Front End

**Stack Engine**

```
	push eax              // 1 single and 1 fused uop
	push ebx              // 1 single and 1 fused uop
	mov ebp, esp
	mov eax, [esp+16]
```

`esp` is on the critical path: the following instructions depend on its value.

Intel introduced the Stack Engine to break this dependency. It tracks `ESP_d`, the offset between the programmer-visible stack pointer `ESP_p` and the value `ESP_o` held in the out-of-order core.

`ESP_p = ESP_o + ESP_d`

```
	push eax              // 1 fused uop, set ESP_d = -4
	push ebx              // 1 fused uop, set ESP_d = -8
	mov ebp, esp          // insert sync uop, set ESP_d = 0
	mov eax, [esp+16]     // no sync needed, ESP_d is still 0
```
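
The bookkeeping in the listing above can be modelled in a few lines. This is a toy sketch, not Intel's implementation: the instruction encoding as `(text, kind, delta)` tuples and the names `count_sync_uops`, `implicit`, and `uses_esp` are made up for illustration.

```python
def count_sync_uops(program):
    """Track ESP_d through a program and count the inserted sync uops.

    program is a list of (text, kind, delta) tuples, where kind is:
      "implicit": push/pop/call/ret, the front end adds delta to ESP_d
      "uses_esp": the instruction reads esp explicitly in the core
      "other":    the instruction does not touch esp
    """
    esp_d = 0
    syncs = 0
    trace = []
    for text, kind, delta in program:
        if kind == "implicit":
            esp_d += delta          # handled in the front end, no ALU uop
        elif kind == "uses_esp" and esp_d != 0:
            syncs += 1              # sync uop: ESP_o = ESP_o + ESP_d
            esp_d = 0
        trace.append((text, esp_d))
    return syncs, trace


listing = [
    ("push eax", "implicit", -4),
    ("push ebx", "implicit", -4),
    ("mov ebp, esp", "uses_esp", 0),
    ("mov eax, [esp+16]", "uses_esp", 0),
]
syncs, trace = count_sync_uops(listing)
```

For this listing the model inserts one sync uop, at `mov ebp, esp`, and `ESP_d` follows the sequence -4, -8, 0, 0, matching the comments in the assembly.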

---

## Haswell Front End

**Stack Engine**

An example:

```
main:
	push 1
	call FuncA
	pop ecx
	push 2
	call FuncA
	pop ecx

...

FuncA:
	push ebp
	mov ebp, esp    	; sync 
	sub esp, 100
	mov eax, [ebp+8]
	mov esp, ebp
	pop ebp
	ret

```

---

## Haswell Front End

**Design points**

* Make the common case fast
	* prefer hardware decoders over microcode (ucode)
	* 3 simple decoders
	* Loop Stream Detector
	* Stack Engine

* Locality
	* uop Cache

* Decouple
	* Why front-end?

---

## Haswell Out-of-Order Scheduling

.center[<img src="/image/Haswell/haswell-out-of-order.png" width="100%" >]

* Unified Reservation Station
--

* Register renaming
--

* Move elimination
--

* Branch Order Buffer (US6799268B1)[\[9\]](#refer-anchor-1)
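
How register renaming and move elimination fit together can be sketched with a toy register alias table (RAT). This is a simplified model under stated assumptions, not Haswell's actual renamer: the `Renamer` class is hypothetical, physical registers are never freed, and there is no reference counting.

```python
class Renamer:
    """Toy register alias table: architectural name -> physical register."""

    def __init__(self, arch_regs):
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.executed_uops = 0

    def _alloc(self):
        phys = self.next_phys
        self.next_phys += 1
        return phys

    def op(self, dst, *srcs):
        """ALU op: read renamed sources, write a fresh physical register."""
        src_phys = [self.rat[s] for s in srcs]
        self.rat[dst] = self._alloc()
        self.executed_uops += 1     # goes to an execution port
        return src_phys, self.rat[dst]

    def mov(self, dst, src):
        """Eliminated move: dst simply aliases src's physical register."""
        self.rat[dst] = self.rat[src]   # no execution uop needed


r = Renamer(["eax", "ebx", "ecx"])
r.mov("ecx", "eax")                  # ecx now shares eax's physical register
sources, dest = r.op("eax", "eax", "ebx")
```

After these two instructions `ecx` still maps to the old physical register of `eax`, while `eax` has moved on to a new one. Only the add counts as an executed uop, so the move costs no execution port in this model.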
---

## Haswell Execution Engine

.center[<img src="/image/Haswell/haswell-exec.png" width="60%" >]

---

## Haswell Execution Engine

.center[<img src="/image/Haswell/haswell-exec2.png" width="40%" >]

* up to 8 uops/cycle
--

* 4 INT ALUs
--

* 3 INT vector ALUs (256-bit integer SIMD execution)
--

* 2 FP FMA units (5-cycle latency)
--

* 2 Load Data
--

* 1 Store Data
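
The two FMA units set the peak floating-point throughput per core. The arithmetic below is a rough sketch that assumes each FMA unit works on a full 256-bit vector and counts one fused multiply-add as two floating-point operations. The clock frequency and core count in the usage line are hypothetical inputs, not figures from these slides.

```python
def peak_flops_per_cycle(fma_units=2, vector_bits=256, element_bits=32):
    """Peak FLOPs per cycle per core: units * lanes * 2 (multiply + add)."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


def peak_gflops(freq_ghz, cores, element_bits=32):
    """Peak GFLOP/s for a chip with the given clock and core count."""
    return freq_ghz * cores * peak_flops_per_cycle(element_bits=element_bits)


single = peak_flops_per_cycle(element_bits=32)   # 2 * 8 lanes * 2
double = peak_flops_per_cycle(element_bits=64)   # 2 * 4 lanes * 2
example = peak_gflops(freq_ghz=3.5, cores=4)     # hypothetical 4-core part
```

Under these assumptions a core peaks at 32 single-precision or 16 double-precision FLOPs per cycle. The peak is reached only when both FMA ports issue every cycle, so real code usually lands well below it.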

---

## Haswell Execution Engine

Design points:
--

* Make the common case fast
	* More integer ALUs
	* Fewer store units than load units
	* FMA
--

* Enable more DLP
	
	* AVX2: 256-bit integer SIMD

---

## Haswell Memory Hierarchy

.center[<img src="/image/Haswell/mem.png" width="80%">]

.center[<img src="/image/Haswell/cache.png" width="100%">]

---

## Reference

<div id="refer-anchor-1"></div>

- [1] [Wikipedia: Tick–tock model](https://en.wikipedia.org/wiki/Tick–tock_model)
- [2] [Wikipedia: Haswell (microarchitecture)][url_haswell] 
- [3] [Intel's Haswell Architecture Analyzed](https://www.anandtech.com/show/6355/intels-haswell-architecture)
- [4] [Intel's AVX2](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2)
- [5] [Intel's FMA](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=FMA&expand=2541)
- [6] [Intel's BMI](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2541,3503&othertechs=BMI1)
- [7] [Intel's Haswell Microarchitecture](https://www.realworldtech.com/haswell-cpu/)
- [8] [Intel's Sandy Bridge Microarchitecture](https://www.realworldtech.com/sandy-bridge/)
- [9] [Intel's Branch Order Buffer](https://patentimages.storage.googleapis.com/31/f2/42/722d2a0eed0120/US6799268.pdf)
- [10] [The microarchitecture of Intel, AMD and VIA CPUs](https://www.agner.org/optimize/microarchitecture.pdf)

[url_haswell]: https://en.wikipedia.org/wiki/Haswell_(microarchitecture)

---
class: center, middle
# Q & A