06.Pipelining & Hazards
Last updated
Last updated
More systematically:
Pipelining the execution of multiple instructions
Idea:
Divide the instruction processing cycle into
distinct “stages”
of processing
Ensure there are enough hardware resources to process one instruction in each stage
Process a different instruction in each stage simultaneously
Instructions consecutive in program order are processed in consecutive stages
Benefit: Increases instruction processing throughput
Multi-cycle: 4 cycles per instruction
Pipelined: 4 cycles per 4 instructions
实际总是如此美好吗?
Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of different inputs
Repetition of independent operations
No dependencies between repeated operations
Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency suboperations
Non-pipelined version with delay T
k-stage pipelined version
注: 灰色为寄存器, 用于保存数据. 实际流水操作后需要计算保存数据的时间. 因为组合逻辑的时间/延迟不会低于门的延迟, 因此最大 BW 时, T/K = 1 gate delay.
Nonpipelined version with combinational cost G
k-stage pipelined version
Instruction fetch (IF)
Instruction decode and register operand fetch (ID/RF)
Execute/Evaluate memory address (EX/AG)
Memory operand fetch (MEM)
Store/writeback result (WB)
goto 1
Is this the correct partitioning?
Why not 4 or 6 stages?
Why not different boundaries?
5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
各个段不均匀/等长.
注: 淡黄色部分为 Pipeline Registers.
-
t0
t1
t2
t3
t4
t5
t6
t7
t8
t9
t10
IF
I0
I1
I2
I3
I4
I5
I6
I7
I8
I9
I10
ID
-
I0
I1
I2
I3
I4
I5
I6
I7
I8
I9
EX
-
-
I0
I1
I2
I3
I4
I5
I6
I7
I8
MEM
-
-
-
I0
I1
I2
I3
I4
I5
I6
I7
WB
-
-
-
-
I0
I1
I2
I3
I4
I5
I6
注: t4 开始为 full pipeline.
Identical set of control points as the single-cycle datapath!!
For a given instruction
same control signals as single-cycle, but
control signals required at different cycles, depending on stage
Option 1: decode once using the same logic as single-cycle and buffer signals until consumed
Option 2: carry relevant “instruction word/field” down the pipeline and decode locally within each or in a previous stage
Which one is better?
注: 第一种更好. 只需要一次译码, 之后只传输控制信号.
Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
Repetition of independent operations
No dependencies between repeated operations
Uniformly partitionable suboperations
Processing an be evenly divided into uniform-latency suboperations (that do not share resources)
Identical operations ... NOT!
-> different instructions -> not all need the same stages
Forcing different instructions to go through the same pipe stages
-> external fragmentation (some pipe stages idle for some instructions)
Uniform suboperations ... NOT!
-> different pipeline stages -> not the same latency
Need to force each stage to be controlled by the same clock
-> internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
Independent operations ... NOT!
-> instructions are not independent of each other
Need to detect and resolve nter-instruction dependencies to ensure the pipeline provides correct results
-> pipeline stalls (pipeline is not always moving)
Balancing work in pipeline stages
How many stages and what is done in each stage
Keeping the pipeline correct, moving, and full
in the presence of events that disrupt pipeline flow
Handling dependences
Data
Control
Handling resource contention
Handling long-latency (multi-cycle) operations
Handling exceptions, interrupts
Overall: Improving pipeline throughput
Minimizing stalls
Stall: A condition when the pipeline stops moving
Resource contention
Dependences (between instructions)
Data
Control
Long-latency (multi-cycle) operations
Also called “ dependency” or less desirably “hazard”
Dependences dictate ordering requirements between instructions
Two types
Data dependence
Control dependence
Resource contention is sometimes called resource dependence
However, this is not fundamental to program semantics.
Happens when instructions in two pipeline stages need the same resource
Solution 1: Eliminate the cause of contention
Duplicate the resource or increase its throughput
E.g., use separate instruction and data memories (caches)
E.g., use multiple ports for memory structures
Solution 2: Detect the resource
contention and stall one of the contending stages
Which stage do you stall?
Example: What if you had a single read and write port for the register file?
Types of data dependences
Flow dependence
(true dependence * read after write)
Output dependence
(write after write)
Anti dependence
(write after read)
Which ones cause stalls in a pipelined machine?
For all of them, we need to ensure semantics of the program is correct(底线)
Flow dependences always need to be obeyed because they constitute true dependence on a value
Anti and output dependencies exist due to limited number of registers in the CPU
They are dependence on a name, not a value
We will later see what we can do about them
Anti and output dependencies are easier to handle
write to the destination in one stage and in program order
Flow dependences are more interesting
Four typical ways of handling flow dependences
Detect and wait until value is available in register file
Detect and forward/bypass data to dependent instruction
Detect and eliminate the dependence at the software level
No need for the hardware to detect dependence
Predict the needed value(s), execute “speculatively ” , and verify
Detection of dependence between instructions in a pipelined processor to guarantee correct execution
Software based interlocking
Hardware based interlocking
MIPS acronym?
Scoreboarding (记分牌算法)
Each register in register file has a Valid bit associated with it
An instruction that is writing to the register resets the Valid bit (set to 0)
An instruction in Decode stage checks if all its source and destination registers are Valid (value is 1)
Yes: No need to stall… No dependence
No: Stall the instruction
Advantage:
Simple, 1 bit per register
Disadvantage:
Need to stall for all types of dependences, not only flow dependence.
如何在输出相关和反相关不停顿?
What changes would you make to the scoreboard to enable this?
Combinational dependence check logic
Special logic that checks if any instruction in later stages (pipeline stages) is supposed to write to any source register of the instruction that is being decoded
Yes: stall the instruction/pipeline
No: no need to stall… no flow dependence
Advantage:
No need to stall on anti and output dependencies
Disadvantage:
Logic is more complex than a scoreboard
Logic becomes more complex as we make the pipeline deeper and wider (think superscalar execution
)
通过硬件检测到相关之后呢……
What do you do afterwards?
Observation: Dependence between two instructions is detected before the communicated data value becomes available
Option 1: Stall the dependent instruction right away
Option 2: Stall the dependent instruction only when necessary -> data forwarding/bypassing(数据前推)
Option 3: …
Problem: A consumer (dependent) instrction has to wait in decode stage until the producer instruction writes its value in the register file
Goal: We do not want to stall the pipeline unnecessarily
Observation: The data value needed by the consumer instrction can be supplied deirectly from a later stage in the pipeline (instead of only from the register file)
Idea: Add additional dependence check logic and data forwarding paths(buses) to supply the producer's value to the consumer right after the value in available
Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones. Why?
If the fetched instruction is a non-control-flow instruction:
Next Fetch PC is the address of the next-sequential instruction
Easy to determine if we know the size of the fetched instruction
If the instruction that is fetched is a control-flow instruction:
How do we determine the next Fetch PC?
In fact, how do we know whether or not the fetched instruction is a control-flow instruction?