16.Interconnection Networks


Outline

  • Introduction

  • Network Topology

  • Flow Control

  • Example IN

Introduction

How to connect individual devices together

Interconnection Network
  • Device (definition):

    • Component within a computer

    • Single computer

    • System of computers

  • Types of elements:

    • end nodes (device + interface)

    • links

    • interconnection network

  • Internetworking: interconnection of multiple networks

  • Interconnection networks should be designed to transfer the maximum amount of information in the least amount of time (subject to cost and power constraints) so as not to bottleneck the system

Reasons to care about interconnection networks

  • They provide external connectivity from system to outside world

  • Also, connectivity within a single computer system at many levels

    • I/O units, boards, chips, modules and blocks inside chips

  • Trends: high demand on communication bandwidth

    • increased computing power and storage capacity

    • switched networks are replacing buses

  • Computer architects must understand interconnect problems and solutions in order to more effectively design and evaluate systems

Interconnection Network Domains

  • Interconnection networks can be grouped into four major networking domains, depending on the number and proximity of devices to be interconnected:

    • OCNs, SANs, LANs, and WANs

  • On-chip networks (OCNs), a.k.a., network-on-chip (NoC)

    • Interconnect microarchitecture functional units, register files, caches, compute tiles, processor and IP cores

    • Chip or multichip modules

    • Tens (in future, possibly 100s) of devices interconnected

    • Maximum interconnect distance on the order of centimeters

  • System/storage area networks (SANs)

    • Multiprocessor and multicomputer systems

      • Interprocessor and processor-memory interconnections

    • Server and data center environments

      • Storage and I/O components

    • Hundreds to thousands of devices interconnected

      • IBM Blue Gene/L supercomputer (64K nodes, each with 2 processors)

    • Maximum interconnect distance typically on the order of tens of meters, but some with as high as a few hundred meters

      • InfiniBand: 120 Gbps over a distance of 300 m

  • Local area networks (LANs)

    • Machine room or throughout a building or campus

    • Hundreds of devices interconnected (1,000s with bridging)

    • Maximum interconnect distance on the order of few kilometers (some with distance spans of a few tens of kilometers)

    • Example (most popular): Ethernet, with 10 Gbps over 40 km

  • Wide area networks (WANs)

    • Interconnect systems distributed across the globe

    • Internetworking support is required

    • Many millions of devices interconnected

    • Maximum interconnect distance of many thousands of kilometers

    • Example: ATM

Network Topology

Topology Overview

  • Definition: determines arrangement of channels and nodes in network

    • Analogous to road map

  • Often first step in network design

  • Significant impact on network cost-performance

    • Determines number of hops

      • Latency

      • Network energy consumption

    • Implementation complexity

      • Node degree

      • Ease of layout

Network Metrics

  • Metrics are used to evaluate the performance and cost of network with given topology

  • Also influenced by routing/flow control

    • At this stage, we skip these impacts

      • Assume ideal routing (perfect load balancing)

      • Assume ideal flow control (no idle cycles on any channel)

(1) Degree

  • Switch Degree: number of links at a node

    • Proxy for estimating cost

      • Higher degree requires more links and port counts at each router

(2) Hop Count

  • Path: ordered set of channels between source and destination

  • Hop Count: number of hops a message takes from source to destination

    • Simple, useful proxy for network latency

      • Every node and link incurs some propagation delay, even when there is no contention

  • Minimal hop count: smallest hop count connecting two nodes

  • Network diameter: largest min hop count in network

  • Average minimum hop count: average across all src/dst pairs

    • Implementation may incorporate non-minimal paths

      • Increases average hop count

  • Under uniform random traffic, average minimum hop count: Ring > Mesh > Torus (see the sketch below)
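A minimal sketch (not from the slides) that checks this ordering by brute force: build adjacency lists for a hypothetical 16-node ring, 4×4 mesh, and 4×4 torus, then compute diameter and average minimum hop count with BFS:

```python
from collections import deque
from itertools import product

def ring(n):
    # each node i links to its two neighbors on the ring
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh(k, wrap=False):
    # k x k grid; wrap=True adds wraparound links, turning the mesh into a torus
    nodes = list(product(range(k), range(k)))
    adj = {v: [] for v in nodes}
    for (x, y) in nodes:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                adj[(x, y)].append((nx % k, ny % k))
            elif 0 <= nx < k and 0 <= ny < k:
                adj[(x, y)].append((nx, ny))
    return adj

def hop_counts(adj, src):
    # BFS: minimal hop count from src to every reachable node
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_average(adj):
    dists = [d for s in adj for d in hop_counts(adj, s).values() if d > 0]
    return max(dists), sum(dists) / len(dists)

for name, adj in [("ring", ring(16)),
                  ("mesh", mesh(4)),
                  ("torus", mesh(4, wrap=True))]:
    diam, avg = diameter_and_average(adj)
    print(f"{name:5s} diameter={diam} avg_hops={avg:.2f}")
```

Running it shows the ring has the largest diameter and average hop count, and the torus the smallest.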

(3) Latency

  • Time for packet to traverse network

    • Start: head arrives at input port

    • End: tail departs output port

  • Latency = Head latency + serialization latency

    • Serialization latency: time for a packet of length L to cross a channel with bandwidth b, i.e., L/b

  • Approximate with hop count

    • Other design choices (routing, flow control) also impact latency

      • We skip these impacts for now
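As a worked equation (a sketch; the symbols H for hop count, t_r for per-hop router delay, L for packet length, and b for channel bandwidth are introduced here, not in the slides):

$$ T_{packet} \approx \underbrace{H \cdot t_r}_{\text{head latency}} + \underbrace{L/b}_{\text{serialization latency}} $$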

(4) Maximum Channel Load

  • Estimate max bandwidth the network can support

    • Max bits per second (bps) that can be injected by every node before it saturates

      • Saturation: network cannot accept any more traffic

    • Determine the most congested link

      • For a given traffic pattern

      • It will limit overall network bandwidth

      • Estimate the load on this channel

Maximum Channel Load Example

  • Uniform random

    • Every node has equal probability of sending to every node

  • Identify bottleneck channel

  • Half of the traffic from every node will cross the bottleneck channel

    • Channel load: 8 × ½ = 4 (in units of the injection bandwidth)

  • Network saturates at ¼ injection bandwidth
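Restating the slide's arithmetic (the 8-node figure itself is omitted; λ for per-node injection rate and b for channel bandwidth are symbols added here): each of the 8 nodes sends half of its traffic across the bottleneck channel, so

$$ \gamma = 8 \times \tfrac{1}{2}\,\lambda = 4\lambda, \qquad 4\lambda \le b \;\Rightarrow\; \lambda_{sat} = \tfrac{b}{4} $$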

(5) Bisection Bandwidth

  • Common off-chip metric

    • Proxy for cost

  • Amount of global wiring that will be necessary

    • Less useful for on-chip

      • Global on-chip wiring is considered abundant

  • Cut: partitions all the nodes into two disjoint sets

    • The bandwidth of a cut is the total bandwidth of the channels crossing it

  • Bisection

    • A cut which divides all nodes into two nearly equal halves

    • Channel bisection: minimum channel count over all bisections

    • Bisection bandwidth: minimum bandwidth over all bisections

  • With uniform random traffic

    • ½ of traffic crosses bisection
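This yields the standard throughput bound under uniform random traffic (a sketch; N nodes each injecting at rate λ, bisection bandwidth B_B): half of all injected traffic must cross the bisection, so

$$ \frac{N\lambda}{2} \le B_B \;\Rightarrow\; \lambda \le \frac{2B_B}{N} $$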

Throughput Example

  • Bisection = 4 (2 in each direction)

  • With uniform random traffic

    • Node 3 sends 1/8 of its traffic to nodes 4, 5, and 6

    • Node 3 sends 1/16 of its traffic to node 7 (2 possible shortest paths)

    • Node 2 sends 1/8 of its traffic to nodes 4 and 5

    • etc.

  • Suppose the channel load = 1

(6) Path Diversity

  • Multiple shortest paths (R) between a source/destination pair

  • Fault tolerance

  • Better load balancing in network

  • Routing algorithm should be able to exploit path diversity

Many possible network topologies

  • Bus

  • Crossbar

  • Ring

  • Tree

  • Omega

  • Hypercube

  • Mesh

  • Torus

  • Butterfly

  • ………

Bus interconnect

  • Good:

    • Simple design

    • Cost effective for a small number of nodes

    • Easy to implement coherence (via snooping)

  • Bad:

    • Contention: all nodes contend for shared bus

    • Limited bandwidth: all nodes communicate over same wires

    • High electrical load = low frequency, high power

Crossbar interconnect

  • Each node is connected to every other node

  • Good:

    • O(1) latency and high bandwidth

  • Bad:

    • Not scalable: O(N²) switches

    • High cost

    • Difficult to arbitrate at scale

Note: in the original network illustration (omitted here), each node is drawn twice for clarity (at left and at top)

Ring

  • Good:

    • Simple

    • Cheap: O(N) cost

  • Bad:

    • High latency: O(N)

    • Bisection bandwidth remains constant as nodes are added (scalability issue)

  • Used in recent Intel architectures

    • Core i7, Xeon Phi

  • Also used in IBM CELL Broadband Engine (9 cores)

Intel’s ring interconnect (Sandy Bridge microarchitecture)

  • Four rings

    • request

    • snoop

    • ack

    • data (32 bytes)

  • Six interconnect nodes: four “slices” of L3 cache + system agent + graphics

  • Each bank of L3 connected to ring bus twice

  • Theoretical peak BW from cores to L3 at 3.4 GHz is approx. 435 GB/sec

    • When each core is accessing its local slice

Mesh

  • Direct network

  • O(N) cost

  • Average latency: O(sqrt(N))

  • Easy to lay out on chip: fixed-length links

  • Path diversity: many ways for message to travel from one node to another

  • Used by:

    • Tilera processors

    • Prototype Intel chips

Torus

  • In a mesh topology, a node near the edge differs from a node in the middle of the network

    • The torus topology adds wraparound links to avoid this asymmetry

  • Still O(N) cost, but higher cost than 2D grid

  • Higher path diversity and bisection BW than mesh

  • Higher complexity

    • Difficult to layout on chip

    • Unequal link lengths

Trees

  • Planar, hierarchical topology

  • Like mesh/torus, good when traffic has locality

  • Latency: O(lg N)

  • Use “fat trees” to alleviate root bandwidth problem

Hypercube in different dimensions

  • Low latency: O(lg N)

  • Radix: O(lg N)

  • Number of links O(N lg N)

  • A 6D hypercube was used in the 64-node Cosmic Cube computer developed at Caltech in the 1980s
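A minimal sketch of hypercube addressing (not from the slides): a node's neighbors are exactly the addresses differing in one bit, and the minimal hop count is the Hamming distance between addresses, so both degree and diameter are lg N:

```python
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    # neighbors differ in exactly one address bit: flip each bit in turn
    return [node ^ (1 << bit) for bit in range(dim)]

def min_hops(src: int, dst: int) -> int:
    # minimal hop count = Hamming distance between the two addresses
    return bin(src ^ dst).count("1")

# 6D hypercube: 64 nodes, as in the Cosmic Cube
print(hypercube_neighbors(0b000000, 6))  # [1, 2, 4, 8, 16, 32] -> degree 6
print(min_hops(0b000000, 0b111111))      # 6 = lg(64) = network diameter
```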

Multi-stage logarithmic

  • Indirect network with multiple switches between terminals

    • Cost: O(N lg N)

    • Latency: O(lg N)

  • Many variations: Omega, butterfly, Clos networks, etc.

Review: network topologies
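The review table from the original slide is not preserved in this export; the following is a reconstruction of the standard results (for N nodes; mesh and torus are sqrt(N) × sqrt(N)):

| Topology | Degree | Diameter | Bisection channels |
| --- | --- | --- | --- |
| Bus | 1 | 1 | 1 |
| Ring | 2 | N/2 | 2 |
| 2D Mesh | 2-4 | 2(sqrt(N) - 1) | sqrt(N) |
| 2D Torus | 4 | sqrt(N) | 2 sqrt(N) |
| Hypercube | lg N | lg N | N/2 |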

Flow Control

Flow Control Overview

  • Topology: determines connectivity of network

  • Routing: determines paths through network

  • Flow Control: determines the allocation of resources to messages as they traverse the network

    • Buffers and links

    • Significant impact on throughput and latency of network

  • Control state records

    • allocation of channels and buffers to packets

    • current state of packet traversing node

  • Channel bandwidth advances flits along nodes

  • Buffers hold flits waiting for channel bandwidth

Packets

  • Messages: composed of one or more packets

    • If the message size is <= the maximum packet size, only one packet is created

  • Packets: composed of one or more flits

  • Flit: flow control digit

  • Phit: physical digit

    • Subdivides a flit into chunks equal to the link width

  • Off-chip: channel width limited by pins

    • Requires phits

  • On-chip: abundant wiring means phit size == flit size

  • Packet contains destination/route information

    • Flits may not -> therefore all flits of a packet must take the same route
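A toy sketch of the message/packet/flit/phit hierarchy (all sizes hypothetical: 32-byte max packets, 8-byte flits, 2-byte phits):

```python
def chop(data: bytes, size: int) -> list[bytes]:
    # split data into fixed-size chunks; the last chunk may be shorter
    return [data[i:i + size] for i in range(0, len(data), size)]

MAX_PACKET, FLIT, PHIT = 32, 8, 2    # bytes; hypothetical link parameters

message = bytes(50)                   # a 50-byte message
packets = chop(message, MAX_PACKET)   # 2 packets: 32 B + 18 B
flits = [chop(p, FLIT) for p in packets]               # [4 flits, 3 flits]
phits = [[chop(f, PHIT) for f in fs] for fs in flits]  # phits per flit

print(len(packets), [len(fs) for fs in flits])         # -> 2 [4, 3]
```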

Switching

  • Different flow control techniques based on granularity

    • Message-based: allocation made at message granularity (circuit-switching)

    • Packet-based: allocation made to whole packets

    • Flit-based: allocation made on a flit-by-flit basis

Message-Based Flow Control

  • Coarsest granularity

  • Circuit-switching

    • Pre-allocates resources across multiple hops

      • Source to destination

      • Resources = links

      • Buffers are not necessary

    • Probe sent into network to reserve resources

Circuit Switching

  • Once probe sets up circuit

    • Message does not need to perform any routing or allocation at each network hop

    • Good for transferring large amounts of data

      • Can amortize circuit setup cost by sending data with very low per-hop overheads

  • No other message can use those resources until transfer is complete

    • Throughput can suffer due to setup and hold time for circuits

    • Links are idle until setup is complete

Circuit Switching Example

  • Significant latency overhead prior to data transfer

    • Data transfer does not pay per-hop overhead for routing and allocation

  • When there is contention

    • Significant wait time

    • Message from 1 -> 2 must wait

Time-Space Diagram: Circuit-Switching

Packet-based Flow Control

  • Break messages into packets

  • Interleave packets on links

    • Better resource utilization

  • Requires per-node buffering to store in-flight packets

  • Two types of packet-based techniques: store-and-forward and virtual cut-through

Store and Forward

  • Links and buffers are allocated to entire packet

  • Head flit waits at router until entire packet is received before being forwarded to the next hop

  • Not suitable for on-chip

    • Requires buffering at each router to hold entire packet

      • A packet cannot traverse a link until buffering is allocated for the entire packet

    • Incurs high latencies (pays serialization latency at each hop)
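In formula form (a sketch; same hypothetical symbols as in the latency section: H hops, per-hop router delay t_r, packet length L, bandwidth b), store-and-forward pays the serialization latency at every hop:

$$ T_{SAF} \approx H \left( t_r + \frac{L}{b} \right) $$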

Store and Forward Example

  • High per-hop latency

    • Serialization delay paid at each hop

  • Larger buffering required

Time-Space Diagram: Store and Forward

Packet-based: Virtual Cut Through

  • Links and buffers are allocated to entire packets

  • Flits can proceed to next hop before tail flit has been received by current router

    • But only if next router has enough buffer space for entire packet

  • Reduces the latency significantly compared to SAF

  • But still requires large buffers

    • Unsuitable for on-chip
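By contrast, virtual cut-through pays the serialization latency only once (a sketch, same symbols), which is where its latency advantage over SAF comes from:

$$ T_{VCT} \approx H \cdot t_r + \frac{L}{b} $$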

Virtual Cut Through Example

  • Lower per-hop latency

  • Large buffering required

Time-Space Diagram: VCT

Virtual Cut Through

  • Throughput suffers from inefficient buffer allocation

Time-Space Diagram: VCT (2)

Flit-Level Flow Control

  • Help routers meet tight area/power constraints

  • A flit can proceed to the next router when there is buffer space available for that flit

    • Improves over SAF and VCT by allocating buffers on a flit-by-flit basis

Wormhole Flow Control

  • Pros

    • More efficient buffer utilization (good for on-chip)

    • Low latency

  • Cons

    • Poor link utilization: if the head flit becomes blocked, all links spanning the length of the packet are idle

      • They cannot be re-allocated to a different packet

      • Suffers from head-of-line (HOL) blocking

Wormhole Example

  • 6 flit buffers/input port

Time-Space Diagram: Wormhole

Virtual Channels

  • First proposed for deadlock avoidance

    • We will not cover deadlock in this class

    • See the book On-Chip Networks for reference

  • Can be applied to any flow control

    • First proposed with wormhole

Virtual Channel Flow Control

  • Virtual channels are used to combat HOL blocking in wormhole flow control

  • Virtual channels: multiple flit queues per input port

    • Share same physical link (channel)

    • A case of virtualization

  • Link utilization improved

    • Flits on different VC can pass blocked packet

  • Packets compete for VC on flit by flit basis

  • In the example: on the downstream links, flits of each packet are available every other cycle

  • Upstream links throttle because of limited buffers

  • This does not mean the links are idle

    • They may be used by packets allocated to other VCs
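A toy sketch (names and arbitration policy hypothetical) of two VCs sharing one physical link with a round-robin arbiter; downstream, flits of each packet appear every other cycle, matching the example:

```python
from collections import deque

# two virtual channels (separate flit queues) sharing one physical link
vcs = {
    "VC0": deque(f"A{i}" for i in range(3)),  # flits of packet A
    "VC1": deque(f"B{i}" for i in range(3)),  # flits of packet B
}

cycle, order = 0, ["VC0", "VC1"]
while any(vcs.values()):
    # round-robin: grant the link to the first non-empty VC, then rotate
    for vc in order:
        if vcs[vc]:
            print(f"cycle {cycle}: link carries {vcs[vc].popleft()} ({vc})")
            order = order[1:] + order[:1]
            break
    cycle += 1
# output alternates A0, B0, A1, B1, A2, B2: neither packet blocks the other
```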

Virtual Channel Example

  • 6 flit buffers/input port

  • 3 flit buffers/VC

Example IN

Blue Gene/L 3D Torus Network

  • 360 TFLOPS (peak)

  • 2,500 square feet

  • Connects 65,536 dual-processor nodes and 1,024 I/O nodes
