
S. Tota and M.R. Casu

Seminar contents



The Premises



Homogeneous and Heterogeneous Systems-on-Chip and their interconnection networks



The Network-on-Chip approach



Examples



Our THIN contribution (Sergio’s speech)



Back to the coffee corner…


The premises



The System-on-Chip (SoC) today



Heterogeneous: ~10 IPs

Homogeneous (MP-SoC): ~10 uP (with exceptions)



On-Chip BUS (AMBA, Core Connect, Wishbone, …)



IP and uP are sold with proprietary Bus IF



Near and long-term forecast

100 IP/uP: buses are not scalable!

Physical design issues: signal integrity, power consumption, timing closure

Clock issues: is it time for the Globally Asynchronous paradigm? (Still locally synchronous)

Need for “more regular” design


Today’s Heterogeneous SoC

[Diagram: CPU, DSP, MEM, embedded FPGA, dedicated IP and I/O blocks attached to an interconnection network (bus)]


The Cell Processor



Fclock > 4 GHz.



Memory bandwidth: 25.6 GBytes per second.



I/O bandwidth: 76.8 GBytes per second.



Performance:



256 GFLOPS (Single precision at 4 GHz).



256 GOPS (Integer at 4 GHz).



25 GFLOPS (Double precision at 4 GHz).



235 square mm.



235 million transistors.



Power consumption estimated at 60-80 W @ 4 GHz
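The single-precision peak figure can be cross-checked with simple arithmetic. The sketch below assumes details that are not on the slide: 8 synergistic processing elements, each with a 4-lane single-precision SIMD unit that completes one fused multiply-add (2 floating-point operations) per lane per cycle. The function name is hypothetical.

```python
def peak_gflops(cores: int, simd_lanes: int, flops_per_lane: int, f_ghz: float) -> float:
    """Peak throughput in GFLOPS for identical SIMD cores clocked at f_ghz."""
    return cores * simd_lanes * flops_per_lane * f_ghz


# Assumed organization (not stated on the slide): 8 cores x 4 lanes x 2 flops.
single_precision = peak_gflops(cores=8, simd_lanes=4, flops_per_lane=2, f_ghz=4.0)
print(f"Single-precision peak at 4 GHz: {single_precision} GFLOPS")
```

Under these assumptions the product matches the 256 GFLOPS quoted above.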


Cell’s Element Interconnect Bus



From the trenches: D. Krolak, IBM



“Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”


Homogeneous SoC (MP-SoC)

[Diagram: eight CPU + MEM pairs attached to an interconnection network (bus or crossbar)]


Very long wires

Year 2005: 1 ns (1 GHz)

Year 2010: 0.1 ns (10 GHz)


Bus pros and cons

Cons:

Every unit attached adds parasitic capacitance, so electrical performance degrades with growth.

Bus timing is difficult in a deep submicron process.

Bus arbiter delay grows with the number of masters. The arbiter is also instance-specific.

Bandwidth is limited and shared by all units attached.

Pros:

Bus latency is zero once the arbiter has granted control.

The silicon cost of a bus is near zero.

Any bus is almost directly compatible with most available IPs, including software running on CPUs.

The concepts are simple and well understood.


What are NoCs?



According to Wikipedia:

“Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”



Imprecise…


NoC exemplified

[Diagram: nine processors (masters), one global memory (slave) and two global I/O blocks (slaves) connected through nine routing nodes]


Basic Ingredients of a NoC



N Computational Resources

Processing Elements (PE)

1 Connection Topology

Routing technique

M  N Switches

Network Interfaces


For the Connoisseurs…



Addressing system

1 Switch-level Arbitration policy

1 Communication Protocol

Programming model

Message passing

Shared Memory

Bon appétit!


NoC: Good news



Only point-to-point one-way wires are used, for all network sizes.

Aggregated bandwidth scales with the network size.

Routing decisions are distributed and the same router is re-instantiated, for all network sizes.

NoCs increase wire utilization (as opposed to ad-hoc p2p wires)
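To make the bandwidth-scaling point concrete, here is a minimal sketch comparing a shared bus with a 2D mesh NoC. It assumes every link, bus or mesh, carries the same raw bandwidth, and it counts only router-to-router links; the function names and the 16 Gb/s figure are illustrative, not taken from the slides.

```python
def bus_bandwidth_per_unit(link_gbps: float, units: int) -> float:
    """A single bus is shared, so each attached unit gets an equal slice."""
    return link_gbps / units


def mesh_links(rows: int, cols: int) -> int:
    """Number of one-way router-to-router links in a rows x cols 2D mesh."""
    bidirectional = rows * (cols - 1) + cols * (rows - 1)
    return 2 * bidirectional


def mesh_aggregate_bandwidth(link_gbps: float, rows: int, cols: int) -> float:
    """Upper bound on aggregate bandwidth: all links busy at the same time."""
    return link_gbps * mesh_links(rows, cols)


if __name__ == "__main__":
    for n in (2, 4, 8):
        units = n * n
        print(units, "units:",
              "bus share =", bus_bandwidth_per_unit(16.0, units), "Gb/s,",
              "mesh aggregate =", mesh_aggregate_bandwidth(16.0, n, n), "Gb/s")
```

The per-unit share of the bus shrinks as units are added, while the mesh aggregate grows with the number of links. The aggregate is an upper bound that real traffic rarely reaches.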


There’s no free lunch…



Internal network contention causes (often unpredictable) latency.

The network has a significant silicon area.

Bus-oriented IPs need smart wrappers.

Software needs clean synchronization in multiprocessor systems.

System designers need reeducation for new concepts.


Facts about NoCs

It is a way to decouple computation from communication

The design is layered (physical, network, application…): taming complexity is made easier

Communication between processing elements in a NoC takes place by encapsulating data in packets

The elementary packet piece to which switching and routing operations apply is the flit
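The packet/flit relationship can be sketched as follows. The head/body/tail naming, the 4-byte flit width and the header layout (destination address only) are assumptions made for illustration; the slides do not fix a flit format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str       # "head", "body" or "tail"
    payload: bytes


def packetize(dest: int, data: bytes, flit_bytes: int = 4) -> List[Flit]:
    """Split a message into a head flit carrying the destination plus data flits."""
    head = Flit("head", dest.to_bytes(flit_bytes, "big"))
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)]
    flits = [head] + [Flit("body", chunk) for chunk in chunks]
    if len(flits) > 1:
        flits[-1].kind = "tail"   # the last data flit closes the packet
    return flits


def depacketize(flits: List[Flit]) -> bytes:
    """Reassemble the original data from everything after the head flit."""
    return b"".join(f.payload for f in flits[1:])


if __name__ == "__main__":
    for flit in packetize(dest=5, data=b"abcdefghij"):
        print(flit.kind, flit.payload)
```

Only the head flit carries routing information here, which is why switches can forward the remaining flits without inspecting them.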


Topologies



Heritage of networks with new constraints



Need to accommodate interconnects in a 2D layout



Cannot route long wires (clock frequency bound)

[Figure: example topologies: SPIN, CLICHÉ, torus, folded torus, octagon, BFT]
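One way to compare two of the topologies named above is the average hop count between distinct nodes. The sketch below does this for an n x n mesh and an n x n torus under the assumption of minimal routing; function names are illustrative. It ignores wire length, which is the constraint that makes the wrap-around links of a plain torus unattractive on chip.

```python
from itertools import product
from typing import Callable, Tuple

Node = Tuple[int, int]


def mesh_hops(a: Node, b: Node) -> int:
    """Minimal hop count in a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a: Node, b: Node, n: int) -> int:
    """Minimal hop count in an n x n torus, where each row and column wraps around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, n - dx) + min(dy, n - dy)


def average_hops(n: int, hops: Callable[[Node, Node], int]) -> float:
    """Average hop count over all ordered pairs of distinct nodes."""
    nodes = list(product(range(n), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    n = 4
    print("mesh :", average_hops(n, mesh_hops))
    print("torus:", average_hops(n, lambda a, b: torus_hops(a, b, n)))
```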


Switching



Again, techniques inherited from computer and communication networks

New constraints in silicon: area and power

Use as few buffers as possible

Store & Forward and Virtual Cut-Through: need buffer space for an entire packet, unsuited!

Limited buffer size in Wormhole and in Deflection Routing, a.k.a. “Hot Potato”

Virtual channels: increase buffer size…
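The buffer and latency trade-off between these switching techniques can be illustrated with idealized formulas. The sketch assumes one flit per cycle per link, one cycle per hop and no contention; these are simplifications, and the function names are hypothetical.

```python
def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    """Each switch receives the whole packet before forwarding it."""
    return hops * packet_flits


def wormhole_latency(hops: int, packet_flits: int) -> int:
    """The header advances one hop per cycle; the other flits follow in a pipeline."""
    return hops + packet_flits - 1


def min_input_buffer_flits(technique: str, packet_flits: int) -> int:
    """Smallest input buffer (in flits) with which the technique can operate."""
    if technique in ("store_and_forward", "virtual_cut_through"):
        return packet_flits   # a blocked packet must fit entirely in one switch
    if technique == "wormhole":
        return 1              # a blocked packet stays spread over several switches
    raise ValueError(f"unknown technique: {technique}")


if __name__ == "__main__":
    hops, size = 4, 8
    print("S&F latency     :", store_and_forward_latency(hops, size), "cycles")
    print("wormhole latency:", wormhole_latency(hops, size), "cycles")
    print("wormhole buffer :", min_input_buffer_flits("wormhole", size), "flit(s)")
```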


Routing



Deterministic vs. adaptive:

Simpler/more complex routing logic

Easy/hard to guarantee deadlock freedom

Prone/robust to congestion

2D dimension-order routing (XY): the most used static routing in NoCs (e.g. with wormhole switching on a mesh)
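A minimal sketch of XY dimension-order routing on a 2D mesh: the packet first travels along X until the column matches, then along Y. Coordinates are (x, y) tuples and the function name is illustrative.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Return the list of nodes visited from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # step 1: correct the X coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # step 2: correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the route depends only on source and destination, the routing logic is small, but the path cannot adapt to congestion.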


Who first had the idea?



No clear parenthood. The most cited papers according to Google (#cit.):

Guerrier’00 (204), A Generic Architecture for On-Chip Packet-Switched Interconnections

Dally’01 (392), Route Packets, Not Wires: On-Chip Interconnection Networks

Benini’02 (417), Networks on Chips: A New SoC Paradigm

Kumar’02 (184), A Network on Chip Architecture and Design Methodology


SPIN (Guerrier et al., DATE ’00/’03)

Wormhole switching, adaptive routing and credit-based flow control.

It is based on a fat-tree topology.

A flit is only one word (36 bits, 4 bits are for packet framing).

The input buffers have a depth of 4 words.


Kumar et al., ISVLSI’02



Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ)



2D Mesh Topology



Message Passing


Pande et al., TCOMP’05



Butterfly Fat Tree



Wormhole, Virtual channels



Header flits: 3 ck cycles latency (input arbitration, routing, output arbitration)



“Body” flits: 3 ck cycles latency (input arbitration, switch traversal, output arbitration)


Goossens et al., IEE CDT’03



Both VCT (virtual cut-through) and WH (wormhole) switching, GT (guaranteed throughput) and BE (best effort) services, IQ (input queuing) and VOQ (virtual output queuing)

GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from input “j” to output “k” in a store-and-forward (S&F) fashion.

BE uses matrix scheduling

GT connections are set up by special BE system packets

Prototype with WH and IQ

5 ports

0.13 µm, 0.26 mm², 500/166 MHz

Flit size = 3 words, each 32 bits

80 Gb/s aggregate bandwidth
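The TDM idea behind the GT service can be sketched with a per-router slot table: a connection reserves an (input, output) pair in a given time slot, and a reservation is refused if that slot already uses the same output or the same input. This is a simplified single-router model written for illustration; the class and method names are hypothetical and do not reproduce the actual router design.

```python
from typing import Dict, Optional, Tuple


class SlotTable:
    """Time-division slot table for one router: (slot, output) -> input."""

    def __init__(self, num_slots: int, num_ports: int) -> None:
        self.num_slots = num_slots
        self.num_ports = num_ports
        self.table: Dict[Tuple[int, int], int] = {}

    def reserve(self, slot: int, in_port: int, out_port: int) -> bool:
        """Reserve in_port -> out_port in a slot; fail if it would cause contention."""
        slot %= self.num_slots
        if (slot, out_port) in self.table:
            return False      # output already granted to another input in this slot
        if any(s == slot and i == in_port for (s, _), i in self.table.items()):
            return False      # this input already sends somewhere in this slot
        self.table[(slot, out_port)] = in_port
        return True

    def input_for(self, cycle: int, out_port: int) -> Optional[int]:
        """Which input drives out_port in this cycle; None means free for BE traffic."""
        return self.table.get((cycle % self.num_slots, out_port))


if __name__ == "__main__":
    table = SlotTable(num_slots=4, num_ports=5)
    print(table.reserve(0, in_port=1, out_port=2))   # first reservation
    print(table.reserve(0, in_port=3, out_port=2))   # same slot, same output
    print(table.input_for(4, out_port=2))            # the table repeats every 4 slots
```

Since conflicts are resolved when the connection is set up, GT data never has to arbitrate at run time, and unreserved slots remain available to BE packets.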


Common properties



Data integrity, meaning that data is delivered uncorrupted

Lossless data delivery, which means no data is dropped in the interconnect

In-order data delivery, which specifies that the order in which data is delivered is the same order in which it has been sent

Throughput and latency services that offer time-related bounds.


What is new?



Yes, you are quite right: no new concepts

An amazing application of network ideas to the chip context

But ideas need to be re-contextualized

Old constraints: latency, bandwidth

New constraints are very tight: area, power, clocks

Differences between fine-grain NoCs and large-grain networks

Today links are 100% reliable. This might no longer hold for ultra-scaled technologies and globally asynchronous NoCs

For many applications, lowest latency is more important than highest bandwidth


Simulation Issues



Stochastic traffic generators



Ease of implementation/simulation



Fast simulation



MP-SoC loop interactions ignored?



Self-similar traffic used by some



Trace-Based Simulation



Need for extensive pre-simulation



Long simulations (days-weeks)



Accurate results
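A stochastic traffic generator of the kind listed above can be very small, which is part of its appeal. The sketch below assumes Bernoulli injection (each node injects a packet in a cycle with a fixed probability) and uniformly random destinations; other traffic models, such as self-similar traffic, would need a different source process. Names and parameters are illustrative.

```python
import random
from typing import List, Tuple

Packet = Tuple[int, int, int]   # (injection cycle, source node, destination node)


def uniform_random_traffic(num_nodes: int, injection_rate: float,
                           cycles: int, seed: int = 0) -> List[Packet]:
    """Generate packets with Bernoulli injection and uniform random destinations."""
    rng = random.Random(seed)            # seeded so that runs are reproducible
    packets: List[Packet] = []
    for cycle in range(cycles):
        for src in range(num_nodes):
            if rng.random() < injection_rate:
                dst = rng.randrange(num_nodes - 1)
                if dst >= src:           # skip over src so a node never sends to itself
                    dst += 1
                packets.append((cycle, src, dst))
    return packets


if __name__ == "__main__":
    trace = uniform_random_traffic(num_nodes=16, injection_rate=0.1, cycles=1000)
    print(len(trace), "packets, first:", trace[0])
```

Such a generator knows nothing about the feedback between processors and the network, which is the "loop interactions ignored" concern raised above.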

 Stay tuned for Sergio’s speech…


Applications



Main NoC feature: high communication bandwidth

Desirable feature for MP-SoC: low communication latency

The two are often conflicting requirements:

“Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you can’t bribe God.” (Anonymous)

Desperately seeking benchmarks and killer applications

Networking!!!

Multimedia?


The THIN NoC



What we think will make a NoC sexy enough for chip designers:

Least switch area and power

Fast and low-latency switch

Ideally a single clock cycle of latency and a cutting-edge clock frequency Fck (technology limited)

Large bandwidth = high Fck × high data parallelism

Need for a lightweight NoC design

THIN: Torino Hawaii Interconnection Network

Joint work with the University of Hawaii at Manoa, Dept. of Electrical Engineering


Some References



J. Rabaey et al., “A 1-V heterogeneous reconfigurable DSP IC for wireless baseband digital signal processing,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1697-1704.

P. Guerrier and A. Greiner, “A Generic Architecture for On-Chip Packet-Switched Interconnections,” Proc. Design, Automation and Test in Europe (DATE), Mar. 2000, pp. 250-256.

A. Adriahantenaina et al., “SPIN: a Scalable, Packet Switched, On-chip Micro-network,” Proc. Design, Automation and Test in Europe (DATE), Mar. 2003.

L. Benini and G. De Micheli, “Networks on Chips: A New SoC Paradigm,” Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.

S. Kumar et al., “A Network on Chip Architecture and Design Methodology,” Proc. ISVLSI, 2002.

W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” Proc. Design Automation Conf., 2001.

K. Goossens et al., “Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip,” IEE Proc.-Comput. Digit. Tech., vol. 150, no. 5, Sep. 2003, pp. 294-302.

P.P. Pande et al., “Performance Evaluation and Design Trade-offs for Network-on-Chip Interconnect Architectures,” IEEE Trans. Computers, vol. 54, no. 8, Aug. 2005, pp. 1025-1040.
