


## Architecting Chiplet-Based Systems

Natalie Enright Jerger, Canada Research Chair in Computer Architecture, enright@ece.utoronto.ca, www.eecg.toronto.edu/~enright




## **Executive Summary**

#### Why build chiplet-based systems?

- End of technology scaling
- Rise of heterogeneity
- Demands of big data

#### How to build chiplet-based systems?

- Reintegrate with hybrid topologies
- Deadlock-free routing for independent, modular design

#### Where do we go from here?



Source: G.E. Moore, Electronics 1965



Moore's Law: Enabling exponential growth in functionality per unit area of silicon




Release Dates for Intel Lead Generation Products











#### GlobalFoundries Stops All 7nm Development: Opts To Focus on Specialized Processes

by Anton Shilov & Ian Cutress on August 27, 2018 4:01 PM EST


#### 7LP CANNED DUE TO STRATEGY SHIFT








N. Enright Jerger (University of Toronto)

#### Development costs 'prohibitively high' for 7nm chips for everybody but Apple and TSMC

By Roger Fingas, Tuesday, September 04, 2018, 05:51 am PT (08:51 am ET)

In the short term at least, Apple's 2018 iPhones are liable to be the only smartphones with 7-nanometer processors, a report suggested on Tuesday.











#### **End of Dennard scaling**


Need power-efficient alternatives to general-purpose computing



Source: David Brooks

#### Not just Machine Learning

SoC integration challenges for datacentres, cellphones





#### Heterogeneous manufacturing processes for different IP

## **Challenges: Big Data**



Source: International Data Corporation, 2016

Workloads are increasingly memory- and communication-bound

Need to integrate lots of memory!

- A means to continue integrating more functionality
- A means to deal with IP and manufacturing heterogeneity
- A means to enable greater memory integration and efficient communication

All while combating skyrocketing manufacturing costs

But first...



# Walk down memory lane (1971)

#### Intel introduces 4004

- 1st commercial microprocessor
- 2300 transistors
- 13 mm<sup>2</sup>

#### Everything on one chip

- No more slow chip crossings
- Cheaper manufacturing!

A sea change for the computer industry

#### **Disintegrate chips into chiplets (2020)**



# Large <u>Cost-Effective</u> SoCs through Disintegration



#### Why disintegrate?

Want more functionality, but...

- Big chips are expensive

Break (disintegrate) into several smaller pieces

- Cheaper to manufacture











Disintegrated SoCs have potential for reducing costs of large chips while maintaining functionality
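The cost argument can be made concrete with a standard negative-binomial yield model. The sketch below is illustrative only: the defect density, the clustering parameter `alpha`, and the die areas are assumed values rather than figures from this talk, and the function names are hypothetical. It also ignores interposer, assembly, and test costs.

```python
def die_yield(area_mm2, defects_per_mm2, alpha=3.0):
    """Negative-binomial yield model: Y = (1 + A * D0 / alpha) ** (-alpha)."""
    return (1.0 + area_mm2 * defects_per_mm2 / alpha) ** (-alpha)


def silicon_per_good_system(total_area_mm2, num_chiplets, defects_per_mm2):
    """Average silicon area fabricated per working system.

    Assumes the logic is split into equal chiplets that are tested
    before assembly, so only known-good dies are combined.
    """
    chiplet_area = total_area_mm2 / num_chiplets
    return total_area_mm2 / die_yield(chiplet_area, defects_per_mm2)


if __name__ == "__main__":
    D0 = 0.002  # assumed defect density: 0.2 defects per cm^2
    for n in (1, 2, 4, 8):
        area = silicon_per_good_system(600.0, n, D0)
        print(f"{n} chiplet(s): {area:.0f} mm^2 of silicon per good system")
```

Under these assumptions, the silicon spent per working system falls as the same logic is split into more, smaller dies, because a defect then scraps a small chiplet instead of the whole chip.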



























**Process variations lead to different maximum operating frequencies**

- Sort chips before assembly to improve speed binning
- Within-die variations hurt performance of large monolithic chips
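A small sketch of why sorting before assembly helps. It assumes, purely for illustration, that an assembled system runs at the speed of its slowest chiplet; the frequencies and the `assemble` helper are hypothetical.

```python
def assemble(fmax_ghz, chiplets_per_system, sort_first):
    """Group chiplets into systems and return each system's frequency.

    Assumption: a system is limited by its slowest chiplet.
    """
    chips = sorted(fmax_ghz, reverse=True) if sort_first else list(fmax_ghz)
    systems = [chips[i:i + chiplets_per_system]
               for i in range(0, len(chips), chiplets_per_system)]
    # Drop a trailing, incomplete group if the chip count does not divide evenly.
    return [min(s) for s in systems if len(s) == chiplets_per_system]


if __name__ == "__main__":
    measured = [3.4, 2.6, 3.0, 3.2, 2.8, 3.6, 2.7, 3.3]  # made-up test results
    print("unsorted:", assemble(measured, 4, sort_first=False))
    print("sorted:  ", assemble(measured, 4, sort_first=True))
```

With these made-up numbers, assembling in arrival order gives two slow systems, while sorting first produces one fast system and one slow one, so more parts land in the top speed bin.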



#### **Fragmented Architecture**

Disintegrated SoCs have potential for reducing costs of large chips

But performance degrades with disintegration granularity

# How to integrate chiplets?




**Enable small/simple chiplets** 

High bandwidth/low latency connections between chiplets

**Ease of manufacturing** 

### How to integrate?

#### **Multi-chip modules (MCM)**

- Avoids pin limitations of multi-package solutions
- Bandwidth/latency constraints of C4 bumps and substrate



#### **Embedded Multi-Chip Interconnect Bridge (EMIB)**

- Small/simple chiplets
- Small bridge die
- Avoids large die (interposer)
- Avoids manufacturing challenges/costs
- Only offers point-to-point connections
- Misses opportunity to offload some functionality to interposer

Courtesy of Intel

#### **Silicon Interposer (2.5D)**

Technology maturation (high-volume passive interposer production: 3 years)





#### **Active Silicon Interposer (2.5D)**

- Simple, small chiplets
- Move SoC functionality into interposer
  - Implement in older technology node



#### Facilitates modular SoC design

#### But what about...

- Cost: Aren't interposers expensive?
- Communication: How do we reintegrate?
- How to maximize modularity while reintegrating?

## How do we architect chiplet-based systems?

- Topologies to connect chiplets
- Modular, deadlock-free routing

#### **Network-on-Chip on Interposer to Reintegrate**

**Q1:** <u>How</u> do you build a NoC on the interposer?

**Q2:** <u>What</u> type of NoC should you build?

Q1: ... current interposers are passive!

An active interposer is a huge chip, which should have horrible yield, no?!?













#### Same Size Interposer









Chip yield impacted by defects in *critical areas* (e.g., contaminant in white space is fine)

### Minimally Active Interposers for Large SoCs

| Defect Density          | Low   | Medium | High  |
|-------------------------|-------|--------|-------|
| Passive Interposer      | 98.5% | 95.5%  | 92.7% |
| Active Interposer 1%    | 98.4% | 95.4%  | 92.5% |
| Active Interposer 10%   | 98.0% | 94.2%  | 90.7% |
| Fully-Active Interposer | 87.2% | 68.5%  | 55.6% |

\*Modelled, not real yield rates
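The critical-area idea can be sketched with the same kind of yield formula, scaled by the fraction of the interposer that holds active devices. This is a simplified illustration, not the model behind the table: the area, defect density, and `alpha` are assumed values, and only defects in active area are counted, so a fully passive interposer would show 100% yield here.

```python
def interposer_yield(area_mm2, active_fraction, defects_per_mm2, alpha=3.0):
    """Yield when only the active (critical) fraction of the area is defect-sensitive."""
    critical_area = area_mm2 * active_fraction
    return (1.0 + critical_area * defects_per_mm2 / alpha) ** (-alpha)


if __name__ == "__main__":
    AREA = 800.0  # assumed interposer area in mm^2
    D0 = 0.001    # assumed defect density in defects per mm^2
    for fraction in (0.01, 0.10, 1.00):
        y = interposer_yield(AREA, fraction, D0)
        print(f"active fraction {fraction:4.0%}: modelled yield {y:.1%}")
```

The trend matches the qualitative message of the table: a small active fraction costs little yield, while a fully active interposer of the same size yields far worse.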




### Active interposer is not free...

... But appears practical if used judiciously

### So how should we design our NoC on interposer?









Link Utilization



#### **Double Butterfly**













#### **Bisection links are the primary bottleneck**






Folded Torus (X+Y) vs. **Misaligned ButterDonut (X)**

- Small # routers
- Small # links\*
- Low average hop count
- Large bisection bandwidth

Hybrid topology + misalignment gets you best of everything (almost)



Can design an interposer NoC topology to overcome disintegration-induced fragmentation of SoC
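The two metrics used to compare topologies, average hop count and bisection bandwidth, can be computed directly from a topology's link list. The sketch below does this for a plain mesh and a torus only (a folded torus is the same graph as a torus with a different physical layout); it does not model the ButterDonut, and all function names are hypothetical.

```python
from collections import deque
from itertools import product


def grid_links(n, wrap):
    """Undirected links of an n x n mesh (wrap=False) or torus (wrap=True)."""
    links = set()
    for x, y in product(range(n), repeat=2):
        for dx, dy in ((1, 0), (0, 1)):
            nx, ny = x + dx, y + dy
            if wrap:
                nx, ny = nx % n, ny % n
            if nx < n and ny < n:
                links.add(frozenset({(x, y), (nx, ny)}))
    return links


def average_hops(n, links):
    """Mean shortest-path length over all ordered pairs of distinct routers."""
    adj = {node: [] for node in product(range(n), repeat=2)}
    for link in links:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def bisection_links(n, links):
    """Number of links crossing the cut that splits the columns in half."""
    half = n // 2
    count = 0
    for link in links:
        (x0, _), (x1, _) = tuple(link)
        if (x0 < half) != (x1 < half):
            count += 1
    return count


if __name__ == "__main__":
    for name, wrap in (("mesh", False), ("torus", True)):
        links = grid_links(4, wrap)
        print(name, len(links), "links,",
              f"{average_hops(4, links):.2f} avg hops,",
              bisection_links(4, links), "bisection links")
```

For a 4x4 network, the torus uses more links than the mesh but lowers the average hop count and doubles the number of links across the bisection.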








Now we have a NoC that spans both chiplets and interposer

Each chiplet may be designed independently

#### Goals:

- Free to choose NoC topology on chiplet
- Free to choose local routing algorithm within chiplet (deadlock free)

Problem: Even though NoCs for chiplets and interposer are individually deadlock free, how do you ensure the final composed system is still correct?

## **Deadlock primer**

### Deadlock

Can occur when packets are allowed to hold some resources while requesting others

### **Deadlock avoidance**

- Prevent dependency cycles from forming
- Example: Turn restrictions
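A minimal sketch of how turn restrictions remove dependency cycles. It builds a channel dependency graph for a small mesh and checks it for cycles, first with every turn allowed and then with XY (dimension-order) routing, which forbids turning from the Y dimension back into X. The helper names are hypothetical and the model ignores virtual channels.

```python
from itertools import product


def mesh_channels(n):
    """Directed channels (src, dst) between neighbouring routers of an n x n mesh."""
    channels = []
    for x, y in product(range(n), repeat=2):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n:
                channels.append(((x, y), (nx, ny)))
    return channels


def direction(channel):
    (x0, y0), (x1, y1) = channel
    return (x1 - x0, y1 - y0)


def build_cdg(channels, turn_allowed):
    """Edge a -> b when a packet may hold channel a while requesting channel b."""
    cdg = {c: [] for c in channels}
    for a in channels:
        for b in channels:
            consecutive = a[1] == b[0]
            u_turn = b[1] == a[0]
            if consecutive and not u_turn and turn_allowed(direction(a), direction(b)):
                cdg[a].append(b)
    return cdg


def has_cycle(cdg):
    """Kahn's algorithm: a cycle exists if some channel can never be removed."""
    indegree = {c: 0 for c in cdg}
    for successors in cdg.values():
        for s in successors:
            indegree[s] += 1
    ready = [c for c, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for s in cdg[c]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return removed != len(cdg)


def any_turn(d_in, d_out):
    return True


def xy_turn(d_in, d_out):
    # XY routing: once a packet moves in Y (dx == 0) it may not turn back into X.
    return not (d_in[0] == 0 and d_out[0] != 0)


if __name__ == "__main__":
    channels = mesh_channels(3)
    print("all turns allowed, cyclic:", has_cycle(build_cdg(channels, any_turn)))
    print("XY routing, cyclic:       ", has_cycle(build_cdg(channels, xy_turn)))
```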



## **Deadlock in Chiplet-based Systems**

Deadlocks can occur even if individual chiplets are deadlock free



#### **Analysis scalability**

- Analyze entire composition of NoCs and all possible paths
- Global channel dependency graph (CDG)

#### **Locally optimized chiplets**

- Allow local optimization independent of final SoC organization

#### **Lack of information on other chiplets in system**

- 3<sup>rd</sup> party may not want to share
- Other chiplets may not have been designed/finalized yet

Chiplet composability is a HARD problem

Need a composable approach without global CDG

#### **Step 1: Abstract node**

- Abstract rest of the system with a single node (key insight)
- Connect the chiplet to the abstract node



#### **Step 2: Turn restrictions**

Apply turn restrictions only at boundary nodes

- Inbound turn restrictions
- Outbound turn restrictions

Program chiplet routing tables for outbound messages

Messages must be routed through the correct boundary nodes
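Steps 1 and 2 can be illustrated on a toy chiplet. The sketch below is not the algorithm from the talk: it hand-picks restrictions for a two-router chiplet whose routers `R0` and `R1` are both boundary routers, represents the rest of the system by a single abstract node `A`, and checks the resulting local channel dependency graph for cycles. All names and the chosen restrictions are assumptions made for illustration.

```python
def build_cdg(channels, forbidden_turns=frozenset()):
    """Channel dependency graph for a set of directed channels (src, dst)."""
    cdg = {c: [] for c in channels}
    for a in channels:
        for b in channels:
            consecutive = a[1] == b[0]
            u_turn = b[1] == a[0]
            if consecutive and not u_turn and (a, b) not in forbidden_turns:
                cdg[a].append(b)
    return cdg


def has_cycle(cdg):
    """Kahn's algorithm: a cycle exists if some channel can never be removed."""
    indegree = {c: 0 for c in cdg}
    for successors in cdg.values():
        for s in successors:
            indegree[s] += 1
    ready = [c for c, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for s in cdg[c]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return removed != len(cdg)


# Toy chiplet: R0 <-> R1 internally, each also linked to the abstract node "A".
CHANNELS = [("R0", "R1"), ("R1", "R0"),
            ("R0", "A"), ("A", "R0"),
            ("R1", "A"), ("A", "R1")]

# Hypothetical restrictions, applied at boundary router R1 only.
RESTRICTIONS = {
    (("A", "R1"), ("R1", "R0")),  # inbound: traffic entering at R1 may not continue to R0
    (("R0", "R1"), ("R1", "A")),  # outbound: traffic from R0 may not leave through R1
}

if __name__ == "__main__":
    print("no restrictions, cyclic:  ", has_cycle(build_cdg(CHANNELS)))
    print("with restrictions, cyclic:", has_cycle(build_cdg(CHANNELS, RESTRICTIONS)))
```

In this toy case the restrictions break both cycles through the abstract node without consulting any other chiplet. The price is that messages for `R0` must enter at `R0`, and messages from `R0` must leave there, which is why routing through the correct boundary node matters.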

#### **Step 3: Reachability**

- Propagate inbound reachabilities to the interposer (system integrator)
- Program interposer routing tables at integration
- Interposer NoC must be deadlock-free by itself
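One way to picture the integration step: each chiplet reports which of its routers can be reached from each boundary router, and the integrator picks an entry point per destination. The sketch below is a hypothetical illustration; the data layout, the function name, and the "fewest hops inside the chiplet" tie-break are assumptions, not details given in the talk.

```python
def program_interposer_table(inbound_reach, hops_inside):
    """Choose the boundary router used to enter the chiplet for each destination.

    inbound_reach: {boundary router: set of routers reachable from it}
    hops_inside:   {(boundary router, destination): hops inside the chiplet}
    """
    table = {}
    for boundary, destinations in inbound_reach.items():
        for dest in destinations:
            best = table.get(dest)
            if best is None or hops_inside[(boundary, dest)] < hops_inside[(best, dest)]:
                table[dest] = boundary
    return table


if __name__ == "__main__":
    # Made-up chiplet: routers R0-R1-R2-R3 in a line, boundary routers R0 and R3.
    reach = {"R0": {"R0", "R1", "R2"}, "R3": {"R2", "R3"}}
    hops = {("R0", "R0"): 0, ("R0", "R1"): 1, ("R0", "R2"): 2,
            ("R3", "R3"): 0, ("R3", "R2"): 1}
    print(program_interposer_table(reach, hops))
```

Only the reachability summary crosses the chiplet boundary, so the integrator never needs the chiplet's internal topology or routing algorithm.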



How to determine boundary router locations and turn restrictions?



#### **Boundary router placement**

- Physical constraints
- Load balancing
- Route distance

#### **Turn restriction**

- Distance to/from boundary node
- Load balance

#### **Objective function**

- Distance
- Reachability

**Our objective: minimize** (average distance) / (average reachability)






#### Does not require a CDG

#### **Outperforms most prior work**

Room for improvement: load imbalance and head-of-line blocking

# What are the open challenges and opportunities?




## Passive interposers currently in fashion

Can manufacture minimally active interposer with reasonable cost

## Opportunity to offload more functionality to interposer

System monitoring, security features, auxiliary compute devices



#### **Die-to-die variations in re-integrated SoC**

- Additional timing or voltage margins
- Less efficient, lower performance

#### **Independent clock domains**

- DVFS management
- Mitigate overheads of clock crossings



#### **3D NoC spanning multiple process technologies?**







#### **Alternative chiplet placements?**

Change NoC traffic patterns: new bottlenecks, new opportunities

## Interaction between in-package memory stacks and external memories?



### End of scaling

Rise of accelerators to provide performance, power efficiency, security

### Mix and match

Not all systems need every flavour of accelerator

### **Additional challenges**

- Interfaces
- QoS
- Coherence





#### **Disintegrate chips**

Build cost-effective LARGE SoCs

#### **Reintegrate with an active silicon interposer**

- Minimal active area to reduce cost
- Novel NoC topologies to improve performance

#### **Ensure composability**

Deadlock-free routing that allows chiplets to be optimized independently

**Open questions and opportunities for research!**

## Acknowledgements

#### **Graduate Students**

Ajay Kannan, Zimo Li

#### **Collaborators at AMD**

**Gabriel Loh, Jieming Yin**, Onur Kayiran, Matthew Poremba, Zhifeng Lin, Muhammad Shoaib Bin Altaf, Yasuko Eckert

#### **Funding**

UofT, Natural Sciences and Engineering Research Council of Canada

# Thanks

Natalie Enright Jerger enright@ece.utoronto.ca