# Power Efficiency of Wavelength-Routed Optical NoC Topologies for Global Connectivity of 3D Multi-Core Processors

Luca Ramini and Davide Bertozzi



### PHOTONICA PROJECT



Italian Government under the "FIRB-Futuro in Ricerca" program

## **Optical On-Chip Communication**

For the first time, the photonic elements needed to build a complete <u>on-chip optical communication infrastructure</u> (modulators, photodetectors, CMOS drivers and receivers) are viable for integration on a silicon chip.



Target Platform for chip-scale optical interconnect technology: **3D stacking of processing, memory and optical layers** 

# Background: Space-Routed ONoCs



✓ Optical path control (Shacham'07) is expensive (hybrid NoC, path setup latency/contention)


It might not be the most appropriate mechanism for cost- and/or latency-constrained communications (e.g. control applications where response time is the key metric, Akesson2011).

# **Our focus:** Wavelength-Routed ONoCs

### WAVELENGTH-SELECTIVE ROUTING

- Packet routing depends solely on the wavelength of its carrier signal.
- Path is <u>configured at design time</u> for a source-destination pair.
- It does not depend on ongoing transmissions by other nodes.
- No time is spent in routing/arbitration.
- It enables contention-free full connectivity without any path setup/teardown overhead.
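As a minimal sketch of how a design-time wavelength assignment can be contention-free, the snippet below assumes a cyclic (Latin-square) assignment. This is one common choice and not necessarily the one used by the topologies discussed in these slides; the name `wavelength_for` and the 8-node size are illustrative assumptions.

```python
N = 8  # number of initiators and targets (illustrative)

def wavelength_for(src: int, dst: int, n: int = N) -> int:
    """Index of the wavelength that initiator `src` uses to reach target `dst`.

    Assumed cyclic assignment: (src + dst) mod n.
    """
    return (src + dst) % n

# Routing table fixed at design time: one wavelength per (source, destination) pair.
table = [[wavelength_for(s, d) for d in range(N)] for s in range(N)]

def is_contention_free(t):
    """True if no wavelength repeats within any row or any column."""
    n = len(t)
    rows_ok = all(len(set(row)) == n for row in t)
    cols_ok = all(len({t[s][d] for s in range(n)}) == n for d in range(n))
    return rows_ok and cols_ok

print(is_contention_free(table))
```

Distinct wavelengths along each row mean an initiator selects its target purely by carrier wavelength; distinct wavelengths along each column mean no two initiators reach the same target on the same wavelength, so no arbitration is needed.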

#### VIRTUAL VIEW



### **CHALLENGES:**

- ✓ HARD TO SCALE TO A LARGE NUMBER OF COMMUNICATION ACTORS
- ✓ MAINLY PROPOSED IN TERMS OF LOGIC SCHEME SO FAR, OVERLOOKING PHYSICAL IMPLEMENTATION EFFECTS



WE ASSESS THE DEVIATION BETWEEN LOGIC SCHEME AND ITS PHYSICAL IMPLEMENTATION UNDER THE EFFECT OF <u>PLACEMENT CONSTRAINTS</u> TARGETING REAL LIFE SYSTEMS (e.g. 64 cores)

# Key Concern: The Predictability Gap

### This work intends to quantify the **Design Predictability Gap** of Wavelength-Routed Optical NoCs (WR-ONoCs)

under the effect of placement and routing constraints in the target system.



Physical-layer awareness makes it possible to quantify the deviation of the "physical topology" from its "logic connectivity scheme" (not just a matter of efficiency, but even of feasibility!)

### **Key Contributions: Placement Constraints**



### Key effect this work is going to quantify:

The number of waveguide crossings on the actual **layout may be much larger than in the logic** scheme due to the mapping constraint on a 2D surface

THE INSERTION LOSSES and LASER POWER REQUIREMENTS may WORSEN to such an extent that an elegant logic scheme may <u>turn out to be overly expensive and even unfeasible</u>

These effects are highly design-specific, which motivates the choice of a concrete **experimental setting**: **processor-memory communication in a 3D stacked multi-core processor**

# Target Architecture: 3D Stacked Multi-core Processor

PROMISING SCENARIO FOR COST-EFFECTIVE INTEGRATION OF HETEROGENEOUS TECHNOLOGIES.



## **Target Architecture: The Electronic Layer**

The Electronic Layer consists of 64 homogeneous processor cores connected by an Electronic NoC with a 2D Mesh Topology.

### **ASSUMPTIONS**

- Cores are grouped into <u>4 clusters Ci of 16 cores each</u>
- Each cluster has its own access to the optical layer which is vertically stacked on top of the electronic layer
- Core size is <u>1mm x 1mm</u>
- Die size is <u>8mm x 8mm</u>

E-NoC: 64 cores connected to a 2DMesh



Clusters and Aggregation Factor

The **number of cores** inside each cluster represents the **Aggregation Factor** 

# **Target Architecture: The Optical Layer**

- The cluster <u>gateways</u> to the optical layer are defined as the *Hubs (Hi)*.
- Wavelength sharing: the same wavelengths can be shared by all the initiators.
- Optical power is provided by an array of off-chip Continuous Wave (CW) lasers.



The **Optical Layer** offers **three kinds** of communications:

(a) Among Clusters

(b) From a cluster to a Memory Controller of an off-chip DRAM DIMM

(c) From a Memory Controller to a Cluster

## **PLACEMENT CONSTRAINTS**



Placement constraints: the <u>Memory Controllers</u> are positioned pairwise on opposite sides of the chip, <u>reflecting conventional industrial practice</u> (e.g. Tilera TILE64).

## Wavelength-Routed Optical NoCs Topologies (WRONoCs)

The optical layer makes use of **8 initiators** that have to communicate with **8 targets** 

We need to connect **4 hubs and 4 Memory Controllers** with the target interface of the same **4 hubs and 4 Controllers**.

We use a Wavelength-Routed Optical NoC to deliver all kinds of communication in the optical layer, namely inter-cluster traffic, off-chip memory access requests and memory responses (Global Connectivity Scenario).

THE MOST RELEVANT WRONoC LOGIC SCHEMES WERE EXPLORED IN OUR TARGET ARCHITECTURE WITH AWARENESS OF PLACEMENT CONSTRAINTS



## **WRONoC: 8x8 λ-Router Logic Scheme**

### Such assumptions are **somewhat unrealistic**

Initiators are placed at the leftmost side of the Network



Targets are placed at the rightmost side of the Network


A. Scandurra and I. O'Connor, "Scalable CMOS-compatible photonic routing topologies for versatile networks on chip", NoC-Architecture, 2008.

- To connect 8 initiators with 8 targets, the network uses 8 stages that alternate between 4 and 3 add-drop optical filters.
- 8 different wavelengths are needed to satisfy all communication requirements.
- The 8x8 λ-Router reflects the connectivity pattern of <u>unidirectional Multi-Stage Interconnection Networks (MINs)</u> in the electronic domain, where the inter-stage pattern is closely related to the routing methodology of the WRONoC.
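The stage structure can be checked with a short count. The sketch below assumes that the stages alternate between 4 and 3 filters and that each add-drop filter is built from two microring resonators (MRRs); the latter is an assumption not stated on this slide, chosen because it reproduces the 56 MRRs quoted for the 8x8 λ-Router elsewhere in the deck.

```python
STAGES = 8
MRRS_PER_FILTER = 2  # assumption: two microrings per add-drop filter

# Stages alternate between 4 and 3 add-drop filters (4, 3, 4, 3, ...).
filters_per_stage = [4 if stage % 2 == 0 else 3 for stage in range(STAGES)]

total_filters = sum(filters_per_stage)
total_mrrs = total_filters * MRRS_PER_FILTER

print(f"filters per stage: {filters_per_stage}")
print(f"total filters: {total_filters}, total MRRs: {total_mrrs}")
```

Starting the alternation with a 3-filter stage instead would give the same totals.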

## **8x8** λ-Router Physical View



8x8 λ-Router logic scheme under the effect of placement and routing constraints

### **DESIGN GUIDELINES FOR MANUAL LAYOUT**

1) Satisfy physical placement of network interfaces.

2) Spread all building blocks homogeneously over the 2D surface: at least 11 MRRs per quarter of the chip (the total number of MRRs is 56).

3) Place optical filters close to the initiators, targets or other connected filters.

4) Route optical waveguides so as to minimize waveguide crossings.


THE ULTIMATE EFFECT IS AN INCREASE OF INSERTION-LOSS STRICTLY DOMINATED BY THE WAVEGUIDE INTERSECTIONS

# 8x8 GWOR Topology



X. Tan et al., "On a Scalable, Non Blocking Optical Router for Photonic Network-on-Chip Designs", SOPO, 2011.

- The 8x8 GWOR <u>is constructed starting from its</u> basic cell, the 4x4 GWOR.
- The 4x4 GWOR consists of <u>4 waveguides</u> that intersect each other, with <u>MRRs placed pairwise at each intersection</u>.
- Initiators and targets are arranged <u>around all cardinal points</u>.
  - <u>Self-communication</u> is not allowed.

#### 8x8 GWOR Real Layout



- The PLACEMENT CONSTRAINTS of the target system deviate significantly from those of the logic scheme.
- The <u>circuitous layout</u> makes the logic scheme <u>hardly recognizable</u>.
- Noticeable increase of waveguide crossings as an effect of the 2D surface mapping.

# 8x8 Folded Crossbar Topology



#### 4x4 Folded Crossbar Logic Scheme

### LENGTH OVERHEAD

Apparent effect of the logic scheme, since the real layout is instead facilitated.

**Every Initiator** can in fact drive an optical waveguide that enters a **RING-LIKE TOPOLOGY** 

This topology lends itself to an interesting optimization already in its logic scheme.

Changing this ORDER for every **Initiator** (see above) causes a waveguide LENGTH OVERHEAD.

# **Layout-Level Physical Details**



- The 8x8 Folded Crossbar layout is much more regular than those of the 8x8 λ-Router and the 8x8 GWOR.
- In the 8x8 Folded Crossbar, MRRs are positioned close to the communication targets, thus facilitating the wavelength-selective ejection of optical signals.

#### IN PREVIOUS COMPARISON FRAMEWORKS SUCH LAYOUT-LEVEL DETAILS ARE TYPICALLY OMITTED

| TOPOLOGY            | Total # of MRRs | MAX # of Crossings<br>Logic Scheme | MAX # of Crossings<br>Real Layout |
|---------------------|-----------------|------------------------------------|-----------------------------------|
| 8x8 λ-Router        | 56              | 7                                  | 64                                |
| 8x8 GWOR            | 48              | 10                                 | 72                                |
| 8x8 Folded Crossbar | 44              | 18                                 | 22                                |
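To give a feel for what these crossing counts mean in dB, the sketch below multiplies each count by the 0.52 dB per-crossing loss listed in the backup loss-parameter table. It isolates the crossing contribution only (propagation, bend and drop losses are ignored), so it is an illustration rather than the full insertion-loss model used for the results.

```python
CROSSING_LOSS_DB = 0.52  # loss per waveguide crossing, from the loss-parameter table

# Maximum crossings per topology: (logic scheme, real layout), from the table above.
MAX_CROSSINGS = {
    "8x8 λ-Router": (7, 64),
    "8x8 GWOR": (10, 72),
    "8x8 Folded Crossbar": (18, 22),
}

def crossing_loss_db(n_crossings: int) -> float:
    """Loss contributed by crossings alone, in dB (other loss terms ignored)."""
    return n_crossings * CROSSING_LOSS_DB

for name, (logic, real) in MAX_CROSSINGS.items():
    print(f"{name}: {crossing_loss_db(logic):.2f} dB (logic) -> "
          f"{crossing_loss_db(real):.2f} dB (layout), "
          f"x{real / logic:.1f} crossings")
```

The Folded Crossbar starts from the worst logic scheme but grows by only about 1.2x in crossings, whereas the λ-Router and GWOR grow by roughly 9x and 7x.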

# **Experimental Results: The Insertion loss**

The insertion loss must be quantified to determine the laser power required to guarantee a predefined BER at the receivers.



We compute the **INSERTION LOSS** of Optical NoCs by modeling **every single path** of a given topology.

> We then identify the INSERTION LOSS critical path (ILmax) across the entire global network.

We make the practical assumption that all laser sources are sized based on ILmax.
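A simplified link-budget sketch of this sizing rule follows. It assumes that the optical power launched by a laser must equal the receiver sensitivity plus ILmax, and that the electrical laser power follows from the laser efficiency and laser-link coupling listed in the backup device table (S = -17 dBm, PLE = 20%, PCW = 90%). The function name and the per-path loss values are hypothetical, and the model used for the actual results may include further terms.

```python
def required_laser_power_mw(il_max_db, sensitivity_dbm=-17.0,
                            laser_efficiency=0.20, coupling_efficiency=0.90):
    """Electrical power of one laser source, in mW, for a worst-case loss il_max_db."""
    # Optical power to launch so that the receiver still sees its sensitivity.
    optical_dbm = sensitivity_dbm + il_max_db
    optical_mw = 10 ** (optical_dbm / 10)  # dBm -> mW
    # Account for wall-plug efficiency and laser-to-link coupling.
    return optical_mw / (laser_efficiency * coupling_efficiency)

# Hypothetical per-path insertion losses in dB (not taken from the slides).
path_losses_db = [6.2, 9.8, 12.5, 8.1]

# All lasers are sized on the critical path.
il_max_db = max(path_losses_db)
print(f"ILmax = {il_max_db} dB -> "
      f"{required_laser_power_mw(il_max_db):.2f} mW per laser")
```

Because the dB-to-mW conversion is exponential, every additional 10 dB of ILmax multiplies the required laser power by ten.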

### **Experimental Results: Insertion Loss Comparison**



- GWOR suffers from 72 crossings against the 10 expected ones (crossing-dominated <u>topology</u>).
- The λ-Router reports 64 crossings vs. 7 in the logic scheme (crossing-dominated <u>topology</u>).
- The Folded Crossbar logic scheme is worse than that of any other topology (a well-known result).
- Surprisingly, the Folded Crossbar maps more efficiently to the target placement constraints.

## **Experimental Results: Total Power**



Total power comprises laser power, modulator power, receiver power and thermal tuning. The total power of GWOR is larger than that of the other topologies, even though the λ-Router uses one more laser than GWOR and the Crossbar to provide the same connectivity. The total power of the λ-Router is 2.47 times lower than that of GWOR. The Folded Crossbar turns out to be the most power-efficient solution, since it consumes only 276 mW, almost 2 orders of magnitude lower than GWOR (16.6 W).
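The quoted figures can be cross-checked with a few lines of arithmetic. The λ-Router total below is derived from the stated 2.47x ratio rather than read from the slides, so treat it as an implied value.

```python
import math

GWOR_TOTAL_W = 16.6              # stated on the slide
FOLDED_CROSSBAR_TOTAL_W = 0.276  # 276 mW, stated on the slide
GWOR_OVER_LAMBDA_ROUTER = 2.47   # stated ratio

# Implied λ-Router total power (derived, not quoted).
lambda_router_total_w = GWOR_TOTAL_W / GWOR_OVER_LAMBDA_ROUTER

# Gap between GWOR and the Folded Crossbar, as a factor and in orders of magnitude.
gap_factor = GWOR_TOTAL_W / FOLDED_CROSSBAR_TOTAL_W
gap_orders = math.log10(gap_factor)

print(f"λ-Router total power: about {lambda_router_total_w:.2f} W")
print(f"GWOR / Folded Crossbar: about {gap_factor:.0f}x "
      f"({gap_orders:.2f} orders of magnitude)")
```

The gap of roughly 60x corresponds to about 1.8 orders of magnitude, in line with the "almost 2 orders" wording.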

## **Total Power Breakdown**

#### A LARGER CONTRIBUTION OF INSERTION LOSS LEADS TO AN INCREASE OF LASERS AND MODULATORS POWER, THUS BECOMING DOMINANT IN THE BREAKDOWN...



THE CROSSBAR RESULTS IN LOWER POWER, AND ITS RECEIVERS DOMINATE THE BREAKDOWN

# What happens when a Ring Topology is used?

#### 7-WAY RING TOPOLOGY REAL LAYOUT



Easiest solution due to its **simplicity** and lower **implementation cost** 

THE ONLY WAY TO ESTABLISH WHETHER THE **8X8 FOLDED CROSSBAR IS THE BEST SOLUTION** IS TO COMPARE IT WITH A **RING TOPOLOGY**...

#### ASSUMPTION

- WE DESIGN A RING ASSUMING **7 AVAILABLE WAVELENGTHS** AS FOR THE CROSSBAR TOPOLOGY

USING MULTIPLE WAVEGUIDES (i.e. spatial division multiplexing) IS THE ONLY WAY TO MEET THIS REQUIREMENT

**RING TOPOLOGY BETTER FITS** THE PLACEMENT CONSTRAINTS.

- THE **RING TOPOLOGY** WORKS LIKE **A BUS** THAT CONTAINS MULTIPLE WAVEGUIDES.
- 7 PARALLEL WAVEGUIDES ARE NEEDED TO DELIVER CONTENTION-FREE GLOBAL CONNECTIVITY.

# PLANNING THE OPTICAL RING

In principle, a ring topology has no crossings. In practice, crossings are necessary at the initiator interfaces, to connect to the parallel waveguides that are furthest away from the <u>injection point</u>.




NOTICE THAT THE LOGIC SCHEME OF ANY RING TOPOLOGY FEATURES SUCH CROSSING WAVEGUIDES, THUS DEGRADING INSERTION LOSS AND TOTAL POWER

AT THE TARGET INTERFACES NO CROSSING APPEARS

(The output signals of the photodetectors (PDs) leave the optical plane directly by means of TSVs)

## **Experimental Results: Ring vs. Folded Crossbar**



The 7-way Ring is 50% more efficient than the Crossbar in terms of insertion loss, due to its lower wiring length (2 cm vs. 2.55 cm) and lower number of crossings (9 vs. 22).
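The 50% figure can be approximately reproduced from the wiring lengths and crossing counts quoted here, using the propagation loss (1.5 dB/cm) and crossing loss (0.52 dB) from the backup loss-parameter table. The sketch neglects bend and drop losses and reads the comparison as a ratio of dB values; both are simplifying assumptions.

```python
PROPAGATION_LOSS_DB_PER_CM = 1.5  # from the loss-parameter table
CROSSING_LOSS_DB = 0.52           # from the loss-parameter table

def worst_path_loss_db(length_cm: float, n_crossings: int) -> float:
    """Propagation plus crossing loss in dB (bends and drops neglected)."""
    return length_cm * PROPAGATION_LOSS_DB_PER_CM + n_crossings * CROSSING_LOSS_DB

ring_db = worst_path_loss_db(2.0, 9)
crossbar_db = worst_path_loss_db(2.55, 22)
reduction = 1 - ring_db / crossbar_db

print(f"Ring: {ring_db:.2f} dB, Crossbar: {crossbar_db:.2f} dB, "
      f"reduction: {reduction:.0%}")
```

Under these assumptions the Ring comes out at about 7.68 dB against about 15.27 dB for the Crossbar, i.e. roughly half.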

7-way Ring is 30% more power efficient than Crossbar

The 50% gap in insertion loss shrinks to 30% in total power because of the significant contribution of the optical receivers to the breakdown: <u>63%</u> in the Crossbar topology and 89% in the Ring one.
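The two receiver shares are consistent with the 30% figure if one assumes that the receivers draw the same absolute power in both topologies (same connectivity, same receiver design). That assumption is not stated in the slides; under it, the Ring total follows from the Crossbar numbers alone:

```python
CROSSBAR_TOTAL_MW = 276.0  # Folded Crossbar total power, from the results
CROSSBAR_RX_SHARE = 0.63   # receivers' share of the Crossbar breakdown
RING_RX_SHARE = 0.89       # receivers' share of the Ring breakdown

# Absolute receiver power, assumed identical in both topologies.
receivers_mw = CROSSBAR_TOTAL_MW * CROSSBAR_RX_SHARE

# Ring total implied by the same receiver power at an 89% share.
ring_total_mw = receivers_mw / RING_RX_SHARE
saving = 1 - ring_total_mw / CROSSBAR_TOTAL_MW

print(f"Receivers: {receivers_mw:.1f} mW, implied Ring total: {ring_total_mw:.1f} mW")
print(f"Ring saving over Crossbar: {saving:.0%}")
```

The implied saving is about 29%, close to the quoted 30%: since receiver power does not depend on insertion loss, it caps how much a lower-loss topology can save overall.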

#### **RING IS** AN **APPEALING SOLUTION** FOR THE CONSIDERED SYSTEM (**64 CORES**)

# Conclusions

- This paper focuses on the **design predictability concern in Optical Network-on-Chip design that arises from the need to meet specific PLACEMENT CONSTRAINTS**.
- Case study: processor-memory communication in a 3D stacked system.
- Experimental results show large deviations of insertion loss from the logic scheme to the physical implementation as an effect of placement constraints.
- A spatial-division-multiplexed Ring turns out to be the most power-efficient solution, followed by an optimized crossbar.
- The presented results also show that ABSTRACT AND EVEN PENCIL-AND-PAPER FLOORPLANNING considerations are not suitable to predict network quality metrics.
- An AUTOMATIC PLACE & ROUTE TOOL is a must to overcome the MANUAL-INTENSIVE characterization of insertion loss and power degradation, and to account for <u>placement constraints</u> and <u>physical implementation trade-offs</u>.

# **Future Works**

- Scalability concerns for optical ring topologies will be the focus of our FUTURE WORK.
- IN PARALLEL, LARGER DIES WILL LEAD TO HIGHER PROPAGATION LOSS, THUS RAISING ANOTHER CONCERN: **THE WIRING LENGTH OVERHEAD**.
- We will also address the scalability of Wavelength-Routed Optical NoC topologies, targeting:
  - NETWORK PARTITIONING
  - WAVELENGTH REUSE
- Together with the TECHNICAL UNIVERSITY OF MUNICH, we are working on an AUTOMATED PLACE&ROUTE TOOLFLOW for OPTICAL NOCs, in an attempt to bridge a significant GAP in the field.
- CONTRASTING POWER EFFICIENCY: OPTICAL NOC VS. ELECTRICAL NOC

# ACKNOWLEDGEMENTS

This work has been supported by the **PHOTONICA project** under the "**FIRB-Futuro in Ricerca**" program, funded by the Italian Government. We **would like to thank** all the researchers who have joined the project:

- <u>Coordinator</u>: **Davide Bertozzi** (University of Ferrara, Italy)
- <u>Partner</u>: **Gaetano Bellanca** (University of Ferrara, Italy)
- <u>Partner</u>: **Giovanna Calò** (Politecnico of Bari, Italy)
- <u>Partner</u>: **Sandro Bartolini** (University of Siena, Italy)

## THANKS TO EVERYONE

Luca Ramini (luca.ramini@unife.it)

# Backup

### LOSS PARAMETERS

| Physical Component                                             | Loss Parameter |
|----------------------------------------------------------------|----------------|
| Optical link (from literature)                                 | 1.5 dB/cm      |
| Waveguide bend (from literature)                               | 0.005 dB       |
| Waveguide crossing, optimized by elliptical taper (from FDTD)  | 0.52 dB        |
| Drop, optimized by elliptical taper (from FDTD)                | 0.013 dB       |

# **Device Parameters**

| Device                             | Features |
|------------------------------------|----------|
| LASER                              | CW (Continuous Wave)<br>Laser efficiency PLE = 20%<br>Laser-link coupling PCW = 90% |
| MODULATOR                          | Silicon disk<br>Launch efficiency β = 20%<br>Dynamic dissipation = 3 fJ/bit<br>Static power = 30W<br>Vdd = 1 V<br>Modulator power depends on ILmax (see [14]) |
| DETECTOR                           | CMOS (45 nm) hybrid silicon receiver<br>Sensitivity S = -17 dBm (BER = 10^-12 @ 10 Gbit/s)<br>Power = 3.95 mW (see [13]) |
| PSEs (Photonic Switching Elements) | Thermal tuning = 20 µW/ring (see [10]) |