### System-level Exploration of Dynamical Clusteration for Adaptive Power Management in Network-on-chip

Liang Guang, Ethiopia Nigussie, Hannu Tenhunen, Dep. of Information Technology,

University of Turku, Finland

### Introduction

- Many-core platforms with a NoC as the communication structure are steadily growing in use. More cores are being integrated, with each core being simpler. Examples: Teraflop 80-core, Tilera 64-core, ASAP 167-core.
- Realizing **multiple voltage and frequency islands** is an effective way to achieve high power efficiency, because the workload in a massively parallel platform varies both temporally and spatially.
- **Global communication** between cores is a major power consumer. Its contribution will keep increasing as the platform is further parallelized into smaller units connected by a larger communication network.
- This work is an innovative yet initial exploration of realizing **dynamically clustered power management in many-core systems**. By integrating supporting power delivery and clocking techniques, clusters can be reconfigured at run time to trade off power and performance with minimized latency and power overhead.

### System Architecture

**Multiple On-chip Power Networks** 



Network regions are dynamically configured into power domains, supported by:

- ✓ Multiple on-chip power delivery networks
- ✓ Reconfigurable inter-router links

## Multiple On-chip PDNs (Power Delivery Networks)

- A scalable approach to provide adaptive power domain configuration
- Used in ASAP 167-core NoC (Truong et al. 2009)



- ASAP prototype results: 7 power grids are fabricated on the M6/M7 metal layers. The power switches account for only 4% of each tile's area.

(Truong et al. 2009) A 167-processor computational platform in 65nm CMOS. JSSC 44(4):1130-1144, 2009

### Reconfigurable Inter-Router Links (1)

Adaptive inter-router link structure reconfigurable for different power domain settings:

- ✓ If both ends are configured into the same power domain, normal wire channels are enabled to minimize overhead.
- ✓ If the ends are configured into different power domains, bi-synchronous FIFOs are needed for synchronization.
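
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `LinkMode`, `link_mode` and `configure_links`, and the representation of power domains as integer ids on a 2-D mesh, are assumptions made for the example.

```python
from enum import Enum


class LinkMode(Enum):
    WIRE = "normal wire channel"
    FIFO = "bi-synchronous FIFO"


def link_mode(domain_a: int, domain_b: int) -> LinkMode:
    """Pick the channel type from the power domains at the two link ends."""
    return LinkMode.WIRE if domain_a == domain_b else LinkMode.FIFO


def configure_links(domains):
    """domains: 2-D list of (hypothetical) power-domain ids, one per router.

    Returns a dict mapping each east and south mesh link to its LinkMode.
    """
    rows, cols = len(domains), len(domains[0])
    modes = {}
    for r in range(rows):
        for c in range(cols):
            for r2, c2 in ((r, c + 1), (r + 1, c)):  # east and south neighbours
                if r2 < rows and c2 < cols:
                    modes[((r, c), (r2, c2))] = link_mode(
                        domains[r][c], domains[r2][c2]
                    )
    return modes


# 2x3 mesh: the right-hand column sits in a different power domain.
domains = [[0, 0, 1],
           [0, 0, 1]]
modes = configure_links(domains)
fifo_links = [link for link, mode in modes.items() if mode is LinkMode.FIFO]
```

Only the two links that cross the domain boundary end up in FIFO mode; every other link stays a plain wire channel.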



### Reconfigurable Inter-Router Links (2)

#### • Bi-synchronous FIFO

- ✓ The synchronization method most convenient for CAD flow integration (for example, the DSPIN NoC)
- ✓ The more the clocks at the two ends differ, the deeper the FIFO must be to minimize metastability while ensuring a certain throughput (Panades et al. 2007)

#### • Pseudochronous/Quasisynchronous clocking

- ✓ A special mesochronous timing with a predictable and controllable constant phase shift between two adjacent nodes on a regular-layout NoC (Öberg 2003)
- ✓ Used when two adjacent network regions are configured with the same frequency
- ✓ Controllable skew without metastability issues.

Panades et al. 2007, Bi-synchronous FIFO for Synchronous Circuit Communication Well Suited for NoC in GALS Structures. In Proc. of NOCS 2007.

*Öberg 2003, Clocking Strategies for Networks-on-Chip, Networks on Chip, 153-172, Kluwer Academic Publishers*



Simplified view of bi-synchronous FIFO, highlighting most power-hungry datapath



Illustration of Pseudochronous clocking (*Öberg 2003*)

### Dynamic Clusterization Steps (1)



- 1) The traffic condition of each region is collected
- 2) Dynamic clusters are identified
- 3) The boundary links of the clusters are configured with FIFO-based channels
- 4) The clusters are switched to the proper Vdd and clock

## Dynamic Clusterization Steps (2)

#### 1) Run-time traffic condition collection

- ✓ The traffic load of each region, averaged over a history window, needs to be collected by a central monitor.
- ✓ Such traffic load reporting will be generalized into a monitoring flow. With a relatively long reporting interval, the overhead is minimal. The detailed implementation is initially explored in (Guang et al. 2008).
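
A history-window average with periodic reporting might look like the sketch below. The class name `LoadMonitor`, the per-cycle sampling, and the window and interval values are illustrative assumptions; the slides do not specify them here.

```python
from collections import deque


class LoadMonitor:
    """Keeps the last `window` load samples of one region (hypothetical model)."""

    def __init__(self, window: int, report_interval: int):
        self.samples = deque(maxlen=window)  # old samples drop out automatically
        self.report_interval = report_interval
        self.cycle = 0

    def sample(self, buffer_load: float):
        """Record one sample; return the windowed average on report cycles, else None."""
        self.samples.append(buffer_load)
        self.cycle += 1
        if self.cycle % self.report_interval == 0:
            return sum(self.samples) / len(self.samples)
        return None


monitor = LoadMonitor(window=4, report_interval=4)
reports = [monitor.sample(load) for load in [1, 2, 3, 4, 5, 6, 7, 8]]
# reports == [None, None, None, 2.5, None, None, None, 6.5]
```

A longer `report_interval` means fewer messages to the central monitor, which is the overhead argument made in the bullet above.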

### 2) Dynamic cluster identification

- ✓ Search for the largest cluster (minimizing the interface overhead)
- ✓ Managed by the central monitor with the traffic information collected

(Figure: per-region load map grouped into Cluster 1 to Cluster 4)
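
The slides do not spell out the search procedure. One plausible way to obtain maximal clusters is a flood fill that merges adjacent regions needing the same voltage/frequency level, so that FIFOs are needed only where the level changes. The function `identify_clusters`, the two-level classification, and the threshold value are assumptions for illustration only.

```python
from collections import deque


def identify_clusters(load, threshold):
    """Group 4-connected mesh regions that fall on the same side of `threshold`.

    Returns (label, clusters): label[r][c] is the cluster id of region (r, c),
    clusters[i] holds the level flag and member list of cluster i.
    """
    rows, cols = len(load), len(load[0])
    label = [[-1] * cols for _ in range(rows)]
    clusters = []
    for r0 in range(rows):
        for c0 in range(cols):
            if label[r0][c0] != -1:
                continue  # already absorbed into an earlier cluster
            high = load[r0][c0] > threshold
            cid = len(clusters)
            members = []
            queue = deque([(r0, c0)])
            label[r0][c0] = cid
            while queue:  # breadth-first flood fill
                r, c = queue.popleft()
                members.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and label[nr][nc] == -1
                            and (load[nr][nc] > threshold) == high):
                        label[nr][nc] = cid
                        queue.append((nr, nc))
            clusters.append({"high": high, "members": members})
    return label, clusters


load = [[0.1, 0.2, 0.9],
        [0.1, 0.8, 0.9],
        [0.1, 0.1, 0.2]]
label, clusters = identify_clusters(load, threshold=0.5)
# label == [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
```

Because each cluster grows until it hits a region with a different level, no two neighbouring clusters share a level, which keeps the number of boundary links small.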

Guang et al. 2008, Low-latency and Energy-efficient Monitoring Interconnect for Hierarchical-agent-monitored NoCs. In Proc. Norchip 2008.

## **Dynamic Clusterization Steps (3)**

### **3)** Interface reconstruction

- ✓ The links on the boundaries of the identified clusters need to enable FIFO-based connection.
- ✓ The reconstruction has to be done before switching to new Vdd and clocking.

### 4) New supply reconfiguration

- ✓ Reconfigure the power switches to the proper Vdd, and the PLLs to the proper clock output.
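
The ordering constraint between steps 3 and 4 can be made explicit with a small planner. This is a sketch under assumptions: `reconfiguration_plan`, the action tuples, and the `"low"`/`"high"` level names are hypothetical, and the final phase that returns a link to plain wires is an assumption that the slides do not state.

```python
def reconfiguration_plan(old_level, new_level, links):
    """Return an ordered action list for moving from old_level to new_level.

    old_level / new_level: dict node -> supply level name.
    links: iterable of (node_a, node_b) pairs.
    """
    actions = []
    # Step 3: enable FIFOs on every future cluster boundary BEFORE any supply changes.
    for a, b in links:
        if new_level[a] != new_level[b]:
            actions.append(("enable_fifo", a, b))
    # Step 4: switch power switches and PLLs of the nodes whose level changes.
    for node in sorted(new_level):
        if new_level[node] != old_level[node]:
            actions.append(("switch_supply", node, new_level[node]))
    # Assumption (not in the slides): former boundary links that are now
    # internal to a cluster return to plain wires only after the switch.
    for a, b in links:
        if old_level[a] != old_level[b] and new_level[a] == new_level[b]:
            actions.append(("enable_wire", a, b))
    return actions


old = {0: "low", 1: "low", 2: "low"}
new = {0: "low", 1: "high", 2: "high"}
plan = reconfiguration_plan(old, new, links=[(0, 1), (1, 2)])
# plan == [("enable_fifo", 0, 1), ("switch_supply", 1, "high"), ("switch_supply", 2, "high")]
```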

# Experiment Setup (1)

#### Network Configuration

- ✓ 8\*8 mesh NoC, STF switching, X-Y routing
- ✓ 64-bit wires, 1mm long
- ✓ FIFO depth 6 (to ensure 100% throughput in asynchronous timing; Panades et al. 2007)
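
For readers unfamiliar with X-Y routing, the sketch below shows the dimension-ordered path a flit takes on the mesh: all hops along X first, then all hops along Y. The function name `xy_route` and the `(x, y)` coordinate convention are choices made for this example.

```python
def xy_route(src, dst):
    """Return the hop-by-hop path from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # travel along X until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then travel along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


path = xy_route((0, 0), (2, 1))
# path == [(0, 0), (1, 0), (2, 0), (2, 1)]
```

On the 8\*8 mesh the longest such path, corner to corner, is 14 hops.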

### Power Estimation

- ✓ Two voltage/frequency pairs: (0.6 GHz, 0.6 V) and (1.2 GHz, 1.5 V)
- ✓ Router and normal wiring energy estimated by Orion 2.0
- ✓ FIFO access energy estimated by the buffer energy in a router; latency modelled after Panades et al. 2007.
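
As a rough orientation for why the lower pair saves energy, the standard first-order CMOS relation for dynamic switching energy can be applied to the two supply voltages. This back-of-the-envelope estimate assumes the second pair's voltage is 1.5 V, ignores leakage, and is not the Orion 2.0 model used in the experiments.

$$E_{\text{dyn}} \propto C\,V_{dd}^{2}, \qquad \frac{E_{\text{low}}}{E_{\text{high}}} \approx \left(\frac{0.6}{1.5}\right)^{2} = 0.16$$

Under these assumptions, a switching event at the low setting costs about one sixth of the energy it costs at the high setting, while the clock runs at half the frequency.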

### • DVFS algorithm setting

- ✓ The traffic load is averaged and reported every 50 cycles
- ✓ By default, the low voltage/frequency pair is used. When the average buffer load is above a threshold, the high voltage/frequency pair is used.
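
The threshold rule can be sketched as follows. The two operating points come from the slides (reading the second voltage as 1.5 V); the threshold value, the function names and the synthetic load trace are assumptions for illustration.

```python
LOW = (0.6, 0.6)   # (GHz, V), default setting
HIGH = (1.2, 1.5)  # (GHz, V), used when the buffer load is high


def select_vf(avg_buffer_load, threshold):
    """Return the voltage/frequency pair for one reporting interval."""
    return HIGH if avg_buffer_load > threshold else LOW


def run_dvfs(load_per_cycle, threshold, interval=50):
    """Average the load over each full `interval`-cycle window and pick a pair."""
    history = []
    for start in range(0, len(load_per_cycle) - interval + 1, interval):
        window = load_per_cycle[start:start + interval]
        history.append(select_vf(sum(window) / interval, threshold))
    return history


# Synthetic trace: quiet, then a burst, then quiet again (hypothetical values).
trace = [0.1] * 50 + [0.9] * 50 + [0.2] * 50
history = run_dvfs(trace, threshold=0.5)
# history == [LOW, HIGH, LOW]
```

A single threshold reacts to every interval; a real controller might add hysteresis to avoid frequent switching, which the slides leave open.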

## Experiment Setup (2)

#### Energy/performance tradeoff monitoring buffer load (Guang & Jantsch 2006)

- ✓ Buffer load is a simple and direct indicator of the network performance.
- ✓ Lower frequency leads to higher buffer load (given the same input traffic), with lower energy consumption.
- ✓ The exact curve of buffer load vs. latency varies with the network configuration.
- ✓ The tradeoff depends on the latency tolerance of the processing elements.



Buffer Load vs. Latency (8\*8 NoC, STF switching, X-Y routing)

*Guang & Jantsch 2006, Adaptive power management for the on-chip communication network, In Proc. of DSD 2006.*

### **Traffic Patterns**

- Type 1: Uniform traffic
- Type 2: Hotspot traffic
- Type 3: Hotspot traffic (as Type 2), but with a locality destination pattern (Lu et al. 2008)
- Type 4: Hotspot traffic with a different hotspot location
- Type 5: Same spatial variation as Type 4, but with higher input traffic
- Type 6: Same spatial variation as Type 5, but with even higher input traffic

Lu et al. 2008. Network-on-chip benchmarking specification part 2: Microbenchmark specification version 1.0. Technical report, OCP International Partnership Association, 2008.

# Evaluation (1)

#### **Alternative Architectures**

#### • PNDVFS (Per-Network DVFS)

- ✓ The whole NoC is configured with the lower power supply if the overall traffic load is low
- ✓ The simplest form of DVFS, with no synchronization overhead (Guang & Jantsch 2006)

#### • SCDVFS (Static-clustered DVFS)

- ✓ Clusters are partitioned at design time (Guang et al. 2008)

#### • Per-core DVFS

- ✓ Conventional per-core DVFS with a static synchronization interface is too "expensive".
- ✓ Potential per-core DVFS with reconfigurable links requires further analysis of how to avoid frequent scaling.

Uniform partition for SCDVFS


|               | Average energy per flit (×10⁻¹⁰ J) | Average latency per flit (cycles) |
|---------------|------------------------------------|-----------------------------------|
| Router + Link | 6.24                               | 16.83                             |
| FIFO          | 1.96                               | 18.33                             |
| Increase      | 31%                                | 112%                              |

Initial Exploration of Overheads using Conventional Per-core DVFS

*Guang et al. 2008. Autonomous DVFS on Supply Islands for Energy-constrained NoC Communication, LNCS 5545, 2008* 

# Evaluation (2)

- Energy comparison
- ✓ In general, DCDVFS (dynamically clustered DVFS, the scheme explored here) achieves lower average energy.
- ✓ The exception is uniform traffic with no spatial or temporal variation, where the FIFO overhead leads to more energy consumption.
- ✓ The more varying and unpredictably distributed the traffic, the higher the energy benefit (T4-T6).
- ✓ The major overhead comes from the FIFO.



Comparison of Average Energy (Normalized) of Three DVFS Architectures

# **Evaluation (3)**

#### **FIFO energy overhead**

- ✓ For DCDVFS, the FIFO contributes a significant energy overhead.
- ✓ Despite this overhead, the total energy is still lower because of the lowered running frequency.
- ✓ For SCDVFS, the FIFO contributes a smaller percentage of the energy, due to the larger cluster size.
- ✓ No FIFO exists for PNDVFS.



FIFO energy overhead for three DVFS architectures

## **Evaluation (4)**

**Average Latency comparison of three DVFS architectures** (Normalized with PNDVFS)

| Traffic trace | SCDVFS | DCDVFS | FIFO overhead |
|---------------|--------|--------|---------------|
| T1            | 1.09   | 1.49   | 24%           |
| T2            | 1.04   | 1.68   | 26%           |
| T3            | 1.09   | 1.17   | 10%           |
| T4            | 0.80   | 1.45   | 19%           |
| T5            | 1.03   | 1.40   | 18%           |
| T6            | 0.93   | 1.32   | 11%           |

- ✓ Natural consequence of lowered switching frequency
- ✓ Predictably bounded latency increase because of the congestion avoidance
- ✓ Significant FIFO latency overhead

# **Evaluation (5)**



### Area comparison of three DVFS architectures

- ✓ DCDVFS needs more area for the reconfigurable links.
- ✓ The increase is reasonable considering the whole die area.
- ✓ Silicon area is traded for power efficiency (the power budget is a tighter limit than transistor and wiring resources).

## Conclusion

- Run-time reconfiguration leads to better power efficiency
- For fast-growing massively parallel on-chip platforms, run-time clusterization for applying adaptive power-management schemes is particularly useful for reducing the synchronization overhead
- System-level exploration is necessary before time-consuming low-level implementation
- Future study focuses on:
- ✓ Further design choice exploration, for instance timing analysis of each configuration step
- ✓ Circuit-level modeling of essential structures (reconfigurable link structure, pseudochronous clocking, etc.)