## Rethinking Memory System Design (along with Interconnects)

Onur Mutlu

onur@cmu.edu

http://users.ece.cmu.edu/~omutlu/

December 5, 2015 NoCArc 2015 Keynote





## Interconnect Papers MICRO 2012, 2015



**SAFARI** 






## Memory System Papers MICRO 2012, 2015




## Let's Put It in Context: MICRO 2015



## The Obvious Question

Is interconnect unimportant?





No.

- On the contrary, it is critical.
- It is a fundamental block that affects many things in the system.
- Yet, we need to examine and enhance it within the context of the system.



## Some Examples

- Application-awareness in the interconnect [Das+ MICRO'09, HPCA'13]
- Focus on critical requests [Aergia, Das+ ISCA'10]
- Quality of Service and predictability [Grot+ MICRO'09, ISCA'11]
- Efficient data movement [BLESS ISCA'09, Chang+ HPCA'16]
- Interconnect design for memory systems (DRAM, hybrid memory, NVM) [Lee+ HPCA'13, Chang+ HPCA'16]



## Challenge for Interconnects

# Memory-Centric Interconnect Design



## The Main Memory System



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

## Memory System: A Shared Resource View



## State of the Main Memory System

- Recent technology, architecture, and application trends
  - lead to new requirements
  - exacerbate old requirements
- DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements
- Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging
- We need to rethink the main memory system
   to fix DRAM issues and enable emerging technologies
   to satisfy all requirements




- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary



## Major Trends Affecting Main Memory (I)

Need for main memory capacity, bandwidth, QoS increasing

#### Main memory energy/power is a key system design concern

#### DRAM technology scaling is ending



## Major Trends Affecting Main Memory (II)

- Need for main memory capacity, bandwidth, QoS increasing
  - Multi-core: increasing number of cores/agents
  - Data-intensive applications: increasing demand/hunger for data
  - Consolidation: cloud computing, GPUs, mobile, heterogeneity

Main memory energy/power is a key system design concern

DRAM technology scaling is ending



## Example: The Memory Capacity Gap

Core count doubling ~ every 2 years

DRAM DIMM capacity doubling ~ every 3 years



- Memory capacity per core expected to drop by 30% every two years
- Trends worse for memory bandwidth per core!

## Major Trends Affecting Main Memory (III)

Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern
  - ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
  - DRAM consumes power even when not used (periodic refresh)
- DRAM technology scaling is ending



## Major Trends Affecting Main Memory (IV)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

#### DRAM technology scaling is ending

- ITRS projects DRAM will not scale easily below X nm
- Scaling has provided many benefits:
  - higher capacity (density), lower cost, lower energy





- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary



## The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]



DRAM capacity, cost, and energy/power hard to scale

## An Example of the DRAM Scaling Problem



Repeatedly opening and closing a row enough times within a refresh interval induces **disturbance errors** in adjacent rows in **most real DRAM chips you can buy today** 

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

## Most DRAM Modules Are at Risk

- A company: up to $1.0 \times 10^7$ errors
- B company: up to $2.7 \times 10^6$ errors
- C company: up to $3.3 \times 10^5$ errors

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

























## Observed Errors in Real Systems

| CPU Architecture          | Errors | Access Rate |
|---------------------------|--------|-------------|
| Intel Haswell (2013)      | 22.9K  | 12.3M/sec   |
| Intel Ivy Bridge (2012)   | 20.7K  | 11.7M/sec   |
| Intel Sandy Bridge (2011) | 16.1K  | 11.6M/sec   |
| AMD Piledriver (2012)     | [...]  | 6.1M/sec    |

- A real reliability & security issue
- In a more controlled environment, we can induce as many as ten million disturbance errors

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
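To relate the access rates above to "enough times within a refresh interval," the following minimal sketch converts each rate into accesses per refresh window. It assumes a 64 ms refresh window (the typical value quoted later in this talk); the names `REFRESH_WINDOW_S` and `accesses_per_window` are illustrative, not from the paper.

```python
# Assumption: a 64 ms refresh window, the typical value for commodity DRAM.
REFRESH_WINDOW_S = 0.064

# Access rates copied from the table above (accesses per second).
ACCESS_RATE_PER_SEC = {
    "Intel Haswell (2013)": 12.3e6,
    "Intel Ivy Bridge (2012)": 11.7e6,
    "Intel Sandy Bridge (2011)": 11.6e6,
    "AMD Piledriver (2012)": 6.1e6,
}


def accesses_per_window(rate_per_sec: float,
                        window_s: float = REFRESH_WINDOW_S) -> float:
    """Number of accesses that fit into one refresh window at a given rate."""
    return rate_per_sec * window_s


if __name__ == "__main__":
    for cpu, rate in ACCESS_RATE_PER_SEC.items():
        print(f"{cpu}: ~{accesses_per_window(rate):,.0f} accesses per window")
```

Under this assumption, each listed system can issue several hundred thousand accesses before the affected rows are refreshed again.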

## Errors vs. Vintage






## Experimental DRAM Testing Infrastructure



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)

An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)

## Experimental Infrastructure (DRAM)




Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

RowHammer Characterization Results

1. Most Modules Are at Risk
2. Errors vs. Vintage
3. Error = Charge Loss
4. Adjacency: Aggressor & Victim
5. Sensitivity Studies
6. Other Results in Paper
7. Solution Space

#### One Can Take Over an Otherwise-Secure System

### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

**Project Zero**: News and updates from the Project Zero team at Google

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015), posted Monday, March 9, 2015

## RowHammer Security Attack Example

- "Rowhammer" is a problem with some recent DRAM devices in which repeatedly accessing a row of memory can cause bit flips in adjacent rows (Kim et al., ISCA 2014).
  - Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)
- We tested a selection of laptops and found that a subset of them exhibited the problem.
- We built two working privilege escalation exploits that use this effect.
  - Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)
- One exploit uses rowhammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process.
- When run on a machine vulnerable to the rowhammer problem, the process was able to induce bit flips in page table entries (PTEs).
- It was able to use this to gain write access to its own page table, and hence gain read-write access to all of physical memory.

## Security Implications



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

## Apple's Patch for RowHammer

#### https://support.apple.com/en-gb/HT204934

Available for: OS X Mountain Lion v10.8.5, OS X Mavericks v10.9.5

Impact: A malicious application may induce memory corruption to escalate privileges

Description: A disturbance error, also known as Rowhammer, exists with some DDR3 RAM that could have led to memory corruption. This issue was mitigated by increasing memory refresh rates.

CVE-ID

CVE-2015-3693 : Mark Seaborn and Thomas Dullien of Google, working from original research by Yoongu Kim et al (2014)

## Recap: The DRAM Scaling Problem

#### **DRAM Process Scaling Challenges**

#### Refresh

- Difficult to build high-aspect ratio cell capacitors, decreasing cell capacitance

THE MEMORY FORUM 2014

## Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling

Uksong Kang, Hak-soo Yu, Churoo Park, \*Hongzhong Zheng, \*\*John Halbert, \*\*Kuljit Bains, SeongJin Jang, and Joo Sun Choi



#### How Do We Solve The Problem?



## Solution 1: Fix DRAM

- Overcome DRAM shortcomings with
  - System-DRAM co-design
  - Novel DRAM architectures, interface, functions
  - Better waste management (efficient utilization)
- Key issues to tackle
  - Enable reliability at low cost
  - Reduce energy
  - Improve latency and bandwidth
  - Reduce waste (capacity, bandwidth, latency)
  - Enable computation close to data

## Solution 1: Fix DRAM

- Liu+, "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
- Kim+, "A Case for Exploiting Subarray-Level Parallelism in DRAM," ISCA 2012.
- Lee+, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
- Liu+, "An Experimental Study of Data Retention Behavior in Modern DRAM Devices," ISCA 2013.
- Seshadri+, "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data," MICRO 2013.
- Pekhimenko+, "Linearly Compressed Pages: A Main Memory Compression Framework," MICRO 2013.
- Chang+, "Improving DRAM Performance by Parallelizing Refreshes with Accesses," HPCA 2014.
- Khan+, "The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study," SIGMETRICS 2014.
- Luo+, "Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost," DSN 2014.
- Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
- Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.
- Qureshi+, "AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems," DSN 2015.
- Meza+, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field," DSN 2015.
- Kim+, "Ramulator: A Fast and Extensible DRAM Simulator," IEEE CAL 2015.
- Seshadri+, "Fast Bulk Bitwise AND and OR in DRAM," IEEE CAL 2015.
- Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," ISCA 2015.
- Ahn+, "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture," ISCA 2015.
- Lee+, "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM," PACT 2015.
- Seshadri+, "Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses," MICRO 2015.
- Avoid DRAM:
  - Seshadri+, "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing," PACT 2012.
  - Pekhimenko+, "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012.
  - Seshadri+, "The Dirty-Block Index," ISCA 2014.
  - Pekhimenko+, "Exploiting Compressed Block Size as an Indicator of Future Reuse," HPCA 2015.
  - Vijaykumar+, "A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps," ISCA 2015.

## Solution 2: Emerging Memory Technologies

- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
- Example: Phase Change Memory
  - Expected to scale to 9nm (2022 [ITRS])
  - Expected to be denser than DRAM: can store multiple bits/cell
- But, emerging technologies have shortcomings as well
  - Can they be enabled to replace/augment/surpass DRAM?
- Lee+, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA'09, CACM'10, Micro'10.
- Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters 2012.
- Yoon, Meza+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012.
- Kultursay+, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013.
- Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.
- Lu+, "Loose Ordering Consistency for Persistent Memory," ICCD 2014.
- Zhao+, "FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems," MICRO 2014.
- Yoon, Meza+, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories," ACM TACO 2014.
- Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.

## Solution 3: Hybrid Memory Systems



Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012.

Yoon, Meza et al., "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

#### Exploiting Memory Error Tolerance with Hybrid Memory Systems



Heterogeneous-Reliability Memory [DSN 2014], on Microsoft's Web Search workload:

- Reduces server hardware cost by 4.7%
- Achieves single server availability target of 99.90%

## An Orthogonal Issue: Memory Interference



Cores interfere with each other when accessing shared main memory

### An Orthogonal Issue: Memory Interference

- Problem: Memory interference between cores is uncontrolled
  - $\rightarrow$  unfairness, starvation, low performance
  - $\rightarrow$  uncontrollable, unpredictable, vulnerable system
- Solution: QoS-Aware Memory Systems
  - Hardware designed to provide a configurable fairness substrate
    - Application-aware memory scheduling, partitioning, throttling
  - Software designed to configure the resources to satisfy different QoS goals
- QoS-aware memory systems can provide predictable performance and higher efficiency



#### Goal: Predictable Performance in Complex Systems



- Heterogeneous agents: CPUs, GPUs, and HWAs
- Main memory interference between CPUs, GPUs, HWAs

How to allocate resources to heterogeneous agents to mitigate interference and provide predictable performance?

#### Strong Memory Service Guarantees

- Goal: Satisfy performance/SLA requirements in the presence of shared main memory, heterogeneous agents, and hybrid memory/storage
- Approach:
  - Develop techniques/models to accurately estimate the performance loss of an application/agent in the presence of resource sharing
  - Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications
  - All the while providing high system performance
- Subramanian et al., "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems," HPCA 2013.
- Subramanian et al., "The Application Slowdown Model," MICRO 2015.

## Challenge for Interconnects

# QoS and Predictability



## Some Promising Directions

- New memory architectures
  - Rethinking DRAM and flash memory
  - A lot of hope in fixing DRAM

#### Enabling emerging NVM technologies

- Hybrid memory systems
- Single-level memory and storage
- A lot of hope in hybrid memory systems and single-level stores
- System-level memory/storage QoS
  - A lot of hope in designing a predictable system



- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary



## Rethinking DRAM

- In-Memory Computation
- Refresh
- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

## Why In-Memory Computation Today?

- Push from Technology
  - DRAM scaling in jeopardy
    - $\rightarrow$  Controllers close to DRAM
    - $\rightarrow$  Industry open to new memory architectures
- Pull from Systems and Applications
  - Data access is a major system and application bottleneck
  - Systems are energy limited
  - Data movement much more energy-hungry than computation



## Two Approaches to In-Memory Processing

1. Minimally change DRAM to enable simple yet powerful computation primitives
   - RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data (Seshadri et al., MICRO 2013)
   - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
2. Exploit the control logic in 3D-stacked memory to enable more comprehensive computation near memory
   - PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture (Ahn et al., ISCA 2015)
   - A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (Ahn et al., ISCA 2015)

### Today's Memory: Bulk Data Copy



1046ns, 3.6uJ (for 4KB page copy via DMA)

#### Future: RowClone (In-Memory Copy)



## DRAM Subarray Operation (load one byte)



### RowClone: In-DRAM Row Copy



#### Generalized RowClone

0.01% area cost



## RowClone: Latency and Energy Savings



Seshadri et al., "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data," MICRO 2013.

#### RowClone: Application Performance



#### RowClone: Multi-Core Performance



#### End-to-End System Design



How to communicate occurrences of bulk copy/ initialization across layers?

How to ensure cache coherence?

How to maximize latency and energy savings?

How to handle data reuse?

### Goal: Ultra-Efficient Processing Near Data



#### Memory similar to a "conventional" accelerator

### Enabling In-Memory Search



- What is a flexible and scalable memory interface?
- What is the right partitioning of computation capability?
- What is the right low-cost memory substrate?
- What memory technologies are the best enablers?
- How do we rethink/ease search algorithms/applications?

## Challenge for Interconnects

Efficient Data Movement



#### Enabling In-Memory Computation

| DRAM<br>Support                                  | Cache<br>Coherence                  | Virtual Memory<br>Support    |
|--------------------------------------------------|-------------------------------------|------------------------------|
| RowClone<br>(MICRO 2013)                         | Dirty-Block<br>Index<br>(ISCA 2014) | Page Overlays<br>(ISCA 2015) |
| In-DRAM<br>Gather Scatter<br>(MICRO 2015)        | Non-contiguous<br>Cache lines       | Gathered Pages               |
| In-DRAM Bitwise<br>Operations<br>(IEEE CAL 2015) | ?                                   | ?                            |

#### In-DRAM AND/OR: Triple Row Activation



#### In-DRAM Bulk Bitwise AND/OR Operation

- BULKAND A, B $\rightarrow$ C
- Semantics: Perform a bitwise AND of two rows A and B and store the result in row C
- R0: reserved zero row; R1: reserved one row
- D1, D2, D3: designated rows for triple activation

1. RowClone A into D1
2. RowClone B into D2
3. RowClone R0 into D3
4. ACTIVATE D1,D2,D3
5. RowClone Result into C
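The five steps can be mimicked in software. The sketch below assumes, following Seshadri+ (IEEE CAL 2015), that simultaneously activating three rows leaves every bitline at the majority value of the three cells; rows are modeled as Python integers, and `ROW_BITS`, `majority`, `bulk_and`, and `bulk_or` are illustrative names rather than a real interface. With the all-zeros row R0 as the third operand the majority reduces to AND; with the all-ones row R1 it reduces to OR.

```python
ROW_BITS = 16                      # assumption: a tiny row, for illustration
R0 = 0                             # reserved all-zeros row
R1 = (1 << ROW_BITS) - 1           # reserved all-ones row


def majority(d1: int, d2: int, d3: int) -> int:
    """Bitwise majority of three rows: models ACTIVATE D1,D2,D3."""
    return (d1 & d2) | (d2 & d3) | (d1 & d3)


def bulk_and(a: int, b: int) -> int:
    """BULKAND A, B -> C, following the five steps listed above."""
    d1 = a                         # 1. RowClone A into D1
    d2 = b                         # 2. RowClone B into D2
    d3 = R0                        # 3. RowClone R0 into D3
    result = majority(d1, d2, d3)  # 4. ACTIVATE D1,D2,D3
    return result                  # 5. RowClone Result into C


def bulk_or(a: int, b: int) -> int:
    """Same sequence, but with the all-ones control row R1 in D3."""
    return majority(a, b, R1)


if __name__ == "__main__":
    a, b = 0b1100, 0b1010
    print(f"AND: {bulk_and(a, b):04b}, OR: {bulk_or(a, b):04b}")
```

Copying the operands into the designated rows first matters because triple activation overwrites all three activated rows with the result.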

### In-DRAM AND/OR Results

- 20X improvement in AND/OR throughput vs. Intel AVX
- 50.5X reduction in memory energy consumption
- At least 30% performance improvement in range queries



Seshadri+, "Fast Bulk Bitwise AND and OR in DRAM", IEEE CAL 2015.

## Going Forward



algorithms, compilers, and system designs that can take advantage of the model



## Two Approaches to In-Memory Processing

1. Minimally change DRAM to enable simple yet powerful computation primitives
   - RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data (Seshadri et al., MICRO 2013)
   - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
2. Exploit the control logic in 3D-stacked memory to enable more comprehensive computation near memory
   - PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture (Ahn et al., ISCA 2015)
   - A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (Ahn et al., ISCA 2015)



### Tesseract System for Graph Processing





### **Evaluated Systems**




### Workloads

#### Five graph processing algorithms

- Average teenage follower
- Conductance
- PageRank
- Single-source shortest path
- Vertex cover

#### Three real-world large graphs

- ljournal-2008 (social network)
- enwiki-2013 (Wikipedia)
- indochina-2004 (web graph)
- 4~7M vertices, 79~194M edges

### Tesseract Graph Processing Performance








### Memory Energy Consumption (Normalized)



### Challenge for Interconnects

### Interconnect Design for 3D-stacked Memory



### Rethinking DRAM

- In-Memory Computation
- Refresh
- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

### DRAM Refresh

- DRAM capacitor charge leaks over time
- The memory controller needs to refresh each row periodically to restore charge
  - Activate each row every N ms
  - Typical N = 64 ms
- Downsides of refresh
  - Energy consumption: Each refresh consumes energy
  - Performance degradation: DRAM rank/bank unavailable while refreshed
  - QoS/predictability impact: (Long) pause times during refresh
  - Refresh rate limits DRAM capacity scaling
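A rough way to quantify the performance downside is the fraction of time a rank is busy refreshing. The sketch below assumes a DDR3-style scheme in which the controller issues 8192 auto-refresh commands per 64 ms window and each command blocks the rank for tRFC. The function name and the tRFC values are illustrative assumptions, not numbers from this talk.

```python
def refresh_overhead(t_rfc_ns: float,
                     refresh_window_ms: float = 64.0,
                     refresh_cmds_per_window: int = 8192) -> float:
    """Fraction of time a rank is unavailable because it is refreshing.

    t_rfc_ns: time one refresh command keeps the rank busy (assumed input).
    The average gap between refresh commands (tREFI) is the refresh window
    divided by the number of refresh commands issued per window.
    """
    t_refi_ns = refresh_window_ms * 1e6 / refresh_cmds_per_window
    return t_rfc_ns / t_refi_ns


if __name__ == "__main__":
    # Hypothetical tRFC values: denser chips need longer refresh commands.
    for label, t_rfc in [("low density", 110.0),
                         ("mid density", 350.0),
                         ("high density", 1000.0)]:
        print(f"{label}: {refresh_overhead(t_rfc):.1%} of time spent refreshing")
```

Because tRFC grows with chip density while the refresh window stays fixed, the busy fraction rises with capacity, which is one way to read "refresh rate limits DRAM capacity scaling."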

[Figure: DRAM cell with bitline (BL), wordline (WL), capacitor (CAP), and sense amplifier (SENSE)]

### Refresh Overhead: Performance



Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.

### Refresh Overhead: Energy



Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.

### Retention Time Profile of DRAM

[Figure: retention time bins of 64-128ms, 128-256ms, and >256ms]

### RAIDR: Eliminating Unnecessary Refreshes

- Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+, EDL'09][Liu+ ISCA'13]
- Key idea: Refresh rows containing weak cells more frequently, other rows less frequently
  1. **Profiling:** Profile retention time of all rows
  2. **Binning:** Store rows into bins by retention time in memory controller
     - Efficient storage with Bloom Filters (only 1.25KB for 32GB memory)
  3. **Refreshing:** Memory controller refreshes rows in different bins at different rates
- Results: 8-core, 32GB, SPEC, TPC-C, TPC-H
  - 74.6% refresh reduction @ 1.25KB storage
  - ~16%/20% DRAM dynamic/idle power reduction
  - ~9% performance improvement
  - Benefits increase with DRAM capacity
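The profiling, binning, and refreshing steps can be sketched in software as follows. This is a minimal illustration, not the RAIDR hardware design: the filter sizes, the SHA-256 based hash functions, and the class and method names are all assumptions made for the example. It uses the three retention bins shown earlier (64-128ms, 128-256ms, >256ms) and keeps one Bloom filter for each of the two weaker bins.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, row: int):
        # Software stand-in for the hardware hash functions (an assumption).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{row}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, row: int) -> None:
        for p in self._positions(row):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, row: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(row))


class RefreshBinner:
    """Bins rows by profiled retention time and reports a refresh interval."""

    DEFAULT_INTERVAL_MS = 256  # rows found in neither filter

    def __init__(self) -> None:
        self.bin_64_128 = BloomFilter(num_bits=2048, num_hashes=4)   # hypothetical size
        self.bin_128_256 = BloomFilter(num_bits=8192, num_hashes=4)  # hypothetical size

    def profile(self, row: int, retention_ms: float) -> None:
        """Steps 1-2: record a row's measured retention time in the right bin."""
        if retention_ms < 128:
            self.bin_64_128.add(row)
        elif retention_ms < 256:
            self.bin_128_256.add(row)

    def refresh_interval_ms(self, row: int) -> int:
        """Step 3: choose the refresh interval, checking the weakest bin first."""
        if row in self.bin_64_128:
            return 64
        if row in self.bin_128_256:
            return 128
        return self.DEFAULT_INTERVAL_MS


if __name__ == "__main__":
    binner = RefreshBinner()
    binner.profile(row=10, retention_ms=70)
    binner.profile(row=20, retention_ms=200)
    binner.profile(row=30, retention_ms=1000)
    for row in (10, 20, 30):
        print(row, binner.refresh_interval_ms(row), "ms")
```

A Bloom filter suits this job because its only error mode is a false positive, which makes a row get refreshed more often than necessary; a weak row is never missed, so correctness is preserved and only a little refresh savings is lost.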



Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.

### Going Forward (for DRAM and Flash)

#### How to find out weak memory cells/rows

- Liu+, "An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms", ISCA 2013.
- Khan+, "The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study," SIGMETRICS 2014.

#### Low-cost system-level tolerance of memory errors

- Luo+, "Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost," DSN 2014.
- Cai+, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory," Intel Technology Journal 2013.
- Cai+, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories," SIGMETRICS 2014.

#### Tolerating cell-to-cell interference at the system level

- Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
- Cai+, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation," ICCD 2013.

### Experimental DRAM Testing Infrastructure



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)

An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)




### Experimental Infrastructure (DRAM)




Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

### More Information [ISCA'13, SIGMETRICS'14]

### The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study

Samira Khan<sup>†</sup>\* samirakhan@cmu.edu Donghyuk Lee<sup>†</sup> donghyuk1@cmu.edu Yoongu Kim<sup>†</sup> yoongukim@cmu.edu

Alaa R. Alameldeen\* Chris Wilkerson\* alaa.r.alameldeen@intel.com chris.wilkerson@intel.com Onur Mutlu<sup>†</sup> onur@cmu.edu

<sup>†</sup>Carnegie Mellon University \*Intel Labs

### Online Profiling of DRAM In the Field



without disturbing the system and applications

### Challenge for Interconnects

### Fault Tolerance



### Rethinking DRAM

- In-Memory Computation
- Refresh
- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

### **DRAM Latency-Capacity Trend**





DRAM latency continues to be a critical bottleneck, especially for response time-sensitive workloads

### What Causes the Long Latency?

**DRAM** Chip



### Why is the Subarray So Slow?



- Long bitline
  - Amortizes sense amplifier cost  $\rightarrow$  Small area
  - Large bitline capacitance  $\rightarrow$  High latency & power

### Trade-Off: Area (Die Size) vs. Latency

[Figure: long bitline vs. short bitline]



### **Approximating the Best of Both Worlds**






### Commodity DRAM vs. TL-DRAM [HPCA 2013]

- DRAM Latency (tRC)
- DRAM Power
- DRAM Area Overhead
  - **~3%**: mainly due to the isolation transistors

### Trade-Off: Area (Die-Area) vs. Latency



### Leveraging Tiered-Latency DRAM

- TL-DRAM is a *substrate* that can be leveraged by the hardware and/or software
- Many potential uses
  1. Use near segment as hardware-managed *inclusive* cache to far segment
  2. Use near segment as hardware-managed *exclusive* cache to far segment
  3. Profile-based page mapping by operating system
  4. Simply replace DRAM with TL-DRAM

### **Performance & Power Consumption**



## Using near segment as a cache improves performance and reduces power consumption

Lee+, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.

### Challenge for Interconnects

### Interconnect Design for Low-latency Memory



### What Else Causes the Long DRAM Latency?

#### Conservative timing margins!

DRAM timing parameters are set to cover the worst case

#### Worst-case temperatures

- 85°C vs. common-case
- to enable a wide range of operating conditions

#### Worst-case devices

- DRAM cell with smallest charge across any acceptable device
- to tolerate process variation at acceptable yield

#### This leads to large timing margins for the common case

### Adaptive-Latency DRAM [HPCA 2015]

- Idea: Optimize DRAM timing for the common case
  - Current temperature
  - Current DRAM module
- Why would this reduce latency?
  - A DRAM cell can store much more charge in the common case (low temperature, strong cell) than in the worst case
  - More charge in a DRAM cell
    - $\rightarrow$  Faster sensing, charge restoration, precharging
    - $\rightarrow$  Faster access (read, write, refresh, ...)

### AL-DRAM

- Key idea
  - Optimize DRAM timing parameters online
- Two components
  - DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  - System monitors DRAM temperature and uses the appropriate DRAM timing parameters
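The two components can be pictured as a per-DIMM lookup table plus a selection step. The sketch below is a hedged illustration: the `TimingSet` fields, the temperature thresholds, and all timing values except the worst-case row (which uses common DDR3-1600 datasheet numbers) are hypothetical and are not measurements from the AL-DRAM study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TimingSet:
    """One set of timing parameters, reliable up to max_temp_c (hypothetical)."""
    max_temp_c: float
    t_rcd_ns: float   # sensing
    t_ras_ns: float   # restore (read)
    t_wr_ns: float    # restore (write)
    t_rp_ns: float    # precharge


# Component 1: per-DIMM table that a manufacturer could provide.
# The 85C row is the conservative worst-case set; the others are made up.
DIMM_TIMINGS = [
    TimingSet(max_temp_c=55.0, t_rcd_ns=11.25, t_ras_ns=22.5, t_wr_ns=7.5, t_rp_ns=8.75),
    TimingSet(max_temp_c=70.0, t_rcd_ns=12.5, t_ras_ns=28.75, t_wr_ns=11.25, t_rp_ns=11.25),
    TimingSet(max_temp_c=85.0, t_rcd_ns=13.75, t_ras_ns=35.0, t_wr_ns=15.0, t_rp_ns=13.75),
]


def select_timings(temp_c: float, timing_sets: Sequence[TimingSet]) -> TimingSet:
    """Component 2: pick the fastest set that is still rated for temp_c.

    Falls back to the most conservative set if the temperature exceeds
    every rated threshold.
    """
    ordered = sorted(timing_sets, key=lambda ts: ts.max_temp_c)
    for ts in ordered:
        if temp_c <= ts.max_temp_c:
            return ts
    return ordered[-1]


if __name__ == "__main__":
    for temp in (40.0, 60.0, 90.0):
        ts = select_timings(temp, DIMM_TIMINGS)
        print(f"{temp:.0f}C -> set rated to {ts.max_temp_c:.0f}C, tRCD={ts.t_rcd_ns}ns")
```

Rounding the measured temperature up to the next rated threshold keeps the choice on the safe side: a cooler DIMM may run with slower timings than strictly needed, but never with faster ones than it is rated for.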

### Latency Reduction Summary of 115 DIMMs

- Latency reduction for read & write (55°C)
  - Read Latency: 32.7%
  - Write Latency: 55.1%
- Latency reduction for each timing parameter (55°C)
  - Sensing: 17.3%
  - Restore: 37.3% (read), 54.8% (write)
  - Precharge: 35.2%

Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.

### AL-DRAM: Real System Evaluation

### • System

### -CPU: AMD 4386 ( 8 Cores, 3.1GHz, 8MB

D18F2x200\_dct[0]\_mp[1:0] DDR3 DRAM Timing 0

Reset: 0F05\_0505h. See 2.9.3 [DCT Configuration Registers].

|       |                                                                                                                                                                                                                                                 | 1                                                                                                                                                                                                                                                          |
|-------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Bits  | Description                                                                                                                                                                                                                                     |                                                                                                                                                                                                                                                            |
| 31:30 | Reserved.                                                                                                                                                                                                                                       |                                                                                                                                                                                                                                                            |
| 29:24 |                                                                                                                                                                                                                                                 | e strobe. Read-write. BIOS: See 2.9.7.5 [SPD ROM-Based Configuration]. Specifies<br>ne in memory clock cycles from an activate command to a precharge command, both<br>select bank.<br><u>Description</u><br>Reserved<br><tras> clocks<br/>Reserved</tras> |
| 23:21 | Reserved.                                                                                                                                                                                                                                       |                                                                                                                                                                                                                                                            |
| 20:16 | <b>Trp: row precharge time</b> . Read-write. BIOS: See 2.9.7.5 [SPD ROM-Based Configuration]. Specifies the minimum time in memory clock cycles from a precharge command to an activate command or auto refresh command, both to the same bank. |                                                                                                                                                                                                                                                            |
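The register fields above hold timing values in memory clock cycles, so any change to a timing parameter has to be rounded up to a whole number of cycles. The following is a minimal sketch of that conversion. The clock period and the nanosecond values are assumptions chosen for illustration (a DDR3-1600-style 1.25 ns clock), and the "reduced" set is hypothetical, not measured AL-DRAM data.

```python
def to_cycles(t_ps: int, tck_ps: int) -> int:
    """Smallest whole number of clock cycles that covers t_ps (picoseconds)."""
    if t_ps < 0 or tck_ps <= 0:
        raise ValueError("timing must be non-negative and clock period positive")
    return -(-t_ps // tck_ps)  # integer ceiling division


# Assumed clock period: 1.25 ns (800 MHz memory clock).
TCK_PS = 1250

# Assumed standard timings in picoseconds (illustrative values).
STANDARD_PS = {"tRAS": 35_000, "tRP": 13_750}

# Hypothetical reduced timings for a module with extra margin.
REDUCED_PS = {"tRAS": 27_500, "tRP": 10_000}


def register_values(timings_ps: dict, tck_ps: int = TCK_PS) -> dict:
    """Convert each timing parameter to the cycle count a register would hold."""
    return {name: to_cycles(t, tck_ps) for name, t in timings_ps.items()}


if __name__ == "__main__":
    print("standard:", register_values(STANDARD_PS))
    print("reduced: ", register_values(REDUCED_PS))
```

Rounding up rather than to the nearest cycle keeps the programmed value from ever being shorter than the requested time.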



AL-DRAM improves performance on a real system

### **AL-DRAM: Multi-Core Evaluation**



# Rethinking DRAM

- In-Memory Computation
- Refresh
- Reliability
- Latency









- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary



# Solution 2: Emerging Memory Technologies

- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
- Example: Phase Change Memory
  - Data stored by changing phase of material
  - Data read by detecting material's resistance
  - Expected to scale to 9nm (2022 [ITRS])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
  - Expected to be denser than DRAM: can store multiple bits/cell
- But, emerging technologies have (many) shortcomings
  - Can they be enabled to replace/augment/surpass DRAM?
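To make "data read by detecting material's resistance" and "multiple bits/cell" concrete, here is a minimal sketch of a 2-bit cell read. The resistance boundaries and the bit labeling are hypothetical values chosen for illustration. The only physical fact assumed is that the crystalline phase has lower resistance than the amorphous phase.

```python
from bisect import bisect_right

# Hypothetical resistance boundaries (ohms) separating four cell states.
BOUNDARIES_OHM = [10_000, 100_000, 1_000_000]

# Assumed labeling: lowest resistance (most crystalline) first,
# highest resistance (most amorphous) last.
LEVELS = ["11", "10", "01", "00"]


def read_cell(resistance_ohm: float) -> str:
    """Map a sensed resistance to the 2-bit value stored in the cell."""
    return LEVELS[bisect_right(BOUNDARIES_OHM, resistance_ohm)]


if __name__ == "__main__":
    for r in (5_000, 50_000, 500_000, 5_000_000):
        print(f"{r:>9} ohm -> {read_cell(r)}")
```

Packing more levels into one cell narrows the gap between boundaries, which is one reason resistance drift becomes a reliability concern.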



## Limits of Charge Memory

- Difficult charge placement and control
  - Flash: floating gate charge
  - DRAM: capacitor charge, transistor leakage
- Reliable sensing becomes difficult as charge storage unit size reduces



# Promising Resistive Memory Technologies

- PCM
  - Inject current to change material phase
  - Resistance determined by phase
- STT-MRAM
  - Inject current to change magnet polarity
  - Resistance determined by polarity
- Memristors/RRAM/ReRAM
  - Inject current to change atomic structure
  - Resistance determined by atom distance

### Phase Change Memory: Pros and Cons

- Pros over DRAM
  - Better technology scaling (capacity and cost)
  - Non volatility
  - Low idle power (no refresh)

- Cons
  - Higher latencies: ~4-15x DRAM (especially write)
  - Higher active energy: ~2-50x DRAM (especially write)
  - Lower endurance (a cell dies after ~$10^8$ writes)
  - Reliability issues (resistance drift)
- Challenges in enabling PCM as DRAM replacement/helper:
  - Mitigate PCM shortcomings
  - Find the right way to place PCM in the system

# PCM-based Main Memory (I)

How should PCM-based (main) memory be organized?



- Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09]:
  - How to partition/migrate data between PCM and DRAM



# PCM-based Main Memory (II)

How should PCM-based (main) memory be organized?



- Pure PCM main memory [Lee et al., ISCA'09, Top Picks'10]:
  - How to redesign entire hierarchy (and cores) to overcome PCM shortcomings



# An Initial Study: Replace DRAM with PCM

- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
  - Surveyed prototypes from 2003-2008 (e.g., IEDM, VLSI, ISSCC)
  - Derived "average" PCM parameters for F=90nm

#### Density

- 9-12 $F^2$ using BJT
- 1.5× DRAM

#### Endurance

- 1E+08 writes
- 1E-08× DRAM

#### Latency

- 50ns Rd, 150ns Wr
- 4×, 12× DRAM

#### Energy

- 40µA Rd, 150µA Wr
- 2×, 43× DRAM



### Results: Naïve Replacement of DRAM with PCM

- Replace DRAM with PCM in a 4-core, 4MB L2 system
- PCM organized the same as DRAM: row buffers, banks, peripherals
- 1.6x delay, 2.2x energy, 500-hour average lifetime



 Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

# Results: Architected PCM as Main Memory

- 1.2x delay, 1.0x energy, 5.6-year average lifetime
- Scaling improves energy, endurance, density



- Caveat 1: Worst-case lifetime is much shorter (no guarantees)
- Caveat 2: Intensive applications see large performance and energy hits
- Caveat 3: Optimistic PCM parameters?
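The gap between average and worst-case lifetime in Caveat 1 can be seen with a back-of-the-envelope model. This is a sketch under stated assumptions, not the evaluation methodology of the ISCA 2009 paper: the average case assumes writes are spread perfectly evenly over the whole memory, and the worst case assumes one line is rewritten at a fixed rate with no wear leveling. The capacity and write rates in the demo are hypothetical. Only the $10^8$ endurance figure comes from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def ideal_lifetime_years(endurance_writes, capacity_bytes, write_bytes_per_sec):
    """Average lifetime if write traffic is spread evenly over all cells."""
    total_writable_bytes = endurance_writes * capacity_bytes
    return total_writable_bytes / write_bytes_per_sec / SECONDS_PER_YEAR


def hot_line_lifetime_hours(endurance_writes, writes_per_sec_to_line):
    """Lifetime of one line rewritten at a fixed rate with no wear leveling."""
    return endurance_writes / writes_per_sec_to_line / 3600


if __name__ == "__main__":
    endurance = 1e8            # writes per cell, from the slides
    capacity = 4 * 2**30       # hypothetical 4 GiB PCM main memory
    traffic = 1e9              # hypothetical 1 GB/s of write traffic
    print(f"evenly spread: {ideal_lifetime_years(endurance, capacity, traffic):.1f} years")
    print(f"one hot line at 1000 writes/s: {hot_line_lifetime_hours(endurance, 1000):.1f} hours")
```

The same endurance yields years in one model and hours in the other, which is why techniques that reduce and redistribute writes matter as much as the raw endurance number.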

# Solution 3: Hybrid Memory Systems



#### Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012.

Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

# Hybrid vs. All-PCM/DRAM [ICCD'12]

[Figure legend: 16GB PCM, RBLA-Dyn, 16GB DRAM]



Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.
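A simplified sketch of the general idea behind row-buffer-locality-aware placement: rows that repeatedly miss in the PCM row buffer pay PCM's high array latency, so they are the ones worth moving to DRAM, while rows that mostly hit in the row buffer can stay in PCM. The class name, the fixed threshold, and the return strings are hypothetical. This is an illustration, not the exact mechanism evaluated in Yoon+ (ICCD 2012).

```python
from collections import defaultdict


class RowMissTracker:
    """Counts PCM row-buffer misses per row and migrates frequent missers to DRAM."""

    def __init__(self, miss_threshold: int = 2):
        self.miss_threshold = miss_threshold   # hypothetical fixed threshold
        self.open_row = {}                     # bank -> row held in its row buffer
        self.miss_count = defaultdict(int)     # (bank, row) -> row-buffer misses
        self.in_dram = set()                   # rows already migrated to DRAM

    def access(self, bank: int, row: int) -> str:
        key = (bank, row)
        if key in self.in_dram:
            return "dram"
        if self.open_row.get(bank) == row:
            return "pcm-row-hit"               # served from the row buffer
        # Row-buffer miss: the PCM array must be accessed.
        self.open_row[bank] = row
        self.miss_count[key] += 1
        if self.miss_count[key] >= self.miss_threshold:
            self.in_dram.add(key)
            return "pcm-row-miss-migrate"
        return "pcm-row-miss"


if __name__ == "__main__":
    tracker = RowMissTracker()
    for bank, row in [(0, 1), (0, 1), (0, 2), (0, 1), (0, 1)]:
        print((bank, row), tracker.access(bank, row))
```

A dynamic variant would adjust `miss_threshold` at run time based on observed benefit, rather than fixing it up front.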

### Challenge for Interconnects

# Efficient Data Movement (across Multiple Memories)

### STT-MRAM as Main Memory

- Magnetic Tunnel Junction (MTJ) device
  - Reference layer: Fixed magnetic orientation
  - Free layer: Parallel or anti-parallel
- Magnetic orientation of the free layer determines logical state of device
  - High vs. low resistance
- Write: Push large current through MTJ to change orientation of free layer
- Read: Sense current flow
- Kultursay et al., "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013.







### STT-MRAM: Pros and Cons

- Pros over DRAM
  - Better technology scaling
  - Non volatility
  - Low idle power (no refresh)
- Cons
  - Higher write latency
  - Higher write energy
  - Reliability?
- Another level of freedom
  - Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)

# Architected STT-MRAM as Main Memory

- 4-core, 4GB main memory, multiprogrammed workloads
- ~6% performance loss, ~60% energy savings vs. DRAM



Kultursay+, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013.

### Other Opportunities with Emerging Technologies

- Merging of memory and storage
  - e.g., a single interface to manage all data
- New applications
  - e.g., ultra-fast checkpoint and restore
- More robust system design
  - e.g., reducing data loss
- Processing tightly-coupled with memory
  - e.g., enabling efficient search and filtering

### Coordinated Memory and Storage with NVM (I)

- The traditional two-level storage model is a bottleneck with NVM
  - Volatile data in memory  $\rightarrow$  a load/store interface
  - **Persistent** data in storage  $\rightarrow$  a **file system** interface
  - Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data becomes a performance and energy bottleneck with fast NVM stores



### Coordinated Memory and Storage with NVM (II)

- Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  - Improves both energy and performance
  - Simplifies programming model as well



### The Persistent Memory Manager (PMM)

- Exposes a load/store interface to access persistent data
  - Applications can directly access persistent memory $\rightarrow$ no conversion, translation, location overhead for persistent data
- Manages data placement, location, persistence, security
  - To get the best of multiple forms of storage
- Manages metadata storage and retrieval
  - This can lead to overheads that need to be managed
- Exposes hooks and interfaces for system software
  - To enable better data placement and management decisions
- Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.
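The difference between a file-system interface and a load/store interface can be illustrated on a conventional OS with a memory-mapped file. This is only an analogy, not the PMM itself (which is a hardware/software proposal): the function names are hypothetical, and `mmap` here still sits on top of an ordinary file system. The sketch updates the same 8-byte persistent counter both ways.

```python
import mmap
import os
import struct
import tempfile


def bump_via_file_api(path: str) -> int:
    """Update the counter with explicit read/seek/write calls."""
    with open(path, "r+b") as f:
        (value,) = struct.unpack("<Q", f.read(8))
        f.seek(0)
        f.write(struct.pack("<Q", value + 1))
    return value + 1


def bump_via_load_store(path: str) -> int:
    """Update the counter by reading and writing mapped memory directly."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 8) as mem:
        (value,) = struct.unpack_from("<Q", mem, 0)   # "load"
        struct.pack_into("<Q", mem, 0, value + 1)     # "store"
        mem.flush()                                   # push the update to the file
    return value + 1


def demo() -> tuple:
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, struct.pack("<Q", 0))
        os.close(fd)
        return bump_via_file_api(path), bump_via_load_store(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

In the mapped version the data is addressed like memory after setup, which is the programming-model simplification the slides argue for. The PMM proposal goes further by removing the OS and file-system work from the access path.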

### The Persistent Memory Manager (PMM)



PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices

### Performance Benefits of a Single-Level Store



Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.

### Energy Benefits of a Single-Level Store



Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.

### Challenge for Interconnects

# Design for Emerging Memories



- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary



# Principles (So Far)

- Better cooperation between devices and the system
  - Expose more information about devices to upper layers
  - More flexible interfaces
- Better-than-worst-case design
  - Do not optimize for the worst case
  - Worst case should not determine the common case
- Heterogeneity in design (specialization, asymmetry)
  - Enables a more efficient design (No one size fits all)
- These principles are coupled



- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary

# Summary: Memory Scaling

- Memory scaling problems are a critical bottleneck for system performance, efficiency, and usability
- New memory architectures
  - A lot of hope in fixing DRAM
- Enabling emerging NVM technologies
  - A lot of hope in hybrid memory systems and single-level stores
- System-level memory/storage QoS
  - A lot of hope in designing a predictable system
- Three principles are essential for scaling
  - Software/hardware/device cooperation
  - Better-than-worst-case design
  - Heterogeneity (specialization, asymmetry)

### Takeaway for Interconnects

# Memory-Centric Interconnect Design





### Future?



### Better Future?



# Acknowledgments

- My current and past students and postdocs
  - Rachata Ausavarungnirun, Abhishek Bhowmick, Amirali Boroumand, Rui Cai, Yu Cai, Kevin Chang, Saugata Ghose, Kevin Hsieh, Tyler Huberty, Ben Jaiyen, Samira Khan, Jeremie Kim, Yoongu Kim, Yang Li, Jamie Liu, Donghyuk Lee, Yixin Luo, Justin Meza, Gennady Pekhimenko, Vivek Seshadri, Lavanya Subramanian, Nandita Vijaykumar, HanBin Yoon, Jishen Zhao, ...
- My collaborators at CMU
  - Greg Ganger, Phil Gibbons, Mor Harchol-Balter, James Hoe, Mike Kozuch, Ken Mai, Todd Mowry, ...
- My collaborators elsewhere
  - Can Alkan, Chita Das, Sriram Govindan, Norm Jouppi, Mahmut Kandemir, Konrad Lai, Yale Patt, Moinuddin Qureshi, Partha Ranganathan, Bikash Sharma, Kushagra Vaid, Chris Wilkerson, ...

### Funding Acknowledgments

- NSF
- GSRC
- SRC
- CyLab
- AMD, Google, Facebook, HP Labs, Huawei, IBM, Intel, Microsoft, Nvidia, Oracle, Qualcomm, Rambus, Samsung, Seagate, VMware

### Open Source Tools

- Rowhammer
  - https://github.com/CMU-SAFARI/rowhammer
- Ramulator
  - https://github.com/CMU-SAFARI/ramulator
- MemSim
  - https://github.com/CMU-SAFARI/memsim
- NOCulator
  - https://github.com/CMU-SAFARI/NOCulator
- DRAM Error Model
  - http://www.ece.cmu.edu/~safari/tools/memerr/index.html
- Other open-source software from my group
  - https://github.com/CMU-SAFARI/
  - http://www.ece.cmu.edu/~safari/tools.html


#### **Referenced Papers**

All are available at

- http://users.ece.cmu.edu/~omutlu/projects.htm
- http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en

- A detailed accompanying overview paper
  - Onur Mutlu and Lavanya Subramanian, "Research Problems and Opportunities in Memory Systems," Invited Article in Supercomputing Frontiers and Innovations (SUPERFRI), 2015.



#### Related Videos and Course Materials

- Undergraduate Computer Architecture Course Lecture Videos (2013, 2014, 2015)
- Undergraduate Computer Architecture Course Materials (2013, 2014, 2015)
- Graduate Computer Architecture Course Materials (Lecture Videos)
- Parallel Computer Architecture Course Materials (Lecture Videos)
- Memory Systems Short Course Materials (Lecture Video on Main Memory and DRAM Basics)

Thank you.

onur@cmu.edu

http://users.ece.cmu.edu/~omutlu/






Backup Slides

# Memory-Interconnect Examples

#### Some Examples

- Application-awareness in the interconnect [Das+ MICRO'09, HPCA'13]
- Focus on critical requests [Aergia, Das+ ISCA'10]
- Quality of Service and predictability [Grot+ MICRO'09, ISCA'11]
- Ultra-efficient data movement [BLESS ISCA'09, Chang+ HPCA'16]
- Interconnect design for memory systems (DRAM, hybrid memory, NVM) [Lee+ HPCA'13, Chang+ HPCA'16]



### Application-Aware Interconnect Design

- Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das, "Application-Aware Prioritization Mechanisms for On-Chip Networks" Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 280-291, New York, NY, December 2009. Slides (pptx)
- Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi, "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)
- Asit K. Mishra, Onur Mutlu, and Chita R. Das, "A Heterogeneous Multiple Network-on-Chip Design: An Application-Aware Approach" Proceedings of the 50th Design Automation Conference (DAC), Austin, TX, June 2013. Slides (pptx) Slides (pdf)


#### Focus on Criticality

- Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das, "Aergia: Exploiting Packet Latency Slack in On-Chip Networks" Proceedings of the 37th International Symposium on Computer Architecture (ISCA), pages 106-116, Saint-Malo, France, June 2010. Slides (pptx)

#### QoS and Predictability

- Boris Grot, Stephen W. Keckler, and Onur Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip" Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 268-279, New York, NY, December 2009. Slides (pdf)
- Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees" Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, June 2011. Slides (pptx)

### Efficient Data Movement: Bufferless (I)

- Thomas Moscibroda and Onur Mutlu, "A Case for Bufferless Routing in On-Chip Networks" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 196-207, Austin, TX, June 2009. Slides (pptx)
- Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router" Proceedings of the 17th International Symposium on High-Performance Computer Architecture (HPCA), pages 144-155, San Antonio, TX, February 2011. Slides (pptx)
- Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect" Proceedings of the 6th ACM/IEEE International Symposium on Networks on Chip (NOCS), Lyngby, Denmark, May 2012. Slides (pptx) (pdf)
- Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel Loh, and Onur Mutlu, "Design and Evaluation of Hierarchical Rings with Deflection Routing" Proceedings of the 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Paris, France, October 2014. [Slides (pptx) (pdf)] [Source Code]



### Efficient Data Movement: Bufferless (II)

- George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx)
- George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?" Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010. Slides (ppt) (key)
- Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks" Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, October 2012. Slides (pptx) (pdf)

#### Efficient Data Movement: Heterogeneous

- Asit K. Mishra, Onur Mutlu, and Chita R. Das, "A Heterogeneous Multiple Network-on-Chip Design: An Application-Aware Approach" Proceedings of the 50th Design Automation Conference (DAC), Austin, TX, June 2013. Slides (pptx) Slides (pdf)



#### Interconnect Design for Memory

- Kevin Chang et al., "Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Migration in DRAM," HPCA 2016.
- Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)



# NAND Flash Memory Scaling

#### Another Talk: NAND Flash Scaling Challenges

- Onur Mutlu, "Error Analysis and Management for MLC NAND Flash Memory" Technical talk at Flash Memory Summit 2014 (FMS), Santa Clara, CA, August 2014. Slides (ppt) (pdf)

Cai+, "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis," DATE 2012.

Cai+, "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime," ICCD 2012.

Cai+, "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling," DATE 2013.

Cai+, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory," Intel Technology Journal 2013.

Cai+, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation," ICCD 2013.

Cai+, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories," SIGMETRICS 2014.

Cai+, "Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery," HPCA 2015.

Cai+, "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation," DSN 2015.

Luo+, "WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management," MSST 2015.

Meza+, "A Large-Scale Study of Flash Memory Errors in the Field," SIGMETRICS 2015.


#### Experimental Infrastructure (Flash)



[Cai+, DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015, MSST 2015] NAND Daughter Board



#### Error Management in MLC NAND Flash Memory

- Problem: MLC NAND flash memory reliability/endurance is a key challenge for satisfying future storage systems' requirements
- Our Goals: (1) Build reliable error models for NAND flash memory via experimental characterization, (2) Develop efficient techniques to improve reliability and endurance
- This talk provides a "flash" summary of our recent results published in the past 3 years:
  - Experimental error and threshold voltage characterization [DATE'12&13]
  - Retention-aware error management [ICCD'12]
  - Program interference analysis and read reference V prediction [ICCD'13]
  - Neighbor-assisted error correction [SIGMETRICS'14]



# Ramulator: A Fast and Extensible DRAM Simulator [IEEE Comp Arch Letters'15]

#### Ramulator Motivation

- DRAM and Memory Controller landscape is changing
- Many new and upcoming standards
- Many new controller designs
- A fast and easy-to-extend simulator is very much needed

| Segment     | DRAM Standards & Architectures |
|-------------|--------------------------------|
| Commodity   | DDR3 (2007) [14]; DDR4 (2012) [18] |
| Low-Power   | LPDDR3 (2012) [17]; LPDDR4 (2014) [20] |
| Graphics    | GDDR5 (2009) [15] |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29] |
| 3D-Stacked  | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13]; HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11] |
| Academic    | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27]; SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37]; Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33]; SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

#### Ramulator

- Provides out-of-the-box support for many DRAM standards:
  - DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP)
- ~2.5X faster than the fastest open-source simulator
- Modular and extensible to different standards

| Simulator (clang -O3) | Cycles (10<sup>6</sup>), Random | Cycles (10<sup>6</sup>), Stream | Runtime (sec.), Random | Runtime (sec.), Stream | Req/sec (10<sup>3</sup>), Random | Req/sec (10<sup>3</sup>), Stream | Memory (MB) |
|-----------------------|--------|--------|---------|---------|-----|-----|---------|
| Ramulator             | 652    | 411    | 752     | 249     | 133 | 402 | 2.1     |
| DRAMSim2              | 645    | 413    | 2,030   | 876     | 49  | 114 | 1.2     |
| USIMM                 | 661    | 409    | 1,880   | 750     | 53  | 133 | 4.5     |
| DrSim                 | 647    | 406    | 18,109  | 12,984  | 6   | 8   | 1.6     |
| NVMain                | 666    | 413    | 6,881   | 5,023   | 15  | 20  | 4,230.0 |

Table 3. Comparison of five simulators using two traces
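The "~2.5X faster" claim can be checked against the runtime columns of Table 3. The sketch below simply divides the fastest competing simulator's runtime by Ramulator's for each trace. The dictionary layout and function name are illustrative choices, and the numbers are copied from the table.

```python
# Runtime in seconds from Table 3: (random trace, stream trace).
RUNTIME_SEC = {
    "Ramulator": (752, 249),
    "DRAMSim2": (2030, 876),
    "USIMM": (1880, 750),
    "DrSim": (18109, 12984),
    "NVMain": (6881, 5023),
}

RANDOM, STREAM = 0, 1


def speedup_over_fastest_other(trace: int) -> float:
    """Ramulator's speedup over the fastest of the other simulators."""
    ramulator = RUNTIME_SEC["Ramulator"][trace]
    fastest_other = min(
        times[trace] for name, times in RUNTIME_SEC.items() if name != "Ramulator"
    )
    return fastest_other / ramulator


if __name__ == "__main__":
    print(f"random: {speedup_over_fastest_other(RANDOM):.2f}x")
    print(f"stream: {speedup_over_fastest_other(STREAM):.2f}x")
```

On both traces the closest competitor is USIMM, and the random trace gives the 2.5x figure quoted on the slide.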



#### Case Study: Comparison of DRAM Standards

| Standard          | Rate (MT/s) | Timing (CL-RCD-RP) | Data-Bus (Width × Chan.) | Rank-per-Chan | BW (GB/s) |
|-------------------|-------------|--------------------|--------------------------|---------------|-----------|
| DDR3              | 1,600       | 11-11-11           | 64-bit × 1               | 1             | 11.9      |
| DDR4              | 2,400       | 16-16-16           | 64-bit × 1               | 1             | 17.9      |
| SALP<sup>†</sup>  | 1,600       | 11-11-11           | 64-bit × 1               | 1             | 11.9      |
| LPDDR3            | 1,600       | 12-15-15           | 64-bit × 1               | 1             | 11.9      |
| LPDDR4            | 2,400       | 22-22-22           | 32-bit × 2*              | 1             | 17.9      |
| GDDR5 [12]        | 6,000       | 18-18-18           | 64-bit × 1               | 1             | 44.7      |
| HBM               | 1,000       | 7-7-7              | 128-bit × 8*             | 1             | 119.2     |
| WIO               | 266         | 7-7-7              | 128-bit × 4*             | 1             | 15.9      |
| WIO2              | 1,066       | 9-10-10            | 128-bit × 8*             | 1             | 127.2     |
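The bandwidth column follows from the rate and data-bus columns: peak bandwidth is transfers per second times bytes per transfer times the number of channels. The sketch below reproduces several rows. One inference, not stated in the table, is that the values appear to be reported in units of $2^{30}$ bytes per second, since that is what makes the arithmetic match (for example, DDR3-1600 gives 12.8 × 10^9 B/s, or about 11.9 when divided by $2^{30}$).

```python
def peak_bandwidth(rate_mt_s: float, bus_width_bits: int, channels: int) -> float:
    """Peak bandwidth in units of 2**30 bytes per second."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_sec = rate_mt_s * 1e6 * bytes_per_transfer * channels
    return bytes_per_sec / 2**30


# (rate in MT/s, bus width in bits, channels), taken from the table.
CONFIGS = {
    "DDR3": (1600, 64, 1),
    "DDR4": (2400, 64, 1),
    "LPDDR4": (2400, 32, 2),
    "GDDR5": (6000, 64, 1),
    "HBM": (1000, 128, 8),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name:7s} {peak_bandwidth(*cfg):6.1f}")
```

The wide-interface standards reach high bandwidth at low per-pin rates purely through bus width and channel count, which is the contrast the case study highlights.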



#### Ramulator Paper and Source Code

- Yoongu Kim, Weikun Yang, and Onur Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator" IEEE Computer Architecture Letters (CAL), March 2015. [Source Code]
- Source code is released under the liberal MIT License
  - https://github.com/CMU-SAFARI/ramulator

# DRAM Infrastructure

#### Experimental DRAM Testing Infrastructure



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)

An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)




#### Experimental Infrastructure (DRAM)




Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

# ThyNVM: Software-Transparent Crash Consistency in NVMs

### ThyNVM: Transparent Hybrid NVM

- **Problem**: How do you provide consistency and prevent data corruption in NVM upon a system crash?
- Goal: Provide efficient programmer-transparent crash consistency in hybrid NVM
  - Transparency: no library APIs or explicit interfaces to access NVM; just loads and stores
    - Easier to support legacy code and hypervisors
    - No programmer effort to adopt persistent memory
  - **Efficiency**: use hybrid DRAM/NVM for high performance

#### ThyNVM

- Idea 1: Transparent periodic checkpointing of data
  - Need to overlap checkpointing and execution
- Idea 2: Differentiated checkpointing schemes for different types of updates
  - Page Writeback: for sequential accesses (use DRAM)
  - Address Remapping: for random accesses (use NVM/DRAM)
- Idea 3: Coordination/switching between checkpointing schemes for high performance

Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.

### Checkpointing Tradeoffs in Hybrid Memory

| Location of working copy   | Small checkpointing granularity (cache block) | Large checkpointing granularity (page) |
|----------------------------|-----------------------------------------------|----------------------------------------|
| DRAM (based on writeback)  | Inefficient: large metadata overhead; long checkpointing latency | Partially efficient: small metadata overhead; long checkpointing latency |
| NVM (based on remapping)   | Partially efficient: large metadata overhead; short checkpointing latency; fast remapping | Inefficient: small metadata overhead; short checkpointing latency; slow remapping (on the critical path) |
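The "large" versus "small" metadata overhead in the table comes down to how many units must be tracked. As a rough sketch, the number of tracking entries for a memory region is the region size divided by the checkpointing granule. The 64-byte block, 4 KiB page, and 1 GiB region below are assumed sizes for illustration, not values given on the slide.

```python
def tracking_entries(region_bytes: int, granule_bytes: int) -> int:
    """Number of metadata entries needed to track a region at a given granularity."""
    if region_bytes < 0 or granule_bytes <= 0:
        raise ValueError("region must be non-negative and granule positive")
    return -(-region_bytes // granule_bytes)  # ceiling division


BLOCK_BYTES = 64        # assumed cache block size
PAGE_BYTES = 4096       # assumed page size
REGION_BYTES = 2**30    # assumed 1 GiB of tracked memory

if __name__ == "__main__":
    blocks = tracking_entries(REGION_BYTES, BLOCK_BYTES)
    pages = tracking_entries(REGION_BYTES, PAGE_BYTES)
    print(f"block granularity: {blocks:,} entries")
    print(f"page granularity:  {pages:,} entries")
    print(f"ratio: {blocks // pages}x")
```

Under these assumptions block-granularity tracking needs 64 times as many entries as page-granularity tracking, which is the cost that page-level checkpointing avoids.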

Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.

#### ThyNVM: Dual-Scheme Checkpointing

- Idea: Combine two types of checkpointing schemes to adapt to different types of access patterns
- Sparse updates with low spatial locality $\rightarrow$ address remapping
  - $\rightarrow$ block granularity checkpointing
  - $\rightarrow$ working copy stored in NVM (for short ckpt latency)
- Dense updates with high spatial locality $\rightarrow$ page writeback
  - $\rightarrow$ page granularity checkpointing (small metadata)
  - $\rightarrow$ working copy stored in DRAM for fast buffering; written back to NVM during ckpt
- Can switch between schemes when one is on critical path
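The dual-scheme selection can be sketched as a per-page classification by update density: pages with many dirty blocks are treated as dense and checkpointed by page writeback, while pages with few dirty blocks use block-granularity address remapping. The block and page sizes, the threshold, and the function name are hypothetical. This illustrates the selection idea only and does not model ThyNVM's hardware tables or its run-time switching.

```python
from collections import defaultdict

BLOCK_BYTES = 64                               # assumed cache block size
PAGE_BYTES = 4096                              # assumed page size
BLOCKS_PER_PAGE = PAGE_BYTES // BLOCK_BYTES


def choose_schemes(dirty_addrs, dense_threshold=BLOCKS_PER_PAGE // 2):
    """Pick a checkpointing scheme per page from the addresses written in an epoch."""
    dirty_blocks = defaultdict(set)            # page number -> dirty block indices
    for addr in dirty_addrs:
        page = addr // PAGE_BYTES
        block = (addr % PAGE_BYTES) // BLOCK_BYTES
        dirty_blocks[page].add(block)
    return {
        page: "page-writeback" if len(blocks) >= dense_threshold else "block-remapping"
        for page, blocks in dirty_blocks.items()
    }


if __name__ == "__main__":
    dense = list(range(0, PAGE_BYTES, BLOCK_BYTES))       # every block of page 0
    sparse = [5 * PAGE_BYTES, 5 * PAGE_BYTES + 128]       # two blocks of page 5
    print(choose_schemes(dense + sparse))
```

Counting distinct blocks rather than raw writes keeps a page that is rewritten repeatedly in one spot from being misclassified as dense.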

Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.

#### ThyNVM Performance (I)

In-memory storage workloads



8.8%/29.9% higher throughput than journaling/shadow paging with a hash table based key-value store

Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.

### ThyNVM Performance (II)

Legacy compute-intensive workloads



- Within 3.4% of Ideal DRAM
- 2.7% higher performance than Ideal NVM

Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.