

# Landscape of Synaptic Weight Memories

Prof. Shimeng Yu Georgia Institute of Technology

Email: <u>shimeng.yu@ece.gatech.edu</u> Web: <u>https://shimeng.ece.gatech.edu/</u>

### Outline

- Background and Motivation
- Synaptic Devices: State-of-the-Art
- Variability and Reliability Characterization at Array-level
- Benchmark of Synaptic Devices for Inference and Training
- Chip-level Demonstrations: State-of-the-art

#### **Artificial Intelligence (AI) Applications**



Waymo is first to put fully self-driving cars on US roads without a safety driver: going Level 4 in Arizona (The Verge)



Pranav Rajpurkar<sup>\*</sup>, Jeremy Irvin<sup>\*</sup>, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, Andrew Y. Ng



Google's new wireless headphones can translate languages on the fly (7:30 PM ET, Wed. 4 Oct 2017)

AI today is widely used in computer vision (e.g. image classification), natural language processing (e.g. language translation), etc.

Deep Convolutional Network (DCN)



Deep Residual Network (DRN)



Generative Adversarial Network (GAN)









#### Deep neural network (DNN) topologies

http://www.asimovinstitute.org/neural-network-zoo/

#### **Typical DNN Models for Image Classification**





| CIFAR-10 network | Parameters (MB) | Total operations (G) | ImageNet network | Parameters (MB) | Total operations (G) |
|------------------|-----------------|----------------------|------------------|-----------------|----------------------|
| VGG-8            | 13              | 0.60                 | AlexNet          | 61              | 1.44                 |
| ResNet-20        | 0.27            | 0.04                 | VGG-16           | 138             | 31                   |
| ResNet-32        | 0.46            | 0.07                 | VGG-19           | 144             | 39                   |
| ResNet-44        | 0.66            | 0.11                 | ResNet-18        | 11              | 3.7                  |
| ResNet-110       | 1.7             | 0.26                 | ResNet-34        | 23              | 7.2                  |
| DenseNet-40      | 1.0             | 0.28                 | ResNet-152       | 60              | 23                   |
| DenseNet-100     | 7.0             | 0.94                 | DenseNet-121     | 7.9             | 5.9                  |

For image classification, the model size is tens of MB

For language translation, the model size can be up to 10 GB

→ Requires 10 MB to 10 GB of on-chip memory

→ Thus requires multi-bit cells and 3D integration

#### **CONVOLUTIONAL NEURAL NETWORKS**



**Training:** to learn weights iteratively with back-propagation of errors from the output labeled data $\rightarrow$ "write" intensive to synaptic weight memories

**Inference:** after training is done, feedforward propagation for prediction only $\rightarrow$ "read" intensive to synaptic weight memories

**Most intensive computation:** vector-matrix multiplication (to be accelerated by hardware)
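A minimal sketch of the vector-matrix multiplication named above, in plain Python. The function name `vmm` and the toy numbers are illustrative assumptions, not from the slides. Each output is the sum of the inputs weighted by one column of the weight matrix, which is the operation one memory-array column would perform.

```python
def vmm(inputs, weights):
    """Multiply a length-R input vector by an R x C weight matrix (list of rows)."""
    rows, cols = len(weights), len(weights[0])
    assert len(inputs) == rows, "one input per weight-matrix row"
    # Each output column accumulates input[r] * weight[r][c] over all rows.
    return [sum(inputs[r] * weights[r][c] for r in range(rows)) for c in range(cols)]


# Toy example (hypothetical values): 3 inputs, 2 output columns.
activations = [1, 0, 2]
weight_matrix = [[1, 2],
                 [3, 4],
                 [5, 6]]
outputs = vmm(activations, weight_matrix)
print(outputs)
```

In a CIM array the inner sum is what gets parallelized: all rows of a column contribute at once instead of being accumulated one by one.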

#### **Hardware Accelerators for AI**

- GPU still dominates the training in cloud, FPGA is good for inference for fast prototyping
- TPU (or similar digital ASIC) is ramping up in cloud as well as edge



- To further improve energy efficiency (TOPS/W), analog CIM (possibly with eNVMs) is promising, especially for edge inference where the model is pre-trained.
- A CIM chip could also support incremental learning with continuous (possibly unlabeled) new data (e.g. with reinforcement learning) when deployed in the field.

#### **CIM Basics: Mixed-Signal Compute**



- An 8-bit weight may need 8 SRAM cells, and shift-add
- An 8-bit weight may need 2 1T1R cells (if each cell is 4b/cell), and shift-add
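The cell counts above can be sketched as bit slicing followed by shift-add. This is an illustrative sketch assuming unsigned weights; the helper names `slice_weight` and `shift_add` and the example values are hypothetical, not from the slides.

```python
def slice_weight(weight, bits_per_cell, total_bits=8):
    """Split an unsigned weight into per-cell slices, least significant slice first."""
    assert 0 <= weight < (1 << total_bits)
    assert total_bits % bits_per_cell == 0
    mask = (1 << bits_per_cell) - 1
    n_cells = total_bits // bits_per_cell
    return [(weight >> (i * bits_per_cell)) & mask for i in range(n_cells)]


def shift_add(partial_products, bits_per_cell):
    """Recombine per-slice products by shifting each one to its bit position."""
    return sum(p << (i * bits_per_cell) for i, p in enumerate(partial_products))


weight, activation = 0xB7, 3  # hypothetical 8-bit weight and input

sram_slices = slice_weight(weight, bits_per_cell=1)  # 8 cells at 1 bit/cell
rram_slices = slice_weight(weight, bits_per_cell=4)  # 2 cells at 4 bits/cell

# Each cell multiplies the input by its slice; shift-add restores the full product.
result_sram = shift_add([activation * s for s in sram_slices], bits_per_cell=1)
result_rram = shift_add([activation * s for s in rram_slices], bits_per_cell=4)
print(len(sram_slices), len(rram_slices), result_sram, result_rram)
```

Both mappings give the same product; the multi-bit cell simply needs fewer cells and fewer shift-add steps per weight.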

#### Digital vs. Near-Memory vs. CIM Accelerator



**TPU-like digital accelerator:** PE only has MAC units such as multiplier and adders, while the data (both activation and weights) are accessed by shared global buffer (e.g. SRAM cache)  $\rightarrow$  Single row access, slow and inefficient

Weights are stored in the memory array, while the activations are loaded in as inputs to the WLs

**Near-memory compute:** Row-by-row access with digital adders at the periphery

**In-memory compute (CIM):** Parallel access and ADC for partial sum quantization
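The difference between the two memory-centric schemes can be sketched for a single column. This is a behavioral illustration only, assuming binary inputs and a uniform ADC (an assumption of this sketch); the function names and values are hypothetical.

```python
def near_memory_column(inputs, column):
    """Row-by-row access: one cell is read per step and accumulated digitally."""
    acc = 0
    for x, w in zip(inputs, column):
        acc += x * w  # exact digital addition at the periphery
    return acc


def cim_column(inputs, column, adc_bits, max_cell_value=1):
    """Parallel access: all rows sum at once, then an ADC quantizes the partial sum."""
    analog_sum = sum(x * w for x, w in zip(inputs, column))
    full_scale = len(column) * max_cell_value  # largest sum for binary inputs
    levels = (1 << adc_bits) - 1               # uniform ADC steps (assumed)
    code = round(analog_sum / full_scale * levels)
    return code * full_scale / levels          # value represented by the ADC code


inputs = [1, 1, 0, 1, 1, 0, 1, 1]
column = [1, 1, 1, 1, 1, 1, 0, 1]

exact = near_memory_column(inputs, column)
coarse = cim_column(inputs, column, adc_bits=3)
fine = cim_column(inputs, column, adc_bits=8)
print(exact, coarse, fine)
```

The near-memory result is exact but takes one step per row; the CIM result arrives in one step but carries a quantization error that shrinks as ADC precision grows.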


### **Electronic Synapses and Neurons**

- Inspirations from biology and neuroscience (figure labels: pre-synaptic neuron, synapse, dendrites, post-synaptic neuron)
- Mathematical formulation in machine learning



#### **Abstractions for device engineers:**

**Synapses:** local memories that carry weights

- $\rightarrow$  Multi-bit memories
- 1) Two-terminal resistor
- 2) Three-terminal transistor (biased in the linear region)



**Neurons:** simple thresholding compute units

- $\rightarrow$  Threshold switches
- 1) Abrupt switching in I-V
- 2) Returns to off-state at zero voltage (not memory)



#### Landscape of Analog Multi-bit Memories



Partial switching in these materials leads to analog multi-bit memories that serve as synaptic weights.

RRAM and PCM are more current driven, and FeFET is electric field driven (less energy!).

STT-MRAM/SOT-MRAM can be used as a binary synapse in principle, and electrochemical random access memory (ECRAM) is still immature. Therefore, we will not discuss these candidates in this short course.

## **Key Device Properties for Training**

• Symmetry and linearity in weight update



- Figure: normalized conductance vs. pulse number (ideal linear weight update for reference), and CIFAR-10 accuracy vs. nonlinearity level (NL) and polarity of P/D
- Asymmetry (with nonlinearity) is the primary cause of in-situ training accuracy degradation.
- Algorithmic techniques such as momentum [1] have been introduced to compensate for the accuracy loss.

[1] S. Huang, et al. DATE 2020
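To see why asymmetry hurts training, here is a small sketch using a simple soft-bounds update model. This model is an assumption chosen for illustration and is not necessarily the one behind the slide's figure; all names and parameter values are hypothetical. With an ideal linear device, n potentiation pulses followed by n depression pulses return the weight to its starting value; with a nonlinear, asymmetric device they do not.

```python
def potentiate(g, alpha, g_max=1.0):
    """Assumed soft-bounds potentiation: step shrinks as g approaches g_max."""
    return g + alpha * (g_max - g)


def depress(g, alpha, g_min=0.0):
    """Assumed soft-bounds depression: step shrinks as g approaches g_min."""
    return g - alpha * (g - g_min)


def up_then_down(g0, n, alpha):
    """Apply n potentiation pulses, then n depression pulses."""
    g = g0
    for _ in range(n):
        g = potentiate(g, alpha)
    for _ in range(n):
        g = depress(g, alpha)
    return g


def ideal_up_then_down(g0, n, step):
    """Same pulse sequence on an ideal device with a constant step size."""
    g = g0
    for _ in range(n):
        g += step
    for _ in range(n):
        g -= step
    return g


nonlinear_result = up_then_down(0.5, n=10, alpha=0.1)
linear_result = ideal_up_then_down(0.5, n=10, step=0.02)
print(round(nonlinear_result, 4), round(linear_result, 4))
```

The residual drift of the nonlinear device acts like a systematic error added to every gradient update, which is the kind of error that accumulates during in-situ training.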

### **Key Device Properties for Inference**

After training, the weights should be stable over time for inference (read only)



- Read stress or disturb
- Retention at high temperature
- Intermediate state stability is the key concern

X. Peng, et al. IEDM 2019

Figure axis: Time (sec), 10<sup>1</sup> to 10<sup>9</sup>

### Multi-bit RRAM



Ultimate Goal: Engineer for multiple weak filaments instead of a single strong filament



A varying-pulse-amplitude scheme in the gradual reset regime converges the conductance to arbitrary analog target levels within the dynamic range.



L. Gao, et al. IEEE EDL vol. 36, no. 11, pp. 1157–1159, 2015.









P. Wang, et al. TED 2020

## FeFET (History Effect Physics and Mitigation)



Multiple domains have variations in coercive field (Eco). On the large loop, the S2 state has more hard domains, so flipping from S2 to S1 needs a higher field (E3) than on the minor loop (E1).

Mitigation: always erase (to the ground state) before programming (to an intermediate state), to ensure operation on the saturation loop.



P. Wang, et al. TED 2020

#### **Oscillation Neuron based on Threshold Switch**




#### **RRAM Test Vehicle for Multilevel Characterization**





Multilevel RRAM for Storage Memory


- Winbond HfOx RRAM at 90nm (C. Ho, et al. IEDM 2017)
- RRAM is integrated between M1 and M2
- Originally developed for binary cell operation, now explored for multilevel operations
- Variability and reliability are characterized on 256x256 test vehicle with CMOS decoder
- For MLC storage, the tail-to-tail gap is important; for compute-in-memory, a small deviation around the center of each state is important. Therefore, the requirement is more stringent for analog synapses.

#### Write-Verify Protocol to Tighten RRAM States



#### **3-bit Weight Programming on RRAM Array**

- 4kb cells tested for each state
- Write-verify loop number N



Y. Luo, et al., TED 2020
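A behavioral sketch of a write-verify loop with a loop limit N, for 3-bit (8-level) programming. The device response used here (each pulse moves the conductance halfway to the target, plus Gaussian noise) is a made-up assumption for illustration, not the measured RRAM behavior; names and parameter values are hypothetical.

```python
import random


def write_verify(target, tolerance, max_loops, rng):
    """Pulse toward `target` (normalized conductance), reading back after each pulse.

    Returns (final conductance, number of write pulses used, success flag).
    """
    g = rng.random()  # unknown initial state
    for loops_used in range(max_loops + 1):
        if abs(g - target) <= tolerance:  # verify step: read and compare
            return g, loops_used, True
        if loops_used == max_loops:       # loop budget N exhausted
            break
        # Assumed device response: move halfway to the target, with write noise.
        g += 0.5 * (target - g) + rng.gauss(0.0, 0.005)
        g = min(1.0, max(0.0, g))         # stay inside the dynamic range
    return g, max_loops, False


rng = random.Random(0)                    # fixed seed for repeatability
levels = [i / 7 for i in range(8)]        # 8 target states for a 3-bit weight
results = [write_verify(t, tolerance=0.03, max_loops=20, rng=rng) for t in levels]

for t, (g, n, ok) in zip(levels, results):
    print(f"target={t:.3f} final={g:.3f} loops={n} ok={ok}")
```

Tightening `tolerance` narrows each state's distribution at the cost of more loops, which is the trade-off the loop number N controls.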

## Multilevel RRAM Stability (for Inference)




#### **DNN+NeuroSim Framework Overview**

- Integration of NeuroSim with PyTorch and TensorFlow
  - An end-to-end framework to benchmark configurable CIM-based hardware accelerators
- NeuroSim Core
  - Built upon a hierarchy of chip/tile/PE/subarray with all the necessary peripheral circuitry
  - Technology parameters calibrated with PTM model from 130nm to 7nm
  - Reports energy efficiency, throughput, area and memory utilization
- Python Wrapper
  - Defines arbitrary deep neural network and reports inference/training accuracy
  - Introduced device retention model and ADC quantization effects for inference
  - Introduced device nonlinearity/asymmetry and variation effects for training
- DNN+NeuroSim V1.3 for inference
  - Github: <u>https://github.com/neurosim/DNN\_NeuroSim\_V1.3</u>
- DNN+NeuroSim V2.1 for training
  - Github: <u>https://github.com/neurosim/DNN\_NeuroSim\_V2.1</u>
- Community: more than 300 users including Intel, TSMC, Samsung, and SK Hynix

#### **DNN+NeuroSim Key Features**



X. Peng, et al. IEDM 2019

#### **DNN+NeuroSim Methodologies**

Algorithm accuracy estimation based on WAGE method

- Hardware-aware quantization for weight, activation, gradient, error, as well as partial sum quantization based on ADC precision.
- Support various network models for CIFAR-10/-100 and ImageNet

Hardware metrics estimation based on analytic models that are calibrated with SPICE at module-level.

- Analog modules (e.g. ADC) calibrated with Cadence custom simulation;
- Digital modules estimated with standard cell area and logic gate delay/dynamic power/leakage power;
- Interconnect modules (e.g. H-tree) estimated with parasitic RC delay and power;



## **Analysis on ADC Precision**

- Inference Accuracy of VGG-8 (8-bit weight) on CIFAR-10
  - Sweep device precision & synaptic array size
  - Sweep ADC precision (non-linear quantization)



Figure: partial-sum distribution (count vs. partial sum, from -128 to 128) with ADC reference levels; summing 256 cells at 4-bit/cell gives an (8+4)-bit partial sum.
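The (8+4)-bit figure for 256 cells at 4-bit/cell can be checked with a short calculation. The function name is hypothetical, and the sketch assumes unsigned cell values and binary (1-bit) inputs by default.

```python
def partial_sum_bits(rows, bits_per_cell, input_bits=1):
    """Bits needed to represent the largest possible column partial sum exactly."""
    max_cell = (1 << bits_per_cell) - 1
    max_input = (1 << input_bits) - 1
    max_sum = rows * max_cell * max_input
    return max_sum.bit_length()


full_precision = partial_sum_bits(rows=256, bits_per_cell=4)
print(full_precision)  # log2(256) + 4 bits

# Hypothetical ADC precision, to show how many bits are lost to quantization.
adc_bits = 5
print(full_precision - adc_bits)
```

The gap between the full-precision width and the ADC precision is why the ADC precision has to be swept against array size and cell precision.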

## **Benchmark for Compute-in-Memory (Inference)**

VGG-8 (8-bit activation; 8-bit weight) on CIFAR-10, with novel weight mapping and dataflow

| Technology node (LSTP) | 7nm | 7nm | 22nm | 22nm | 22nm | 22nm | 22nm | 22nm |
|------------------------|-----|-----|------|------|------|------|------|------|
| Device | 8T-SRAM | TPU-like<br>(Google) | 8T-SRAM | RRAM<br>(Winbond) | RRAM<br>(Tsinghua) | PCM (IBM) | Si:HfO2<br>FeFET (GF) | TPU-like<br>(Google) |
| MLSA-ADC precision | 4-bit | 8-bit digital | 4-bit | 5-bit | 5-bit | 5-bit | 5-bit | 8-bit digital |
| Memory Cell Precision | 1-bit | MAC | 1-bit | 2-bit | 4-bit | 4-bit | 4-bit | MAC |
| R<sub>on</sub> (Ω) | | \ | | 6k | 100k | 40k | 240k | \ |
| On/Off Ratio | \ | | | 150 | 10 | 12.5 | 100 | \ |
| Inference Accuracy (%) | 92% | | 91% | | | 92% | | |
| Area (mm<sup>2</sup>) | 13.61 | 15.71 | 59.05 | 60.78 | 33.26 | 33.39 | 32.69 | 107.05 |
| Memory Utilization (%) | | 39.36% | 98.73% | 96.86% | 93.47% | 93.47% | 93.47% | 39.36% |
| L-by-L Latency (ms) | 0.63 | 0.53 | 0.76 | 0.79 | 0.61 | 0.61 | 0.61 | 0.75 |
| L-by-L Dynamic Energy (µJ) | 22.86 | 526.21 | 56.56 | 33.14 | 17.24 | 17.85 | 16.28 | 687.25 |
| L-by-L Leakage Power (mW) | 1.47 | 75.96 | 1.11 | 0.17 | 0.09 | 0.09 | 0.09 | 5.27 |
| Energy Efficiency (TOPS/W) | 51.100 | 1.110 | 21.360 | 36.980 | 71.170 | 68.730 | 75.360 | 0.690 |
| Compute Efficiency (TOPS/mm<sup>2</sup>) | 0.144 | 0.075 | 0.027 | 0.026 | 0.061 | 0.060 | 0.060 | 0.004 |

- Emerging NVMs outperform SRAM at the same tech node (e.g. at 22nm)
- Increasing on-state resistance (Ron) to >100k $\Omega$  is critical to improve the energy efficiency (TOPS/W)
- FeFET is promising due to high R<sub>on</sub> that is modulated by the gate voltage bias
- 7nm SRAM (if compute-in-memory) still achieves the best compute efficiency with area scaling advantage
- Compared to IEDM 2019 results, here we added the level shifter module for NVMs that need high write voltage

#### **Hybrid NVM+Capacitor for Training**



#### **Cap Leakage and Endurance Requirement**



- Capacitor leakage of 10 fA or below is needed to keep the retention time above the ms range and to ensure no training accuracy loss
- Oxide channel transistor may be preferred with low leakage and large drive voltage to program NVM.
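A back-of-the-envelope sketch of how leakage sets capacitor retention, using the constant-current relation t = C·ΔV / I. The capacitance (1 fF) and tolerable voltage droop (10 mV) are assumed values picked for illustration only; they are not from the slides.

```python
def retention_time_s(capacitance_f, allowed_droop_v, leakage_a):
    """Time for a constant leakage current to discharge the cap by the allowed droop."""
    return capacitance_f * allowed_droop_v / leakage_a


CAP_F = 1e-15    # assumed storage capacitance: 1 fF (hypothetical)
DROOP_V = 0.01   # assumed tolerable voltage droop: 10 mV (hypothetical)

t_10fa = retention_time_s(CAP_F, DROOP_V, leakage_a=10e-15)  # 10 fA leakage
t_1pa = retention_time_s(CAP_F, DROOP_V, leakage_a=1e-12)    # 1 pA leakage
print(f"{t_10fa:.2e} s at 10 fA, {t_1pa:.2e} s at 1 pA")
```

Under these assumed values, 10 fA gives about a millisecond of retention, and retention scales inversely with leakage.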

| Dataset  | Number of writes (HPS) | Number of writes (pure NVM) |
|----------|------------------------|-----------------------------|
| CIFAR-10 | 750                    | 37,500                      |
| ImageNet | 20,000                 | 6,250,000                   |

Transfer interval = 10k images as example

For HPS, # writes = # images per epoch \* # epochs / transfer interval

- CIFAR-10: 50k \* 150 epochs / 10k = 750
- ImageNet: 1M \* 200 epochs / 10k = 20,000

For pure NVM, # writes = # images per epoch \* # epochs / batch size

- CIFAR-10: 50k \* 150 epochs / 200 = 37,500
- ImageNet: 1M \* 200 epochs / 32 = 6,250,000
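The write counts follow directly from the two formulas; a short script reproduces them. The function names are hypothetical, and the numbers are the ones given on the slide.

```python
def writes_hps(images_per_epoch, epochs, transfer_interval):
    """NVM writes for the hybrid scheme: one write per transfer interval."""
    return images_per_epoch * epochs // transfer_interval


def writes_pure_nvm(images_per_epoch, epochs, batch_size):
    """NVM writes for pure NVM training: one write per batch."""
    return images_per_epoch * epochs // batch_size


TRANSFER_INTERVAL = 10_000  # images, as in the slide's example

cifar_hps = writes_hps(50_000, 150, TRANSFER_INTERVAL)
imagenet_hps = writes_hps(1_000_000, 200, TRANSFER_INTERVAL)
cifar_nvm = writes_pure_nvm(50_000, 150, batch_size=200)
imagenet_nvm = writes_pure_nvm(1_000_000, 200, batch_size=32)

print(cifar_hps, imagenet_hps, cifar_nvm, imagenet_nvm)
```

The ratio between the two schemes is simply transfer interval over batch size, so a longer transfer interval relaxes the endurance requirement proportionally.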


### State-of-the-Art Industrial Emerging NVMs

• A survey of the industrial platforms (developed for embedded memories, not necessarily tailored for synaptic weights)





TSMC 40nm RRAM (ISSCC 18)



Intel 22nm RRAM (ISSCC 19)

**STT-MRAM:** Intel 22nm STT (ISSCC 19)

**FeFET:** GF 28nm and 22nm FeFET (IEDM 16 & 17)

#### **Summary of RRAM-based CIM Macros**

|                     | ISSCC' 18<br>NTHU | ISSCC' 19<br>NTHU | ISSCC' 20<br>NTHU | TED' 20<br>ASU/GT | SSCL' 20<br>ASU/GT |
|---------------------|-------------------|-------------------|-------------------|-------------------|--------------------|
| Technology (nm)     | 65                | 55                | 22                | 90                | 90                 |
| No. of bit per cell | 1                 | 1                 | 1                 | 1                 | 2                  |
| Subarray size       | 512×256           | 256×512           | 512×512           | 128×64            | 128×64             |
| Capacity            | 1Mb               | 1Mb               | 2Mb               | 8Kb               | 8Kb                |
| Precision(I,W,O)    | 1,1,3             | 2,3,4             | 4,4,11            | 1,1,3             | 1,2,1              |
| Column sensing      | 3b ADC            | 4b ADC            | 6b ADC            | 3b ADC            | 1b SA              |
| # of rows turned on | 9                 | 9                 | 16                | 64                | 64                 |
| Supported algorithm | CNN               | CNN               | CNN               | CNN               | CNN                |
| Energy efficiency   | 0.6 TOPS/W        | 2.05 TOPS/W       | 3.79 TOPS/W       | 0.38 TOPS/W       | 1.61 TOPS/W        |
| Accuracy            | 98% (MNIST)       | 88.5% (CIFAR10)   | 90.18% (CIFAR10)  | 83.5% (CIFAR10)   | 87.1% (CIFAR10)    |

Note: TOPS/W is normalized to an 8-bit by 8-bit MAC (1b MAC = 2 ops)

TOPS/W is lower than the NeuroSim prediction, due to 1) older tech nodes, and 2) only a fraction of the rows being turned on at a time

#### Secure-RRAM CIM Prototype Chip (TSMC 40nm)




| Technology                                   | TSMC 40nm w/ RRAM                   |                                       |
|----------------------------------------------|-------------------------------------|---------------------------------------|
| Array size                                   | 128 x 128b                          |                                       |
| Weight precision<br>(bits)                   | 1, 2, 4, or 8                       |                                       |
| Rows turned on<br>simultaneously             | 7                                   |                                       |
| Operating voltage                            | 0.9V                                |                                       |
| Clock frequency                              | 100MHz                              |                                       |
|                                              | 0% Input Sparsity                   | 95% Input Sparsity                    |
| Compute efficiency<br>(GOPS/mm<sup>2</sup>)  | 36.01 (1x1b MAC)<br>4.50 (1x8b MAC) | 100.80 (1x1b MAC)<br>12.60 (1x8b MAC) |
| Energy efficiency<br>(TOPS/W)                | 8.48 (1x1b MAC)<br>1.06 (1x8b MAC)  | 56.10 (1x1b MAC)<br>7.01 (1x8b MAC)   |
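The 1x8b entries in the table are consistent with dividing the 1x1b entries by the weight bit width. The sketch below checks that arithmetic; the interpretation (an 8-bit weight costs eight 1-bit MACs) is an assumption inferred from the numbers, and the function name is hypothetical.

```python
def scale_to_weight_bits(value_1x1b, weight_bits):
    """Scale a 1x1b MAC efficiency figure to a 1xNb MAC, assuming N one-bit MACs per weight."""
    return value_1x1b / weight_bits


# 1x1b figures from the table: (0% sparsity, 95% sparsity)
compute_1b = (36.01, 100.80)  # GOPS/mm^2
energy_1b = (8.48, 56.10)     # TOPS/W

compute_8b = [scale_to_weight_bits(v, 8) for v in compute_1b]
energy_8b = [scale_to_weight_bits(v, 8) for v in energy_1b]

print([round(v, 2) for v in compute_8b])
print([round(v, 2) for v in energy_8b])
```

The same scaling would apply to the 2-bit and 4-bit weight modes under this assumption.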

| Performance on<br>VGG-8                      | Sparsity Control<br>Enabled | Sparsity Control<br>Disabled |
|----------------------------------------------|-----------------------------|------------------------------|
| CIFAR-10 accuracy                            | 90.4%                       | 91.9%                        |
| Compute efficiency<br>(GOPS/mm<sup>2</sup>)  | 83.50 (1x1b MAC)            | 36.01 (1x1b MAC)             |
| Energy efficiency<br>(TOPS/W)                | 36.39 (1x1b MAC)            | 8.48 (1x1b MAC)              |

2<sup>nd</sup>-gen RRAM CIM chip taped-out (May 2021)

## **Challenges for RRAM-CIM Chip Design**

- Low R<sub>on</sub> → Large column current → Analog MUX at the end of the column must be sized up → Poor area efficiency
- High Vw → Large transistor needed for 1T1R cell → Bit cell size may be >30F<sup>2</sup>
- High Vw → Significant area for the level shifters
- ADC area/power bottleneck → Multiple columns share one ADC → Time multiplexing required → Reduced throughput
- Process variation → ADC offset → Inaccurate partial sum computation → Inference accuracy degradation



## **Summary and Outlook**

- NVM (RRAM, PCM, and FeFET) can be tuned to multilevel states (possibly by iterative write-verify), and read-intensive inference is the most suitable application, with advantages over SRAM (e.g. low leakage and non-volatility) for edge intelligence.
- FeFET is the most promising candidate, with features such as improved on-state resistance (>100kΩ) with gate biasing, low write energy (~fJ/bit) due to field-driven switching, fast read/write speed (~10ns), and 2-5 bit/cell potential. Array-level test vehicles (e.g. GF 28nm) need to be built for characterizing statistics.
- NVM-based inference engines still face challenges such as high write voltage, low on-state resistance, ADC overhead, intermediate state stability, and inference accuracy degradation caused by process variation.
- DNN+NeuroSim is an integrated framework for benchmarking different CIM technologies that is open source to the research community.

#### Acknowledgement

**Students/Postdoc:** Xiaochen Peng, Yandong Luo, Wonbo Shim, Panni Wang, Hongwu Jiang, Xiaoyu Sun, etc.

