All Analog CNN Accelerator with RRAMs for Fast Inference

Bibliographic Details
Main Author: Chao, Minghan
Other Authors: Shulaker, Max
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/146297
Description
Summary: As AI applications become more prevalent and powerful, the performance demands on deep neural networks are growing, and the need for fast, energy-efficient circuits to compute them is urgent. Most current research proposes dedicated hardware that is reused thousands of times. However, while reusing the same hardware to perform the same computation repeatedly saves area, it comes at the expense of execution time. This presents another critical obstacle, as the need for rapid AI on real data requires a fundamentally faster approach to implementing neural networks. The focus of this thesis is to duplicate the key operation, the multiply-and-accumulate (MAC) computation unit, in hardware so that there is no hardware reuse, enabling the entire neural network to be physically fabricated on a single chip. Because neural networks today often require hundreds of thousands to tens of millions of MAC computation units, the MAC units must be designed to be as small as possible so that all of the operations fit on chip. Here, we present an initial analysis of a convolutional neural network (CNN) accelerator that implements such a system, optimized for inference speed. The accelerator duplicates all of the computation hardware, eliminating the need to move data back and forth, as is required when the same hardware is reused. We propose a novel design for memory cells using resistive random access memory (RRAM) and for computation units that exploit the analog behavior of transistors. The circuit classifies one CIFAR-10 image in 6 µs (160k frames/s) with 2.4 µJ of energy per classification and an accuracy of 85%. It contains 7.5 million MAC units and achieves a density of 5 million MAC/mm².
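The figures quoted in the summary imply a few derived quantities that the record does not state. The following Python sketch is a back-of-envelope calculation only: the function name `derived_figures` and every derived value are illustrative assumptions, not results reported in the thesis. It assumes that each MAC unit performs exactly one MAC per classification (consistent with the stated absence of hardware reuse) and that images are classified back to back without overlap.

```python
# Figures quoted in the summary.
LATENCY_S = 6e-6             # time to classify one CIFAR-10 image
ENERGY_J = 2.4e-6            # energy per classification
MAC_UNITS = 7.5e6            # MAC units on the chip
MAC_DENSITY_PER_MM2 = 5e6    # MAC units per square millimetre


def derived_figures(latency_s, energy_j, mac_units, mac_density_per_mm2):
    """Back-of-envelope quantities implied by the quoted figures (hypothetical helper)."""
    return {
        # Reciprocal of the latency; assumes one image at a time, no pipelining.
        "frames_per_s": 1.0 / latency_s,
        # Area occupied by the MAC units alone (excludes any other circuitry).
        "mac_area_mm2": mac_units / mac_density_per_mm2,
        # Assumes every MAC unit fires exactly once per classification.
        "energy_per_mac_j": energy_j / mac_units,
        # Same one-MAC-per-unit assumption, spread over one latency period.
        "macs_per_s": mac_units / latency_s,
        # Average power if classifications run back to back.
        "avg_power_w": energy_j / latency_s,
    }


figures = derived_figures(LATENCY_S, ENERGY_J, MAC_UNITS, MAC_DENSITY_PER_MM2)
for name, value in figures.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the MAC units occupy about 1.5 mm², each MAC costs roughly 0.32 pJ, and the average power is about 0.4 W. The reciprocal of 6 µs is about 167k frames/s, slightly above the quoted 160k frames/s; the difference may reflect rounding of the latency.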