Deep learning acceleration: from quantization to in-memory computing


Bibliographic Details
Main Author: Zhu, Shien
Other Authors: Weichen Liu
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2022
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Hardware::Arithmetic and logic structures; Engineering::Computer science and engineering::Computer systems organization::Special-purpose and application-based systems
Online Access: https://hdl.handle.net/10356/163448
Description

Deep learning has demonstrated high accuracy and efficiency in various applications. For example, Convolutional Neural Networks (CNNs), widely adopted in Computer Vision (CV), and Transformers, broadly applied in Natural Language Processing (NLP), are representative deep learning models. These models have grown deeper and larger in the past few years to obtain higher accuracy, which makes inference on the edge challenging: compute-intensive and memory-intensive models are not only bounded by limited computational resources but also suffer from the long latency and high energy cost of heavy memory access. Accelerating deep learning inference on the edge therefore needs software/hardware co-optimization.

From the software perspective, thanks to the fault-tolerant nature of deep learning models, quantizing 32-bit values to low-bitwidth values effectively reduces both the model size and the computational complexity. Ternary and binary neural networks are representative quantized networks, achieving 16-32X model size reduction and up to 64X theoretical speedup. However, their low-bitwidth storage schemes and arithmetic operations remain inefficient on Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms: existing ternary and binary encoding schemes are complex and incompatible with each other, current ternary and binary dot products contain redundant operations, and mixed-precision ternary-binary dot products are missing. Among various deep learning models, the Ternary Weight Network (TWN) and the Adder Neural Network (AdderNet) are two other promising networks with higher accuracy than ternary and binary neural networks. Moreover, compared with integer-quantized and full-precision models, TWN and AdderNet have a unique advantage: they replace multiplication with lightweight addition and subtraction operations, which are favoured by In-Memory Computing (IMC) architectures.

From the hardware perspective, IMC architectures compute inside Non-Volatile Memory (NVM) arrays to reduce the data movement overhead. They perform addition and Boolean operations in parallel, which makes them well suited to accelerating addition-centric models such as TWNs and AdderNet. However, the addition and subtraction operators and the data mapping schemes for deep learning models on existing IMC designs are not fully optimized.

In this thesis, we accelerate deep learning inference from both the software and the hardware perspectives. On the software side, we propose TAB to accelerate quantized ternary and binary deep learning models on the edge. First, we propose a unified value representation based on standard signed integer encoding. Second, we introduce a bitwidth-last data storage format that avoids the overhead of extracting the sign bit. Third, we propose ternary and binary bitwise dot products based on Gated-XOR, which use 25% to 61% fewer operations than State-Of-The-Art (SOTA) methods. Finally, we implement TAB on both CPU and GPU platforms as an open-source library with optimized bitwise kernels. Experimental results show that TAB's ternary and binary neural networks achieve 34.6X to 72.2X speedup over full-precision networks.
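To make the quantization and bitwise-kernel ideas concrete, the sketch below shows the standard formulations they build on: sign-based binarization, threshold-based ternarization as popularized by Ternary Weight Networks, and the well-known XOR/XNOR-plus-popcount form of the binary dot product. It is only an illustrative NumPy sketch under those common conventions; it does not reproduce TAB's unified encoding, bitwidth-last storage, or Gated-XOR kernels, and the 0.7 threshold ratio is just a frequently used heuristic.

```python
# Illustrative sketch (not TAB's encoding or kernels): standard binary/ternary
# quantization and a bitwise binary dot product via XOR and popcount.
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """Binary quantization: keep only the sign, so each weight is +1 or -1."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

def ternarize(w: np.ndarray, delta_ratio: float = 0.7) -> np.ndarray:
    """Threshold-based ternary quantization (as in Ternary Weight Networks):
    weights close to zero become 0; the rest keep their sign (+1 / -1)."""
    delta = delta_ratio * np.mean(np.abs(w))   # per-tensor threshold heuristic
    q = np.sign(w).astype(np.int8)
    q[np.abs(w) < delta] = 0
    return q

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as bitmasks
    (bit 1 encodes +1, bit 0 encodes -1):
    dot = n - 2 * popcount(a XOR b), equivalently 2 * popcount(XNOR) - n."""
    diff = (a_bits ^ b_bits) & ((1 << n) - 1)  # positions where signs differ
    return n - 2 * bin(diff).count("1")

# Quick check against the full-precision dot product on random sign vectors.
rng = np.random.default_rng(0)
n = 16
a = np.where(rng.random(n) > 0.5, 1, -1)
b = np.where(rng.random(n) > 0.5, 1, -1)
pack = lambda v: int("".join("1" if x == 1 else "0" for x in v), 2)
assert binary_dot(pack(a), pack(b), n) == int(np.dot(a, b))
```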
Next, on the hardware side, we propose FAT, an in-memory accelerator for TWNs, with three contributions: a fast addition scheme that avoids the time overhead of carry propagation and write-back, a sparse addition control unit that exploits weight sparsity to skip operations on zero weights, and a combined-stationary data mapping that reduces data movement and increases parallelism across memory columns. Compared with SOTA IMC accelerators, FAT achieves 10.02X speedup and 12.19X higher energy efficiency on networks with 80% average sparsity.

Finally, we propose iMAD, an in-memory accelerator for AdderNet. First, we co-optimize the in-memory subtraction and addition operators to reduce latency, energy, and sensing-circuit area. Second, we design a highly parallel accelerator architecture for AdderNet based on the optimized operators. Third, we propose an IMC-friendly computation pipeline for AdderNet convolution at the algorithm level to further boost performance. Evaluation results show that iMAD achieves 3.25X speedup and 3.55X higher energy efficiency compared with a SOTA in-memory accelerator.

In summary, we accelerate deep learning models through software/hardware co-design. We propose a unified and optimized ternary and binary inference framework with unified encoding, optimized data storage, efficient bitwise dot products, and a programming library on existing CPU and GPU platforms. We further propose two hardware accelerators for TWNs and AdderNet with optimized operators, architectures, algorithms, and data mapping schemes on emerging in-memory computing platforms. In the future, we will extend the in-memory computing architectures to accelerate other types of deep learning models, such as Transformers, and we will investigate general-purpose in-memory computing by integrating lightweight RISC-V CPU cores with computational memory arrays.
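As a closing illustration of the property that both hardware accelerators exploit, the sketch below shows why TWN and AdderNet layers are addition-centric: a ternary dot product needs only additions, subtractions, and zero-skipping (the sparsity FAT's control unit targets), and AdderNet replaces the convolution's multiply-accumulate with a negative L1 distance. This is a minimal NumPy illustration of the published formulations, not of FAT's or iMAD's in-memory implementations; the function names and shapes here are chosen only for clarity.

```python
# Why TWNs and AdderNet are addition-centric: no multiplications are needed.
import numpy as np

def ternary_dot(x: np.ndarray, w_ternary: np.ndarray) -> float:
    """Dot product with ternary weights {-1, 0, +1}: only additions and
    subtractions, and zero weights can be skipped entirely."""
    acc = 0.0
    for xi, wi in zip(x, w_ternary):
        if wi == 0:
            continue                 # skip: a zero weight contributes nothing
        acc += xi if wi > 0 else -xi
    return acc

def adder_response(x_patch: np.ndarray, w_filter: np.ndarray) -> float:
    """AdderNet similarity measure: the negative L1 distance between the input
    patch and the filter, built from subtractions, absolute values, and sums."""
    return -np.abs(x_patch - w_filter).sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(9)
w = rng.integers(-1, 2, size=9)          # ternary weights in {-1, 0, +1}
print(ternary_dot(x, w), np.dot(x, w))   # same value, without multiplications
print(adder_response(x, rng.standard_normal(9)))
```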
School: School of Computer Science and Engineering
Research Centre: Parallel and Distributed Computing Centre
Contact: liu@ntu.edu.sg
Citation: Zhu, S. (2022). Deep learning acceleration: from quantization to in-memory computing. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163448
DOI: 10.32657/10356/163448
Funding: MOE2019-T2-1-071; MOE2019-T1-001-072; M4082282; M4082087
Related DOIs: 10.21979/N9/DYKUPV; 10.21979/N9/RZ75BY; 10.21979/N9/JNFW9P; 10.21979/N9/XEH3D1
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).