Hardware-software co-exploration and optimization for next-generation learning machines

In an era dominated by the rapid evolution of Machine Learning (ML), particularly Deep Learning (DL), the efficient deployment of learning algorithms on power- and area-constrained hardware remains a paramount challenge. The scaling up of DL models to trillions of parameters and trillions of computa...


Bibliographic Details
Main Author: Chen, Chunyun
Other Authors: Mohamed M. Sabry Aly
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/178423
_version_ 1826115324870656000
author Chen, Chunyun
author2 Mohamed M. Sabry Aly
author_facet Mohamed M. Sabry Aly
Chen, Chunyun
author_sort Chen, Chunyun
collection NTU
description In an era dominated by the rapid evolution of Machine Learning (ML), particularly Deep Learning (DL), the efficient deployment of learning algorithms on power- and area-constrained hardware remains a paramount challenge. The scaling of DL models to trillions of parameters and trillions of computation operations outpaces the modest gains in energy efficiency and memory density derived from silicon scaling, making current DL hardware systems unsustainable. This thesis therefore delivers a comprehensive investigation into hardware-software co-design and optimization for next-generation learning machines, covering both special-function hardware and end-to-end full-workload hardware, together with the system-level impact of DL hardware accelerators. The key design metrics for next-generation learning machines are energy efficiency, performance, and area overhead. To enable DL workloads to run on resource-constrained hardware platforms, reducing the memory footprint is essential. One way to do so is through efficient entropy coding. Commonly employed Fixed-to-Variable (F2V) entropy coding methods, e.g., Huffman coding and Arithmetic coding, are hardware-unfriendly and cannot fully benefit from the reduced memory requirement they produce. We propose adopting Tunstall coding, a Variable-to-Fixed (V2F) coding scheme, for DNN model compression and introduce two Tunstall decoders, the Logic-Oriented and the Memory-Oriented decoders, achieving up to a 20× decrease in memory usage and a 100× reduction in energy consumption compared to 32-bit DNNs. Furthermore, these decoders process data 3× to 6× faster than F2V coding schemes. Beyond the convolutional layers of Convolutional Neural Networks (CNNs) and the Multi-Head Attention (MHA) of Transformers, DL workloads also contain non-linear operations that are not easily parallelizable, posing a challenge for hardware implementation. One of these, Non-Maximum Suppression (NMS), is a critical step in object detection frameworks and becomes a computational bottleneck when the frameworks are mapped onto hardware due to its computational intensity. Existing NMS optimizations do not parallelize effectively on ASIC platforms. The introduced ShapoolNMS overcomes this limitation. Enabled by both low computational complexity and hardware/software co-optimization, ShapoolNMS is up to 42,713× faster than conventional GreedyNMS software implementations. This thesis also looks beyond a single layer of the DL workload and introduces two end-to-end accelerators, one for entire CNN-based workloads and one for entire Transformer-based workloads. Current DL accelerators mainly target either the convolutional operations of CNNs or the MHA of Transformers; acceleration of the entire workload is less explored. This thesis introduces CNN-DLA, a chiplet-based scalable hardware accelerator for CNN-based models demonstrated on ResNet-152, and ViTA, an accelerator for the ViT workload. With the introduced cross-layer optimization dataflow, CNN-DLA reduces memory requirements by 84.85%, and the 44-chiplet configuration achieves 68 FPS on ResNet-152 with full-HD input images. Similarly, ViTA reduces memory requirements by 40.5% and delivers adaptable performance of 0.20-16.38 TOPS, with area and power consumption of 2.00-6.79 mm² and 0.22-10.40 W, respectively, making it suitable for diverse applications.
Additionally, we provide detailed guidelines for integrating the introduced accelerators into a real hardware platform, the PULPissimo System-on-Chip (SoC), including the interfaces, register map, and finite state machine (FSM) of the integrated accelerators. Overall, this thesis provides a foundation for scalable DL accelerators and for hardware-software co-design and co-exploration of learning machines. The introduced methods not only address current hardware limitations but also set a direction for sustainable and efficient DL hardware systems in the future.
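The record above contains no code, but the role of Tunstall coding can be illustrated with a small sketch. The Python below is a generic, illustrative Tunstall (V2F) table construction and decoder, not the Logic-Oriented or Memory-Oriented decoder designs from the thesis; the symbol alphabet, probabilities, and codeword width are hypothetical. It shows why V2F decoding is hardware-friendly: each fixed-length codeword is a direct index into a lookup table, so decoding needs no bit-serial tree traversal as in Huffman decoding.

import heapq

def build_tunstall_table(probs, codeword_bits):
    # Build a Tunstall (variable-to-fixed) parse table: each entry is a
    # variable-length string of source symbols, and its index in the table
    # is the fixed-length codeword that represents it.
    symbols = list(probs)
    max_entries = 2 ** codeword_bits
    # Min-heap keyed on negative probability, so the most probable leaf pops first.
    heap = [(-probs[s], (s,)) for s in symbols]
    heapq.heapify(heap)
    # Expanding one leaf replaces it with len(symbols) children,
    # growing the table by len(symbols) - 1 entries per step.
    while len(heap) + len(symbols) - 1 <= max_entries:
        neg_p, string = heapq.heappop(heap)
        for s in symbols:
            heapq.heappush(heap, (neg_p * probs[s], string + (s,)))
    return [string for _, string in sorted(heap)]

def tunstall_decode(codewords, table):
    # V2F decoding is one table lookup per fixed-length codeword, which is
    # what makes it friendlier to hardware than bit-serial F2V decoding.
    out = []
    for cw in codewords:
        out.extend(table[cw])
    return out

# Toy usage with a skewed, hypothetical symbol distribution.
table = build_tunstall_table({"a": 0.7, "b": 0.2, "c": 0.1}, codeword_bits=3)
print(tunstall_decode([0, 1, 2], table))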
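For context on the NMS bottleneck that ShapoolNMS addresses, the following Python sketches the conventional GreedyNMS baseline mentioned in the abstract (a textbook formulation, not code from the thesis). Its data-dependent outer loop, where each kept box must be selected before the next round of suppression can run, is what resists parallelization on ASIC platforms.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    # Reference GreedyNMS: keep the highest-scoring box, suppress boxes that
    # overlap it beyond the threshold, and repeat on the survivors.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep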
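The SoC-integration guidelines mentioned above concern interfaces, a register map, and a control FSM. The sketch below is a purely hypothetical software-level model of such an integration, with made-up register offsets and state names; it does not reproduce the actual PULPissimo register map or FSM described in the thesis.

from enum import Enum, auto

# Hypothetical memory-mapped register offsets for an accelerator wrapped as
# an SoC peripheral (illustrative only; not the thesis's register map).
REG_CTRL   = 0x00  # write 1 to start the accelerator
REG_STATUS = 0x04  # bit 0 = busy, bit 1 = done
REG_SRC    = 0x08  # input buffer base address
REG_DST    = 0x0C  # output buffer base address

class AccelState(Enum):
    IDLE = auto()
    LOAD = auto()     # fetch operands over the SoC interconnect
    COMPUTE = auto()  # run the accelerator datapath
    DONE = auto()     # signal completion back to the host core

def step(state, start, load_finished, compute_finished):
    # One transition of a simplified accelerator control FSM.
    if state is AccelState.IDLE and start:
        return AccelState.LOAD
    if state is AccelState.LOAD and load_finished:
        return AccelState.COMPUTE
    if state is AccelState.COMPUTE and compute_finished:
        return AccelState.DONE
    if state is AccelState.DONE:
        return AccelState.IDLE
    return state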
first_indexed 2024-10-01T03:53:38Z
format Thesis-Doctor of Philosophy
id ntu-10356/178423
institution Nanyang Technological University
language English
last_indexed 2024-10-01T03:53:38Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/178423 2024-07-05T03:11:43Z Hardware-software co-exploration and optimization for next-generation learning machines Chen, Chunyun Mohamed M. Sabry Aly College of Computing and Data Science msabry@ntu.edu.sg Computer and Information Science ASIC Transformer Vision transformer GELU Softmax LayerNorm Learning machine Doctor of Philosophy 2024-06-20T06:56:42Z 2024-06-20T06:56:42Z 2024 Thesis-Doctor of Philosophy Chen, C. (2024). Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 https://hdl.handle.net/10356/178423 10.32657/10356/178423 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
ASIC
Transformer
Vision transformer
GELU
Softmax
LayerNorm
Learning machine
Chen, Chunyun
Hardware-software co-exploration and optimization for next-generation learning machines
title Hardware-software co-exploration and optimization for next-generation learning machines
title_full Hardware-software co-exploration and optimization for next-generation learning machines
title_fullStr Hardware-software co-exploration and optimization for next-generation learning machines
title_full_unstemmed Hardware-software co-exploration and optimization for next-generation learning machines
title_short Hardware-software co-exploration and optimization for next-generation learning machines
title_sort hardware software co exploration and optimization for next generation learning machines
topic Computer and Information Science
ASIC
Transformer
Vision transformer
GELU
Softmax
LayerNorm
Learning machine
url https://hdl.handle.net/10356/178423
work_keys_str_mv AT chenchunyun hardwaresoftwarecoexplorationandoptimizationfornextgenerationlearningmachines