Hardware-software co-exploration and optimization for next-generation learning machines

In an era dominated by the rapid evolution of Machine Learning (ML), particularly Deep Learning (DL), the efficient deployment of learning algorithms on power- and area-constrained hardware remains a paramount challenge. The scaling up of DL models to trillions of parameters and trillions of computa...


Bibliographic Details
Main Author: Chen, Chunyun
Other Authors: Mohamed M. Sabry Aly
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/178423
_version_ 1826115324870656000
author Chen, Chunyun
author2 Mohamed M. Sabry Aly
author_facet Mohamed M. Sabry Aly
Chen, Chunyun
author_sort Chen, Chunyun
collection NTU
description In an era dominated by the rapid evolution of Machine Learning (ML), particularly Deep Learning (DL), the efficient deployment of learning algorithms on power- and area-constrained hardware remains a paramount challenge. The scaling of DL models to trillions of parameters and trillions of computation operations outpaces the modest gains in energy efficiency and memory density derived from silicon scaling, making current DL hardware systems unsustainable. This thesis therefore delivers a comprehensive investigation into hardware-software co-design and optimization for next-generation learning machines, covering both special-function hardware and end-to-end full-workload hardware, together with the system-level impact of DL hardware accelerators. The key design metrics for next-generation learning machines are energy efficiency, performance, and area overhead. To enable DL workloads to run on resource-constrained hardware platforms, reducing the memory footprint is essential. One way to do so is through efficient entropy coding. Commonly employed Fixed-to-Variable (F2V) entropy coding methods, e.g., Huffman coding and Arithmetic coding, are hardware-unfriendly and cannot fully benefit from the reduced memory requirement they produce. We propose adopting Tunstall coding, a Variable-to-Fixed (V2F) coding scheme, for DNN model compression and introduce two Tunstall decoders, the Logic-Oriented and the Memory-Oriented decoders, achieving up to a 20× decrease in memory usage and a 100× reduction in energy consumption compared to 32-bit DNNs. Furthermore, these decoders process data 3× to 6× faster than F2V coding schemes. Beyond the convolutional layers of Convolutional Neural Networks (CNNs) and the Multi-Head Attention (MHA) of Transformers, DL workloads also contain non-linear operations that are not easily parallelizable, posing a challenge for hardware implementation. One of these, Non-Maximum Suppression (NMS), is a critical step in object detection frameworks and becomes a computational bottleneck when the frameworks are mapped onto hardware due to its computational intensity. Existing NMS optimizations do not parallelize effectively on ASIC platforms. The introduced ShapoolNMS overcomes this limitation. Enabled by both low computational complexity and hardware/software co-optimization, ShapoolNMS is up to 42,713× faster than conventional GreedyNMS software implementations. This thesis also looks beyond a single layer of the DL workload and introduces two end-to-end accelerators, one for entire CNN-based workloads and one for entire Transformer-based workloads. Current DL accelerators mainly target either the convolutional operations of CNNs or the MHA of Transformers; acceleration of the entire workload is less explored. This thesis introduces CNN-DLA, a chiplet-based scalable hardware accelerator for CNN-based models demonstrated on ResNet-152, and ViTA, an accelerator for the ViT workload. With the introduced cross-layer optimization dataflow, CNN-DLA reduces memory requirements by 84.85%, and the 44-chiplet configuration achieves 68 FPS on ResNet-152 with full-HD input images. Similarly, ViTA reduces memory requirements by 40.5% and delivers adaptable performance of 0.20-16.38 TOPS, with area and power consumption of 2.00-6.79 mm² and 0.22-10.40 W, respectively, making it suitable for diverse applications.
Additionally, we provide detailed guidelines for integrating the introduced accelerators into a real hardware platform, the PULPissimo System-on-Chip (SoC), including the interfaces, register map, and finite state machine (FSM) of the integrated accelerators. Overall, this thesis provides a foundation for scalable DL accelerators and for hardware-software co-design and co-exploration of learning machines. The introduced methods not only address current hardware limitations but also set a direction for sustainable and efficient DL hardware systems in the future.
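The record above contains no code, but the role of Tunstall coding can be illustrated with a small sketch. The Python below is a generic, illustrative Tunstall (V2F) table construction and decoder, not the Logic-Oriented or Memory-Oriented decoder designs from the thesis; the symbol alphabet, probabilities, and codeword width are hypothetical. It shows why V2F decoding is hardware-friendly: each fixed-length codeword is a direct index into a lookup table, so decoding needs no bit-serial tree traversal as in Huffman decoding.

import heapq

def build_tunstall_table(probs, codeword_bits):
    # Build a Tunstall (variable-to-fixed) parse table: each entry is a
    # variable-length string of source symbols, and its index in the table
    # is the fixed-length codeword that represents it.
    symbols = list(probs)
    max_entries = 2 ** codeword_bits
    # Min-heap keyed on negative probability, so the most probable leaf pops first.
    heap = [(-probs[s], (s,)) for s in symbols]
    heapq.heapify(heap)
    # Expanding one leaf replaces it with len(symbols) children,
    # growing the table by len(symbols) - 1 entries per step.
    while len(heap) + len(symbols) - 1 <= max_entries:
        neg_p, string = heapq.heappop(heap)
        for s in symbols:
            heapq.heappush(heap, (neg_p * probs[s], string + (s,)))
    return [string for _, string in sorted(heap)]

def tunstall_decode(codewords, table):
    # V2F decoding is one table lookup per fixed-length codeword, which is
    # what makes it friendlier to hardware than bit-serial F2V decoding.
    out = []
    for cw in codewords:
        out.extend(table[cw])
    return out

# Toy usage with a skewed, hypothetical symbol distribution.
table = build_tunstall_table({"a": 0.7, "b": 0.2, "c": 0.1}, codeword_bits=3)
print(tunstall_decode([0, 1, 2], table))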
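For context on the NMS bottleneck that ShapoolNMS addresses, the following Python sketches the conventional GreedyNMS baseline mentioned in the abstract (a textbook formulation, not code from the thesis). Its data-dependent outer loop, where each kept box must be selected before the next round of suppression can run, is what resists parallelization on ASIC platforms.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    # Reference GreedyNMS: keep the highest-scoring box, suppress boxes that
    # overlap it beyond the threshold, and repeat on the survivors.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep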
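The SoC-integration guidelines mentioned above concern interfaces, a register map, and a control FSM. The sketch below is a purely hypothetical software-level model of such an integration, with made-up register offsets and state names; it does not reproduce the actual PULPissimo register map or FSM described in the thesis.

from enum import Enum, auto

# Hypothetical memory-mapped register offsets for an accelerator wrapped as
# an SoC peripheral (illustrative only; not the thesis's register map).
REG_CTRL   = 0x00  # write 1 to start the accelerator
REG_STATUS = 0x04  # bit 0 = busy, bit 1 = done
REG_SRC    = 0x08  # input buffer base address
REG_DST    = 0x0C  # output buffer base address

class AccelState(Enum):
    IDLE = auto()
    LOAD = auto()     # fetch operands over the SoC interconnect
    COMPUTE = auto()  # run the accelerator datapath
    DONE = auto()     # signal completion back to the host core

def step(state, start, load_finished, compute_finished):
    # One transition of a simplified accelerator control FSM.
    if state is AccelState.IDLE and start:
        return AccelState.LOAD
    if state is AccelState.LOAD and load_finished:
        return AccelState.COMPUTE
    if state is AccelState.COMPUTE and compute_finished:
        return AccelState.DONE
    if state is AccelState.DONE:
        return AccelState.IDLE
    return state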
first_indexed 2024-10-01T03:53:38Z
format Thesis-Doctor of Philosophy
id ntu-10356/178423
institution Nanyang Technological University
language English
last_indexed 2024-10-01T03:53:38Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/178423 2024-07-05T03:11:43Z Hardware-software co-exploration and optimization for next-generation learning machines Chen, Chunyun Mohamed M. Sabry Aly College of Computing and Data Science msabry@ntu.edu.sg Computer and Information Science ASIC Transformer Vision transformer GELU Softmax LayerNorm Learning machine Doctor of Philosophy 2024-06-20T06:56:42Z 2024-06-20T06:56:42Z 2024 Thesis-Doctor of Philosophy Chen, C. (2024). Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 https://hdl.handle.net/10356/178423 10.32657/10356/178423 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
ASIC
Transformer
Vision transformer
GELU
Softmax
LayerNorm
Learning machine
Chen, Chunyun
Hardware-software co-exploration and optimization for next-generation learning machines
title Hardware-software co-exploration and optimization for next-generation learning machines
title_full Hardware-software co-exploration and optimization for next-generation learning machines
title_fullStr Hardware-software co-exploration and optimization for next-generation learning machines
title_full_unstemmed Hardware-software co-exploration and optimization for next-generation learning machines
title_short Hardware-software co-exploration and optimization for next-generation learning machines
title_sort hardware software co exploration and optimization for next generation learning machines
topic Computer and Information Science
ASIC
Transformer
Vision transformer
GELU
Softmax
LayerNorm
Learning machine
url https://hdl.handle.net/10356/178423
work_keys_str_mv AT chenchunyun hardwaresoftwarecoexplorationandoptimizationfornextgenerationlearningmachines