A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks
Convolutional Neural Networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which makes them both memory prohibitive and computationally expensive. Deploying these models on an embedded mobile platform is resource constrained, with high power consumption and a significant bandwidth requirement to access data from off-chip DRAM...
Main Authors: | Suhas Shivapakash, Hardik Jain, Olaf Hellwich, Friedel Gerfers |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2021-01-01 |
Series: | IEEE Open Journal of Circuits and Systems |
Subjects: | Deep neural network; AlexNet; MobileNet; SqueezeNet; EfficientNet; truncation |
Online Access: | https://ieeexplore.ieee.org/document/9335311/ |
_version_ | 1819163355817967616 |
---|---|
author | Suhas Shivapakash; Hardik Jain; Olaf Hellwich; Friedel Gerfers
author_facet | Suhas Shivapakash; Hardik Jain; Olaf Hellwich; Friedel Gerfers
author_sort | Suhas Shivapakash |
collection | DOAJ |
description | Convolutional Neural Networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which makes them both memory prohibitive and computationally expensive. Deploying these models on an embedded mobile platform is resource constrained, with high power consumption and a significant bandwidth requirement to access data from off-chip DRAM. Reducing the data movement between on-chip memory and off-chip DRAM is the main criterion for achieving high throughput and better overall energy efficiency. Our proposed multi-bit accelerator achieves these goals by truncating the partial-sum (Psum) results of the preceding layer before feeding them into the next layer. We demonstrate the architecture by running inference at 32 bits for the first convolution layers and sequentially truncating bits on the MSB/LSB of the integer and fractional parts, without any further training of the original network. At the last fully connected layer, the top-1 accuracy is maintained down to a reduced bit width of 14, and the top-5 accuracy down to a 10-bit width. The computation engine consists of a systolic array of 1024 processing elements (PEs). Large CNNs such as AlexNet, MobileNet, SqueezeNet and EfficientNet were used as benchmark models, and a Virtex UltraScale FPGA was used to test the architecture. Compared with the 32-bit architecture, the proposed truncation scheme achieves a 49% power reduction and reduces resource utilization by 73.25% for LUTs (look-up tables), 68.76% for FFs (flip-flops), 74.60% for BRAMs (block RAMs) and 79.425% for DSPs (digital signal processors). The design delivers 223.69 GOPS on the Virtex UltraScale FPGA, an overall throughput gain of 3.63× over prior FPGA accelerators. In addition, the overall power consumption is 4.5× lower than that of prior architectures. The ASIC version of the accelerator was designed in a 22 nm FDSOI CMOS process and achieves an overall energy efficiency of 2.03 TOPS/W with a total power consumption of 791 mW and an area of 1 mm × 1.2 mm. (A minimal sketch of the Psum truncation scheme follows the record below.)
first_indexed | 2024-12-22T17:42:49Z |
format | Article |
id | doaj.art-d49c593f68e7493ea26fa05032266638 |
institution | Directory Open Access Journal |
issn | 2644-1225 |
language | English |
last_indexed | 2024-12-22T17:42:49Z |
publishDate | 2021-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Circuits and Systems |
spelling | doaj.art-d49c593f68e7493ea26fa05032266638 | 2022-12-21T18:18:23Z | eng | IEEE | IEEE Open Journal of Circuits and Systems | ISSN 2644-1225 | 2021-01-01 | vol. 2, pp. 156-169 | DOI 10.1109/OJCAS.2020.3047225 | IEEE document 9335311 | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks | Suhas Shivapakash (https://orcid.org/0000-0002-9173-213X), Department of Computer Engineering and Microelectronics, Chair of Mixed Signal Circuit Design, Technical University of Berlin, Berlin, Germany; Hardik Jain (https://orcid.org/0000-0001-9499-8040), Department of Computer Engineering and Microelectronics, Computer Vision and Remote Sensing, Technical University of Berlin, Berlin, Germany; Olaf Hellwich (https://orcid.org/0000-0002-2871-9266), Department of Computer Engineering and Microelectronics, Computer Vision and Remote Sensing, Technical University of Berlin, Berlin, Germany; Friedel Gerfers (https://orcid.org/0000-0002-0520-1923), Department of Computer Engineering and Microelectronics, Chair of Mixed Signal Circuit Design, Technical University of Berlin, Berlin, Germany | Abstract: as given in the description field above | https://ieeexplore.ieee.org/document/9335311/ | Deep neural network; AlexNet; MobileNet; SqueezeNet; EfficientNet; truncation
spellingShingle | Suhas Shivapakash Hardik Jain Olaf Hellwich Friedel Gerfers A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks IEEE Open Journal of Circuits and Systems Deep neural network AlexNet MobileNet SqueezeNet EfficientNet truncation |
title | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks |
title_full | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks |
title_fullStr | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks |
title_full_unstemmed | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks |
title_short | A Power Efficiency Enhancements of a Multi-Bit Accelerator for Memory Prohibitive Deep Neural Networks |
title_sort | power efficiency enhancements of a multi bit accelerator for memory prohibitive deep neural networks |
topic | Deep neural network; AlexNet; MobileNet; SqueezeNet; EfficientNet; truncation
url | https://ieeexplore.ieee.org/document/9335311/ |
work_keys_str_mv | AT suhasshivapakash apowerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT hardikjain apowerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT olafhellwich apowerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT friedelgerfers apowerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT suhasshivapakash powerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT hardikjain powerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT olafhellwich powerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks AT friedelgerfers powerefficiencyenhancementsofamultibitacceleratorformemoryprohibitivedeepneuralnetworks |
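The Psum truncation scheme described in the abstract lends itself to a short illustration. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a Q16.16 fixed-point accumulator and a hypothetical Q6.8 (14-bit) target format, since the abstract specifies only the 32-bit starting width and the 14-bit/10-bit end points, not the exact integer/fraction split.

```python
import numpy as np

def truncate_psum(psum, int_bits, frac_bits, acc_frac_bits=16):
    """Model the paper's Psum truncation: drop LSBs of the fractional part
    by right-shifting, and drop MSBs of the integer part by saturating to
    the narrower signed range. The Q16.16 accumulator split (acc_frac_bits)
    is an assumption for illustration; requires frac_bits <= acc_frac_bits.
    int_bits counts the sign bit, so the output word is int_bits + frac_bits
    bits wide in total."""
    # Quantize the accumulator value to its fixed-point integer representation.
    q = np.round(psum * (1 << acc_frac_bits)).astype(np.int64)
    # Drop LSBs of the fractional part (arithmetic right shift).
    q >>= acc_frac_bits - frac_bits
    # Drop MSBs of the integer part by saturating to the reduced signed range.
    lo = -(1 << (int_bits + frac_bits - 1))
    hi = (1 << (int_bits + frac_bits - 1)) - 1
    q = np.clip(q, lo, hi)
    # Return the real value the truncated word represents.
    return q / float(1 << frac_bits)

# Example: truncate 32-bit Q16.16 partial sums to a 14-bit Q6.8 word.
# Values outside [-32, 32) saturate; small values lose fractional LSBs.
psums = np.array([3.14159, -200.5, 0.0039, 150.75])
print(truncate_psum(psums, int_bits=6, frac_bits=8))
```

Right-shifting before saturating mirrors what a hardware truncation unit would do: the shift discards fractional LSBs, while saturation (rather than wrap-around) bounds the error introduced when integer MSBs are dropped.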