Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights

As a key ingredient of deep neural networks (DNNs), fully-connected (FC) layers are widely used in various artificial intelligence applications. However, FC layers contain a large number of parameters, so their efficient processing is restricted by memory bandwidth. In this paper, we propose a compression approach combining block-circulant matrix-based weight representation and power-of-two quantization. Applying block-circulant matrices in FC layers reduces the storage complexity from O(k²) to O(k). By quantizing the weights into integer powers of two, the multiplications in inference can be replaced by shift and add operations. The memory usage of models for MNIST, CIFAR-10, and ImageNet can be compressed by 171×, 2731×, and 128×, respectively, with minimal accuracy loss. A configurable parallel hardware architecture is then proposed for processing the compressed FC layers efficiently. A multiplier-free block matrix-vector multiplication module (B-MV) serves as the computing kernel. The architecture is flexible enough to support FC layers with various compression ratios at a small footprint, and its configurability significantly reduces memory accesses. Measurement results show that the accelerator delivers 409.6 GOPS of processing throughput and achieves 5.3 TOPS/W energy efficiency at 800 MHz.
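To make the two combined techniques concrete, here is a minimal NumPy sketch, not the authors' hardware implementation: every k×k block of the weight matrix stores only its first row (the O(k²) → O(k) reduction), and every "multiplication" is a scale by a signed power of two. All function names, the exponent range, and the sizes below are illustrative assumptions.

```python
import numpy as np

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round each weight to a signed power of two: w ~= sign * 2**exp.
    The exponent range is an assumption; the paper's bit-width may differ."""
    sign = np.sign(w).astype(int)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-30)), min_exp, max_exp)
    return sign, exp.astype(int)

def circulant_matvec_pow2(sign, exp, x):
    """Multiply one k x k circulant block by x using only rotations, scales by
    powers of two (shifts, in fixed-point hardware), and additions.
    The block is defined by its first row c via C[i, j] = c[(j - i) mod k],
    so y[i] = sum_m c[m] * x[(i + m) mod k]."""
    y = np.zeros_like(x)
    for m in range(len(x)):
        if sign[m] != 0:
            # np.roll(x, -m)[i] == x[(i + m) % k]; ldexp(v, e) == v * 2**e
            y += sign[m] * np.ldexp(np.roll(x, -m), int(exp[m]))
    return y

def block_circulant_matvec(first_rows, x, k):
    """W is a (p*k) x (q*k) block-circulant matrix, but only the first row of
    each k x k block is stored: k values per block instead of k*k."""
    p, q, _ = first_rows.shape  # first_rows[i, j] is block (i, j)'s first row
    y = np.zeros(p * k)
    for i in range(p):
        for j in range(q):
            sign, exp = quantize_pow2(first_rows[i, j])
            y[i*k:(i+1)*k] += circulant_matvec_pow2(sign, exp, x[j*k:(j+1)*k])
    return y

# Usage: an 8x12 weight matrix held as a 2x3 grid of k=4 blocks, i.e. 24
# stored weights instead of 96 -- the per-block O(k^2) -> O(k) reduction.
rng = np.random.default_rng(0)
first_rows = 0.5 * rng.standard_normal((2, 3, 4))
x = rng.standard_normal(12)
y = block_circulant_matvec(first_rows, x, k=4)
```

In fixed-point hardware the `np.ldexp` calls become barrel shifts and the accumulations become adder trees, which is how a B-MV-style kernel can compute the block matrix-vector product without any multipliers.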

Bibliographic Details
Main Authors: Zidi Qin, Di Zhu, Xingwei Zhu, Xuan Chen, Yinghuan Shi, Yang Gao, Zhonghai Lu, Qinghong Shen, Li Li, Hongbing Pan
Format: Article
Language: English
Published: MDPI AG, 2019-01-01
Series: Electronics (Vol. 8, Issue 1, Article 78)
Subjects: hardware acceleration; deep neural networks (DNNs); fully-connected layers; network compression; VLSI
ISSN: 2079-9292
DOI: 10.3390/electronics8010078
Collection: DOAJ (Directory of Open Access Journals)
Online Access: http://www.mdpi.com/2079-9292/8/1/78
Author Affiliations:
Zidi Qin, Di Zhu, Xingwei Zhu, Xuan Chen, Qinghong Shen, Li Li, Hongbing Pan: School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
Yinghuan Shi, Yang Gao: State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Zhonghai Lu: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden