Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression

Current edge devices for neural networks, such as FPGAs, CPLDs, and ASICs, support low bit-width computing to reduce execution latency and energy consumption, but traditional linear quantization can only maintain the inference accuracy of a neural network at bit-widths above 6 bits. Unlike previous studies that address this problem by clipping outliers, this paper proposes a two-stage quantization method. Before converting the weights into fixed-point numbers, the method first prunes the network with unstructured pruning and then clusters the remaining weights with the K-means algorithm to preserve the weight distribution. To overcome the instability of K-means results, particle swarm optimization (PSO) is used to obtain the initial cluster centroids. Experimental results on baseline deep networks such as ResNet-50, Inception-v3, and DenseNet-121 show that the proposed method can produce a 5-bit network with an accuracy loss of less than 5% and a 4-bit network with only a 10% accuracy loss relative to 8-bit quantization. Through quantization and pruning together, the method reduces the model bit-width from 32 to 4 bits and the number of neurons by 80%. It can also be easily integrated into frameworks such as TensorRT and TensorFlow Lite for low bit-width network quantization.

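The abstract outlines a concrete pipeline: unstructured pruning, K-means clustering of the surviving weights with PSO-selected initial centroids, and a final linear (uniform) quantization to low bit-width fixed point. The NumPy sketch below illustrates one plausible reading of that pipeline. It is not the authors' implementation: the magnitude-based pruning criterion, the PSO fitness and constants, the 80% sparsity, the 4-bit target, and all function names are illustrative assumptions.

```python
# Illustrative sketch only; the paper's exact formulation may differ.
import numpy as np

def prune_unstructured(w, sparsity=0.8):
    # Stage 1 (assumed magnitude-based): zero the smallest |w|, keep top (1 - sparsity).
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) > threshold, w, 0.0)

def pso_init_centroids(values, k, n_particles=20, iters=50, seed=0):
    # Global-best PSO over candidate centroid sets; fitness (an assumption) is the
    # total squared distance of each weight to its nearest centroid.
    rng = np.random.default_rng(seed)
    lo, hi = values.min(), values.max()
    pos = rng.uniform(lo, hi, size=(n_particles, k))   # each particle = k centroids
    vel = np.zeros_like(pos)

    def fitness(c):
        return np.square(np.abs(values[:, None] - c[None, :]).min(axis=1)).sum()

    pbest = pos.copy()
    pbest_f = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return np.sort(gbest)

def kmeans_1d(values, centroids, iters=30):
    # Plain Lloyd iterations on the 1-D weight values, seeded by PSO above.
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(len(centroids)):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def linear_quantize(w, n_bits=4):
    # Stage 2: symmetric uniform ("linear") quantization to signed n-bit fixed point.
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# End-to-end on one synthetic weight tensor:
w = np.random.randn(4096)
w = prune_unstructured(w, sparsity=0.8)                # prune 80% of weights
nz = w[w != 0]
centroids = kmeans_1d(nz, pso_init_centroids(nz, k=15))
nearest = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
w[w != 0] = centroids[nearest]                         # snap weights to centroids
q, scale = linear_quantize(w, n_bits=4)                # convert to 4-bit fixed point
print("levels used:", np.unique(q).size, "scale:", scale)
```

In this sketch the 15 nonzero centroids plus the pruned zero fill the 16 representable 4-bit levels, so clustering before the fixed-point conversion concentrates the few available levels where the weight distribution actually has mass, which matches the intuition the abstract gives for the two-stage design.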

Bibliographic Details
Main Authors: Wenxin Yang, Xiaoli Zhi, Weiqin Tong
Affiliations: School of Computer Engineering & Science, Shanghai University, Shanghai 200444, China (Yang); Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China (Zhi, Tong)
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Algorithms, vol. 16, no. 1, article 31
ISSN: 1999-4893
DOI: 10.3390/a16010031
Subjects: deep neural network; quantized neural networks; particle swarm optimization; clustering
Online Access: https://www.mdpi.com/1999-4893/16/1/31