Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression

Current edge devices for neural networks, such as FPGAs, CPLDs, and ASICs, support low bit-width computing to reduce execution latency and energy consumption, but traditional linear quantization can only maintain the inference accuracy of a neural network at bit-widths above 6 bits. Unlike previous studies that address this problem by clipping outliers, this paper proposes a two-stage quantization method. Before converting the weights into fixed-point numbers, the method first prunes the network with unstructured pruning and then clusters the remaining weights with the K-means algorithm to preserve the weight distribution. To overcome the instability of K-means results, particle swarm optimization (PSO) is used to obtain the initial cluster centroids. Experimental results on baseline deep networks such as ResNet-50, Inception-v3, and DenseNet-121 show that the proposed method can produce a 5-bit network with an accuracy loss of less than 5% and a 4-bit network with only a 10% accuracy loss relative to 8-bit quantization. Through quantization and pruning together, the method reduces the model bit-width from 32 to 4 bits and the number of neurons by 80%. It can also be easily integrated into frameworks such as TensorRT and TensorFlow Lite for low bit-width network quantization.

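The abstract outlines a concrete pipeline: unstructured pruning, K-means clustering of the surviving weights with PSO-selected initial centroids, and a final linear (uniform) quantization to low bit-width fixed point. The NumPy sketch below illustrates one plausible reading of that pipeline. It is not the authors' implementation: the magnitude-based pruning criterion, the PSO fitness and constants, the 80% sparsity, the 4-bit target, and all function names are illustrative assumptions.

```python
# Illustrative sketch only; the paper's exact formulation may differ.
import numpy as np

def prune_unstructured(w, sparsity=0.8):
    # Stage 1 (assumed magnitude-based): zero the smallest |w|, keep top (1 - sparsity).
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) > threshold, w, 0.0)

def pso_init_centroids(values, k, n_particles=20, iters=50, seed=0):
    # Global-best PSO over candidate centroid sets; fitness (an assumption) is the
    # total squared distance of each weight to its nearest centroid.
    rng = np.random.default_rng(seed)
    lo, hi = values.min(), values.max()
    pos = rng.uniform(lo, hi, size=(n_particles, k))   # each particle = k centroids
    vel = np.zeros_like(pos)

    def fitness(c):
        return np.square(np.abs(values[:, None] - c[None, :]).min(axis=1)).sum()

    pbest = pos.copy()
    pbest_f = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return np.sort(gbest)

def kmeans_1d(values, centroids, iters=30):
    # Plain Lloyd iterations on the 1-D weight values, seeded by PSO above.
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(len(centroids)):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def linear_quantize(w, n_bits=4):
    # Stage 2: symmetric uniform ("linear") quantization to signed n-bit fixed point.
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# End-to-end on one synthetic weight tensor:
w = np.random.randn(4096)
w = prune_unstructured(w, sparsity=0.8)                # prune 80% of weights
nz = w[w != 0]
centroids = kmeans_1d(nz, pso_init_centroids(nz, k=15))
nearest = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
w[w != 0] = centroids[nearest]                         # snap weights to centroids
q, scale = linear_quantize(w, n_bits=4)                # convert to 4-bit fixed point
print("levels used:", np.unique(q).size, "scale:", scale)
```

In this sketch the 15 nonzero centroids plus the pruned zero fill the 16 representable 4-bit levels, so clustering before the fixed-point conversion concentrates the few available levels where the weight distribution actually has mass, which matches the intuition the abstract gives for the two-stage design.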

Bibliographic Details
Main Authors: Wenxin Yang, Xiaoli Zhi, Weiqin Tong
Affiliations: School of Computer Engineering & Science, Shanghai University, Shanghai 200444, China (Yang); Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China (Zhi, Tong)
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Algorithms, vol. 16, no. 1, article 31
ISSN: 1999-4893
DOI: 10.3390/a16010031
Subjects: deep neural network; quantized neural networks; particle swarm optimization; clustering
Online Access: https://www.mdpi.com/1999-4893/16/1/31