Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression
Main Authors: | Wenxin Yang, Xiaoli Zhi, Weiqin Tong |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-01-01 |
Series: | Algorithms |
Subjects: | deep neural network; quantized neural networks; particle swarm optimization; clustering |
Online Access: | https://www.mdpi.com/1999-4893/16/1/31 |
author | Wenxin Yang; Xiaoli Zhi; Weiqin Tong
collection | DOAJ |
description | Current edge devices for neural networks, such as FPGAs, CPLDs, and ASICs, support low bit-width computing to improve execution latency and energy efficiency, but traditional linear quantization can only maintain the inference accuracy of neural networks at bit-widths above 6 bits. Unlike previous studies that address this problem by clipping outliers, this paper proposes a two-stage quantization method. Before converting the weights into fixed-point numbers, the network is first pruned by unstructured pruning, and the K-means algorithm is then used to cluster the weights in advance so as to preserve their distribution. To address the instability of the K-means results, the particle swarm optimization (PSO) algorithm is used to obtain the initial cluster centroids. Experimental results on baseline deep networks such as ResNet-50, Inception-v3, and DenseNet-121 show that the proposed optimized quantization method can generate a 5-bit network with an accuracy loss of less than 5%, and a 4-bit network with only 10% accuracy loss, compared to 8-bit quantization. Through quantization and pruning, the method reduces the model bit-width from 32 to 4 and the number of neurons by 80%. It can also be easily integrated into frameworks such as TensorRT and TensorFlow Lite for low bit-width network quantization. |
format | Article |
id | doaj.art-a3e5798090544e1c903407735e1563be |
institution | Directory Open Access Journal |
issn | 1999-4893 |
language | English |
publishDate | 2023-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Algorithms |
doi | 10.3390/a16010031
citation | Algorithms, vol. 16, no. 1, article 31, 2023
affiliations | Wenxin Yang: School of Computer Engineering & Science, Shanghai University, Shanghai 200444, China; Xiaoli Zhi: Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China; Weiqin Tong: Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China
title | Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression |
topic | deep neural network; quantized neural networks; particle swarm optimization; clustering
url | https://www.mdpi.com/1999-4893/16/1/31 |