ORSAS: An Output Row-Stationary Accelerator for Sparse Neural Networks
Various pruning techniques and network compression methods make modern neural networks sparse in both weights and activations. However, GPUs (graphics processing units) and most customized CNN (convolutional neural network) accelerators do not take advantage of this sparsity, and recent accelerators for sparse neural networks suffer from low computational resource utilization. This paper first proposes an output row-stationary dataflow that exploits the sparsity of both weights and activations. It allows the accelerator to process weights and activations in their compressed form, leading to high utilization of the computational resources, i.e., the multipliers. In addition, a low-cost compression algorithm is adopted for both weights and input activations to reduce the power consumption of data access. Second, a Y-buffer is proposed to eliminate repeated reading of input activations caused by halo effects, which arise when tiling large input feature maps (ifmaps). Third, an interleaved broadcasting mechanism is introduced to alleviate the load-imbalance problems caused by the irregularity of sparse data. Finally, a prototype design called ORSAS is synthesized, placed, and routed in a SMIC 55 nm process. The evaluation results show that ORSAS occupies the smallest logic cell area and achieves the highest multiplier utilization among peer works when sparsity is high. ORSAS keeps multiplier utilization between 60% and 90% over the convolutional layers of popular sparse CNNs and achieves ultra-low power consumption and the highest efficiency.
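As a rough illustration of the abstract's central idea (processing both operands in compressed form so that multipliers only ever see nonzero pairs), here is a minimal zero-skipping sketch in plain Python. It is a generic example, not ORSAS's actual dataflow, compression format, or hardware mapping; the function and variable names (`compress`, `sparse_conv1d_row`, etc.) are illustrative assumptions.

```python
# Generic zero-skipping sketch (NOT the ORSAS dataflow or its compression format):
# one output row of a 1-D convolution is accumulated while both the weights and
# the input activations are kept as compressed (position, value) pairs, so every
# multiplication uses two nonzero operands.

def compress(dense_row):
    """Keep only the nonzero entries of a row as (position, value) pairs."""
    return [(i, v) for i, v in enumerate(dense_row) if v != 0]

def sparse_conv1d_row(ifmap_row, kernel_row):
    """Stride-1, no-padding 1-D convolution of one ifmap row with one kernel row,
    skipping all zero operands on both sides."""
    out_width = len(ifmap_row) - len(kernel_row) + 1
    out = [0] * out_width                      # the stationary output row
    w_nz = compress(kernel_row)                # nonzero weights only
    a_nz = compress(ifmap_row)                 # nonzero activations only
    for wx, wval in w_nz:
        for ax, aval in a_nz:
            ox = ax - wx                       # output column this pair contributes to
            if 0 <= ox < out_width:
                out[ox] += wval * aval         # every MAC is a nonzero-by-nonzero product
    return out

# Tiny example: 3 of 8 activations and 2 of 3 weights are nonzero,
# so only 2 * 3 = 6 candidate products are ever formed.
ifmap_row  = [0, 3, 0, 0, 2, 0, 1, 0]
kernel_row = [1, 0, 2]
print(sparse_conv1d_row(ifmap_row, kernel_row))   # -> [0, 3, 4, 0, 4, 0]
```

In hardware terms, pairing only nonzero weights with nonzero activations is what keeps multiplier utilization high; how those pairs are scheduled across processing elements, how the Y-buffer reuses tiled input rows, and how interleaved broadcasting balances the load are specific to ORSAS and are described only in the full article, not modeled in this sketch.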
Main Authors: | Chenxiao Lin; Yuezhong Liu; Delong Shang |
---|---|
Author Affiliations: | Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, Chaoyang, China (all three authors); ORCID: Chenxiao Lin 0000-0001-8109-5328, Yuezhong Liu 0009-0007-1875-7262 |
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access, vol. 11, pp. 44123-44135 |
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2023.3272564 |
Subjects: | Convolution dataflow; data compression; imbalance; halo effects; high utilization; sparse neural network accelerator |
Collection: | DOAJ (Directory of Open Access Journals) |
Online Access: | https://ieeexplore.ieee.org/document/10114404/ |