A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator
Depthwise separable convolution (DSC) significantly reduces the number of parameters and floating-point operations with an acceptable loss of accuracy and has been widely used in various lightweight convolutional neural network (CNN) models. In practical applications, however, DSC accelerators based on graphics processi...
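The reduction in parameters and operations mentioned in the summary can be illustrated with a short calculation. The following is a minimal sketch, not taken from the article: it assumes stride 1, "same" padding (so the output keeps the input's spatial size), and no bias terms, and the function names and the example layer shape are hypothetical.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Parameters and multiply-accumulates (MACs) of a standard k x k convolution.

    Assumes stride 1, 'same' padding, and no bias (illustrative assumptions).
    """
    params = k * k * c_in * c_out
    macs = h * w * params  # every output pixel uses each weight once
    return params, macs


def dsc_cost(h, w, c_in, c_out, k):
    """Parameters and MACs of a depthwise separable convolution.

    Depthwise stage: one k x k filter per input channel.
    Pointwise stage: a 1 x 1 convolution mixing c_in channels into c_out.
    Same stride, padding, and bias assumptions as above.
    """
    dw_params = k * k * c_in
    pw_params = c_in * c_out
    params = dw_params + pw_params
    macs = h * w * params
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 64 -> 64 channels, 3 x 3 kernel.
    std_params, std_macs = standard_conv_cost(56, 56, 64, 64, 3)
    dsc_params, dsc_macs = dsc_cost(56, 56, 64, 64, 3)
    print("standard params:", std_params)  # 36864
    print("DSC params:", dsc_params)       # 4672
    print("cost ratio:", dsc_params / std_params)
```

Under these assumptions the ratio of DSC cost to standard-convolution cost is 1/c_out + 1/k², which is why the saving grows with the number of output channels and the kernel size.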
Main Authors: | Jiye Huang, Xin Liu, Tongdong Guo, Zhijin Zhao |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-03-01 |
Series: | Electronics |
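Subjects: | convolutional neural network; depthwise separable convolution; field programmable gate array; hardware accelerator; MobileNetV2 |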
Online Access: | https://www.mdpi.com/2079-9292/12/7/1571 |
_version_ | 1797608108357320704 |
---|---|
author | Jiye Huang Xin Liu Tongdong Guo Zhijin Zhao |
author_facet | Jiye Huang Xin Liu Tongdong Guo Zhijin Zhao |
author_sort | Jiye Huang |
collection | DOAJ |
description | Depthwise separable convolution (DSC) significantly reduces the number of parameters and floating-point operations with an acceptable loss of accuracy and has been widely used in various lightweight convolutional neural network (CNN) models. In practical applications, however, DSC accelerators based on graphics processing units (GPUs) cannot fully exploit the performance of DSC and are unsuitable for mobile application scenarios. Moreover, low resource utilization due to idle engines is a common problem in DSC accelerator design. In this paper, a high-performance DSC hardware accelerator based on field-programmable gate arrays (FPGAs) is proposed. A highly reusable and scalable multiplication and accumulation engine is proposed to improve the utilization of computational resources. An efficient convolution algorithm is proposed for depthwise convolution (DWC) and pointwise convolution (PWC), respectively, to reduce the on-chip memory occupancy. Meanwhile, the proposed convolution algorithms achieve partial fusion between PWC and DWC, and improve the off-chip memory access efficiency. To maximise bandwidth utilization and reduce latency when reading feature maps, an address mapping method for off-chip accesses is proposed. The performance of the proposed accelerator is demonstrated by implementing MobileNetV2 on an Intel Arria 10 GX660 FPGA by using Verilog HDL. The experimental results show that the proposed DSC accelerator achieves a performance of 205.1 FPS, 128.8 GFLOPS, and 0.24 GOPS/DSP for input images of size 224 × 224 × 3. |
first_indexed | 2024-03-11T05:40:04Z |
format | Article |
id | doaj.art-9f94790224c24541be1f5a770e6a71b8 |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-11T05:40:04Z |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
spelling | doaj.art-9f94790224c24541be1f5a770e6a71b8; 2023-11-17T16:32:29Z; eng; MDPI AG; Electronics; 2079-9292; 2023-03-01; 12; 7; 1571; 10.3390/electronics12071571; A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator; Jiye Huang (0); Xin Liu (1); Tongdong Guo (2); Zhijin Zhao (3); (0) The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China; (1) The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China; (2) The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China; (3) The School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China; Depthwise separable convolution (DSC) significantly reduces the number of parameters and floating-point operations with an acceptable loss of accuracy and has been widely used in various lightweight convolutional neural network (CNN) models. In practical applications, however, DSC accelerators based on graphics processing units (GPUs) cannot fully exploit the performance of DSC and are unsuitable for mobile application scenarios. Moreover, low resource utilization due to idle engines is a common problem in DSC accelerator design. In this paper, a high-performance DSC hardware accelerator based on field-programmable gate arrays (FPGAs) is proposed. A highly reusable and scalable multiplication and accumulation engine is proposed to improve the utilization of computational resources. An efficient convolution algorithm is proposed for depthwise convolution (DWC) and pointwise convolution (PWC), respectively, to reduce the on-chip memory occupancy. Meanwhile, the proposed convolution algorithms achieve partial fusion between PWC and DWC, and improve the off-chip memory access efficiency. To maximise bandwidth utilization and reduce latency when reading feature maps, an address mapping method for off-chip accesses is proposed. The performance of the proposed accelerator is demonstrated by implementing MobileNetV2 on an Intel Arria 10 GX660 FPGA by using Verilog HDL. The experimental results show that the proposed DSC accelerator achieves a performance of 205.1 FPS, 128.8 GFLOPS, and 0.24 GOPS/DSP for input images of size 224 × 224 × 3.; https://www.mdpi.com/2079-9292/12/7/1571; convolutional neural network; depthwise separable convolution; field programmable gate array; hardware accelerator; MobileNetV2 |
spellingShingle | Jiye Huang Xin Liu Tongdong Guo Zhijin Zhao A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator Electronics convolutional neural network depthwise separable convolution field programmable gate array hardware accelerator MobileNetV2 |
title | A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator |
title_full | A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator |
title_fullStr | A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator |
title_full_unstemmed | A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator |
title_short | A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator |
title_sort | high performance fpga based depthwise separable convolution accelerator |
topic | convolutional neural network depthwise separable convolution field programmable gate array hardware accelerator MobileNetV2 |
url | https://www.mdpi.com/2079-9292/12/7/1571 |