A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator

Depthwise separable convolution (DSC) significantly reduces the number of parameters and floating-point operations with an acceptable loss of accuracy and has been widely used in various lightweight convolutional neural network (CNN) models. In practical applications, however, DSC accelerators based on graphics processi...

Bibliographic Details
Main Authors: Jiye Huang, Xin Liu, Tongdong Guo, Zhijin Zhao
Format: Article
Language: English
Published: MDPI AG 2023-03-01
Series: Electronics
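Subjects: convolutional neural network; depthwise separable convolution; field programmable gate array; hardware accelerator; MobileNetV2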
Online Access: https://www.mdpi.com/2079-9292/12/7/1571
_version_ 1797608108357320704
author Jiye Huang
Xin Liu
Tongdong Guo
Zhijin Zhao
author_facet Jiye Huang
Xin Liu
Tongdong Guo
Zhijin Zhao
author_sort Jiye Huang
collection DOAJ
description Depthwise separable convolution (DSC) significantly reduces the number of parameters and floating-point operations with an acceptable loss of accuracy and has been widely used in various lightweight convolutional neural network (CNN) models. In practical applications, however, DSC accelerators based on graphics processing units (GPUs) cannot fully exploit the performance of DSC and are unsuitable for mobile application scenarios. Moreover, low resource utilization due to idle engines is a common problem in DSC accelerator design. In this paper, a high-performance DSC hardware accelerator based on field-programmable gate arrays (FPGAs) is proposed. A highly reusable and scalable multiplication and accumulation engine is proposed to improve the utilization of computational resources. Efficient convolution algorithms are proposed for depthwise convolution (DWC) and pointwise convolution (PWC), respectively, to reduce on-chip memory occupancy. The proposed convolution algorithms also achieve partial fusion between PWC and DWC and improve off-chip memory access efficiency. To maximize bandwidth utilization and reduce latency when reading feature maps, an address mapping method for off-chip accesses is proposed. The performance of the proposed accelerator is demonstrated by implementing MobileNetV2 on an Intel Arria 10 GX660 FPGA using Verilog HDL. The experimental results show that the proposed DSC accelerator achieves a performance of 205.1 FPS, 128.8 GFLOPS, and 0.24 GOPS/DSP for input images of size 224 × 224 × 3.
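The reduction in parameters and operations claimed in the description can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes stride 1, "same" padding, no bias terms, and a hypothetical layer shape (56 × 56 feature map, 64 input channels, 128 output channels, 3 × 3 kernel). The last lines simply divide the two throughput figures quoted in the description, under the assumption that the reported GFLOPS is a sustained average over a whole frame.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Parameters and multiply-accumulates (MACs) of a standard k x k convolution."""
    params = k * k * c_in * c_out
    macs = h * w * params  # one MAC per weight per output position
    return params, macs


def dsc_cost(h, w, c_in, c_out, k):
    """Parameters and MACs of a depthwise (k x k) plus pointwise (1 x 1) pair."""
    dw_params = k * k * c_in   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = dw_params + pw_params
    macs = h * w * params
    return params, macs


# Hypothetical layer shape, chosen only for illustration.
h, w, c_in, c_out, k = 56, 56, 64, 128, 3

std_params, std_macs = standard_conv_cost(h, w, c_in, c_out, k)
dsc_params, dsc_macs = dsc_cost(h, w, c_in, c_out, k)

# The cost ratio reduces to 1/c_out + 1/k^2 for both parameters and MACs.
ratio = dsc_params / std_params
print(f"standard: {std_params} params, {std_macs} MACs")
print(f"DSC:      {dsc_params} params, {dsc_macs} MACs")
print(f"DSC / standard = {ratio:.4f}")

# Figures quoted in the description; the quotient is an inference, not a reported result.
reported_gflops = 128.8
reported_fps = 205.1
gflop_per_frame = reported_gflops / reported_fps
print(f"implied work per frame: {gflop_per_frame:.3f} GFLOP")
```

For this layer shape the depthwise separable form needs roughly 12% of the parameters and MACs of the standard convolution, and the quoted throughput figures imply roughly 0.63 GFLOP of work per 224 × 224 × 3 input frame.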
first_indexed 2024-03-11T05:40:04Z
format Article
id doaj.art-9f94790224c24541be1f5a770e6a71b8
institution Directory Open Access Journal
issn 2079-9292
language English
last_indexed 2024-03-11T05:40:04Z
publishDate 2023-03-01
publisher MDPI AG
record_format Article
series Electronics
spelling doaj.art-9f94790224c24541be1f5a770e6a71b8
2023-11-17T16:32:29Z
eng
MDPI AG
Electronics
2079-9292
2023-03-01
12
7
1571
10.3390/electronics12071571
A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator
Jiye Huang (The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China)
Xin Liu (The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China)
Tongdong Guo (The School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China)
Zhijin Zhao (The School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)
https://www.mdpi.com/2079-9292/12/7/1571
convolutional neural network
depthwise separable convolution
field programmable gate array
hardware accelerator
MobileNetV2
title A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator
title_sort high performance fpga based depthwise separable convolution accelerator
topic convolutional neural network
depthwise separable convolution
field programmable gate array
hardware accelerator
MobileNetV2
url https://www.mdpi.com/2079-9292/12/7/1571