Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using General Matrix Multiplication Principle

This paper presents a comprehensive hardware accelerator architecture of YOLOv3-Tiny targeted for low-cost FPGA with a high frame rate, high accuracy, and low latency. The proposed accelerator implements all YOLO layers in hardware including zero pad layer, convolution layer, leaky ReLU layer, batch...

Full description

Bibliographic Details
Main Authors: Trio Adiono, Adiwena Putra, Nana Sutisna, Infall Syafalni, Rahmat Mulyawan
Format: Article
Language:English
Published: IEEE 2021-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/9576092/
Description
Summary:This paper presents a comprehensive hardware accelerator architecture of YOLOv3-Tiny targeted for low-cost FPGA with a high frame rate, high accuracy, and low latency. The proposed accelerator implements all YOLO layers in hardware including zero pad layer, convolution layer, leaky ReLU layer, batch normalization layer, max-pooling layer, and up-sampling layer. The architecture is built based on data flow and control flow hybrid architecture. The data preparation and computation process work asynchronously using the data flow paradigm, while the overall governing process is controlled by proposed custom instruction set which adopts the principle of control flow paradigm. The principle of General Matrix Multiplication (GEMM) is adopted to compute the convolution process. We designed a GEMM processor using an optimum size of systolic array architecture. The systolic core is small and the overall architecture supports the multicore system, making it scalable to be implemented on larger size FPGAs. We also proposed a hardware architecture for mapping feature maps into matrix form for GEMM convolution which can save on-chip memory space. Lastly, we modified the original YOLO algorithm to further optimize it in our hardware. The modification includes reducing the bit precision to reduce memory space and bandwidth requirement, merging the normalization layer with the convolution layer to reduce arithmetic complexity, and adding a new DLQ layer to keep the bit size small while maintaining the accuracy. Based on the experimental results, our proposed design manages to achieve a frame rate of 8.3 FPS with the throughput of 31.5 GOPS, outperforming the same convolution computation that is performed by Ryzen 5 3600 CPU up to <inline-formula> <tex-math notation="LaTeX">$69.3\times $ </tex-math></inline-formula> in latency. Moreover, our proposed design also has the smallest clock cycle ratio up to <inline-formula> <tex-math notation="LaTeX">$1.75\times $ </tex-math></inline-formula> than other commercial accelerators. The system is useful and suitable for edge computing applications.
ISSN:2169-3536