Performance Evaluation of INT8 Quantized Inference on Mobile GPUs

During the past several years, the need for on-device deep learning has grown rapidly, and the performance of mobile GPUs has improved significantly. As a viable approach to efficient on-device deep learning, INT8 quantized inference has been actively studied and proposed, but few frameworks currently support INT8 quantization for mobile GPUs.


Bibliographic Details
Main Authors: Sumin Kim, Gunju Park, Youngmin Yi
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: On-device deep learning; INT8 quantization; INT8 Winograd convolution; mobile GPU
Online Access: https://ieeexplore.ieee.org/document/9638444/
_version_ 1819320532254851072
author Sumin Kim
Gunju Park
Youngmin Yi
author_facet Sumin Kim
Gunju Park
Youngmin Yi
author_sort Sumin Kim
collection DOAJ
description During the past several years, the need for on-device deep learning has grown rapidly, and the performance of mobile GPUs has improved significantly. As a viable approach to efficient on-device deep learning, INT8 quantized inference has been actively studied and proposed, but few frameworks currently support INT8 quantization for mobile GPUs. This paper presents a unified framework that integrates various INT8 quantization methods, such as symmetric, asymmetric, per-layer, and per-channel, and discusses their impact on accuracy and efficiency on recent mobile GPUs. Moreover, we discuss the performance and accuracy of INT8 quantized Winograd convolution and propose INT8 Winograd convolution with F(2×2, 3×3), where weight tensors are quantized in INT4 and input tensors are quantized in INT6. We evaluated the performance of the INT8 methods, including INT8 Winograd, for ResNet50, MobileNet-v1, and VGG16 on Mali G52, G72, and G76 GPUs in the Odroid N2, Galaxy S9, and Galaxy Note 10+, respectively. INT8 quantized inference based on General Matrix Multiplication (GEMM) was 1.67× faster than FP32 GEMM for ResNet50 on Mali G52, and was further accelerated by batch-normalization folding and by the proposed INT8 Winograd convolution, achieving a 2.45× speedup in total with an accuracy loss of only 0.31%p.
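The abstract names its techniques concretely enough that short illustrative sketches may be useful. The Python below is not the authors' framework, and all function names are hypothetical. The first sketch contrasts symmetric per-channel quantization (as typically applied to weights) with asymmetric per-layer quantization (as typically applied to activations), and shows batch-normalization folding, the graph-level optimization the abstract credits with part of the speedup:

```python
import numpy as np

def quantize_symmetric_per_channel(w, num_bits=8, channel_axis=0):
    """Symmetric quantization: zero point fixed at 0, one scale per channel.
    num_bits=4 gives the INT4 weight quantization mentioned in the abstract."""
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for INT8, 7 for INT4
    reduce_axes = tuple(i for i in range(w.ndim) if i != channel_axis)
    scale = np.max(np.abs(w), axis=reduce_axes, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric_per_layer(x, num_bits=8):
    """Asymmetric (affine) quantization: a zero point shifts the whole range."""
    qmin, qmax = 0, 2 ** num_bits - 1                    # unsigned range, e.g. 0..255
    lo, hi = float(x.min()), float(x.max())
    scale = max((hi - lo) / (qmax - qmin), 1e-8)         # avoid div-by-zero on constant input
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into the preceding conv's OIHW weights and bias,
    so no separate BN layer remains in the quantized inference graph."""
    std = np.sqrt(var + eps)
    w_folded = w * (gamma / std).reshape(-1, 1, 1, 1)    # scale each output channel
    b_folded = (b - mean) * gamma / std + beta
    return w_folded, b_folded
```

Per-channel symmetric quantization is the usual choice for weights in GEMM-based INT8 kernels, since a nonzero weight zero point would add correction terms to every inner product.

The second sketch is an FP32 reference for Winograd F(2×2, 3×3) using the standard transform matrices; in the INT8 Winograd scheme the abstract proposes, the filter g and input tile d would be quantized (to INT4 and INT6, respectively) before the G- and B-transforms, a step omitted here:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T               # transformed filter (4x4)
    V = B_T @ d @ B_T.T           # transformed input tile (4x4)
    return A_T @ (U * V) @ A_T.T  # elementwise product, then inverse transform

# Sanity check against direct 3x3 convolution (valid padding).
d = np.random.randn(4, 4).astype(np.float32)
g = np.random.randn(3, 3).astype(np.float32)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct, atol=1e-4)
```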
first_indexed 2024-12-24T11:21:04Z
format Article
id doaj.art-a9f8d8e79c5a430a98dbfac864894da4
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-24T11:21:04Z
publishDate 2021-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-a9f8d8e79c5a430a98dbfac864894da4
indexed 2022-12-21T16:58:14Z
language eng
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
volume 9
pages 164245-164255
doi 10.1109/ACCESS.2021.3133100
ieee_document 9638444
title Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
authors Sumin Kim (https://orcid.org/0000-0001-7747-2143), Gunju Park (https://orcid.org/0000-0002-6734-8648), Youngmin Yi (https://orcid.org/0000-0001-9802-2109)
affiliation Department of Electrical and Computer Engineering, University of Seoul, Dongdaemun-gu, Seoul, South Korea (all three authors)
url https://ieeexplore.ieee.org/document/9638444/
topic On-device deep learning; INT8 quantization; INT8 Winograd convolution; mobile GPU
spellingShingle Sumin Kim
Gunju Park
Youngmin Yi
Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
IEEE Access
On-device deep learning
INT8 quantization
INT8 Winograd convolution
mobile GPU
title Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
title_full Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
title_fullStr Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
title_full_unstemmed Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
title_short Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
title_sort performance evaluation of int8 quantized inference on mobile gpus
topic On-device deep learning
INT8 quantization
INT8 Winograd convolution
mobile GPU
url https://ieeexplore.ieee.org/document/9638444/
work_keys_str_mv AT suminkim performanceevaluationofint8quantizedinferenceonmobilegpus
AT gunjupark performanceevaluationofint8quantizedinferenceonmobilegpus
AT youngminyi performanceevaluationofint8quantizedinferenceonmobilegpus