Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
Nowadays, convolutional neural networks are among the most widely used types of deep learning networks, thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it requires no extra memory. In this work, we evaluate, on various multi-core CPUs, the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67× speedup), matrix-matrix multiplication-based convolution on a multi-core system.
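The abstract describes direct convolution as a series of nested loops iterating over the tensor dimensions. A minimal sketch of that loop nest, in plain Python with nested lists standing in for tensors (stride 1, no padding, channels-first layout; this is an illustration, not code from the paper, and all names are chosen here):

```python
# Direct convolution as nested loops: no flattening step, no extra memory
# beyond the output tensor. Input layout is (C, H, W); the kernel is
# (M output channels, C input channels, KH, KW). Stride 1, no padding.
def direct_conv2d(inp, kernel):
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, KH, KW = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):                # output channels
        for oh in range(OH):          # output rows
            for ow in range(OW):      # output columns
                acc = 0.0
                for c in range(C):            # input channels
                    for kh in range(KH):      # kernel rows
                        for kw in range(KW):  # kernel columns
                            acc += inp[c][oh + kh][ow + kw] * kernel[m][c][kh][kw]
                out[m][oh][ow] = acc
    return out
```

The order of these loops is precisely one of the parameters the paper tunes: reordering them changes memory-access locality and which loop a compiler can SIMD-vectorize, without changing the result.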
| Main Authors: | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2023-01-01 |
| Series: | IEEE Access |
| Subjects: | Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
| Online Access: | https://ieeexplore.ieee.org/document/10144741/ |
_version_ | 1797743005881335808 |
author | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini
author_facet | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini
author_sort | Mirco Mannino |
collection | DOAJ |
description | Nowadays, convolutional neural networks are among the most widely used types of deep learning networks, thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach uses extra time and memory to transform and store the tensors involved. For this reason, <italic>direct convolution</italic> is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it requires no extra memory. In this work, we evaluate, on various multi-core CPUs, the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to <inline-formula> <tex-math notation="LaTeX">$1.67\times $ </tex-math></inline-formula> speedup), matrix-matrix multiplication-based convolution on a multi-core system. |
first_indexed | 2024-03-12T14:48:22Z |
format | Article |
id | doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T14:48:22Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3; 2023-08-15T23:00:21Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 57514-57528; 10.1109/ACCESS.2023.3283312; 10144741; Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors; Mirco Mannino (https://orcid.org/0000-0003-1660-3984), Biagio Peccerillo (https://orcid.org/0000-0002-4998-0092), Andrea Mondelli, Sandro Bartolini (https://orcid.org/0000-0002-7975-3632); Department of Information Engineering and Mathematics, University of Siena, Siena, Italy (Mannino, Peccerillo, Bartolini); Huawei Technologies Company Ltd., Shenzhen, China (Mondelli); abstract as in the description field; https://ieeexplore.ieee.org/document/10144741/; Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
spellingShingle | Mirco Mannino; Biagio Peccerillo; Andrea Mondelli; Sandro Bartolini; Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors; IEEE Access; Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
title | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_full | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_fullStr | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_full_unstemmed | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_short | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_sort | analysis and optimization of direct convolution execution on multi core processors |
topic | Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
url | https://ieeexplore.ieee.org/document/10144741/ |
work_keys_str_mv | AT mircomannino analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT biagiopeccerillo analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT andreamondelli analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT sandrobartolini analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors |
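For contrast with the direct method, the matrix-multiplication-based approach that the description field mentions (flatten the tensors into 2D matrices, then perform one matrix-matrix product) can be sketched as follows. This is an illustration under common "im2col" assumptions (stride 1, no padding, channels-first layout), not code from the paper; the plain-Python triple loop stands in for the optimized GEMM routine a real library would call:

```python
# im2col-style convolution: copy every receptive field into a column of a
# (C*KH*KW) x (OH*OW) patch matrix, flatten the kernel to M x (C*KH*KW),
# and compute all outputs with one matrix-matrix multiplication. Building
# the patch matrix is the "extra time and memory" cost the abstract notes.
def im2col_conv2d(inp, kernel):
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, KH, KW = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    # Patch matrix: rows indexed by (c, kh, kw), columns by (oh, ow).
    cols = [[inp[c][oh + kh][ow + kw]
             for oh in range(OH) for ow in range(OW)]
            for c in range(C) for kh in range(KH) for kw in range(KW)]
    # Kernel flattened to M x (C*KH*KW), matching the row order of cols.
    kmat = [[kernel[m][c][kh][kw]
             for c in range(C) for kh in range(KH) for kw in range(KW)]
            for m in range(M)]
    # Plain matrix product; real implementations call a tuned GEMM here.
    flat = [[sum(kmat[m][k] * cols[k][p] for k in range(len(cols)))
             for p in range(OH * OW)] for m in range(M)]
    # Reshape M x (OH*OW) back to M x OH x OW.
    return [[[flat[m][oh * OW + ow] for ow in range(OW)] for oh in range(OH)]
            for m in range(M)]
```

Both approaches compute the same outputs; the trade-off the article studies is that the direct loop nest avoids the patch-matrix copy at the price of needing careful loop ordering and vectorization to match GEMM's cache behavior.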