Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors


Bibliographic Details
Main Authors: Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10144741/
_version_ 1797743005881335808
author Mirco Mannino
Biagio Peccerillo
Andrea Mondelli
Sandro Bartolini
author_facet Mirco Mannino
Biagio Peccerillo
Andrea Mondelli
Sandro Bartolini
author_sort Mirco Mannino
collection DOAJ
description Nowadays, convolutional neural networks are among the most widely used types of deep learning networks, thanks to their usefulness in many application domains. There are many efforts to find methods to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this kind of approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it does not require extra memory. In this work, we evaluate on various multi-core CPUs the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to a $1.67\times$ speedup), matrix-matrix multiplication-based convolution on a multi-core system.
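The "series of nested loops iterating over the tensor dimensions" that the abstract describes can be illustrated as follows. This is a minimal sketch under assumptions of our own, not the paper's implementation: the function name `direct_conv2d`, a [channels][rows][cols] layout, one particular loop order, stride 1, and no padding. The paper's whole point is that the loop organization, parallelization, and vectorization of exactly this kernel are tunable per layer; the sketch only shows the baseline structure and the absence of auxiliary buffers.

```python
# Hypothetical sketch of direct convolution as nested loops (pure Python,
# stride 1, no padding). Input shape: [C_in][H][W]; weights shape:
# [C_out][C_in][KH][KW]. No im2col buffer is materialized: the output is
# accumulated straight from the input and weight tensors.
def direct_conv2d(inp, weights):
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out = len(weights)
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    oh, ow = h - kh + 1, w - kw + 1  # "valid" output size

    out = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co in range(c_out):              # output channels
        for oy in range(oh):             # output rows
            for ox in range(ow):         # output cols
                acc = 0.0
                for ci in range(c_in):           # input channels
                    for ky in range(kh):         # kernel rows
                        for kx in range(kw):     # kernel cols
                            acc += (inp[ci][oy + ky][ox + kx]
                                    * weights[co][ci][ky][kx])
                out[co][oy][ox] = acc
    return out


if __name__ == "__main__":
    img = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # 1 channel, 3x3
    k = [[[[1, 1], [1, 1]]]]                   # 1 filter, 1 channel, 2x2
    print(direct_conv2d(img, k))               # [[[12.0, 16.0], [24.0, 28.0]]]
```

Any permutation of the six loops computes the same result, which is what makes loop organization a free tuning parameter: the chosen order determines memory-access locality and which loop is profitable to parallelize or SIMD-vectorize.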
first_indexed 2024-03-12T14:48:22Z
format Article
id doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-12T14:48:22Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3 | 2023-08-15T23:00:21Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2023-01-01 | Vol. 11, pp. 57514-57528 | DOI 10.1109/ACCESS.2023.3283312 | Document 10144741 | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
Mirco Mannino (https://orcid.org/0000-0003-1660-3984), Department of Information Engineering and Mathematics, University of Siena, Siena, Italy
Biagio Peccerillo (https://orcid.org/0000-0002-4998-0092), Department of Information Engineering and Mathematics, University of Siena, Siena, Italy
Andrea Mondelli, Huawei Technologies Company Ltd., Shenzhen, China
Sandro Bartolini (https://orcid.org/0000-0002-7975-3632), Department of Information Engineering and Mathematics, University of Siena, Siena, Italy
https://ieeexplore.ieee.org/document/10144741/
Keywords: Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation
spellingShingle Mirco Mannino
Biagio Peccerillo
Andrea Mondelli
Sandro Bartolini
Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
IEEE Access
Convolutional neural networks
direct convolution
multi-core
multi-threading
performance evaluation
title Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
title_full Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
title_fullStr Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
title_full_unstemmed Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
title_short Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
title_sort analysis and optimization of direct convolution execution on multi core processors
topic Convolutional neural networks
direct convolution
multi-core
multi-threading
performance evaluation
url https://ieeexplore.ieee.org/document/10144741/
work_keys_str_mv AT mircomannino analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors
AT biagiopeccerillo analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors
AT andreamondelli analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors
AT sandrobartolini analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors