Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
Nowadays, convolutional neural networks are among the most widely used types of deep learning networks, thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it requires no extra memory. In this work, we evaluate, on various multi-core CPUs, the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67× speedup), matrix-matrix multiplication-based convolution on a multi-core system.
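The abstract describes direct convolution as a series of nested loops iterating over the tensor dimensions. A minimal sketch of that loop nest, in plain Python with nested lists standing in for tensors (stride 1, no padding, channels-first layout; this is an illustration, not code from the paper, and all names are chosen here):

```python
# Direct convolution as nested loops: no flattening step, no extra memory
# beyond the output tensor. Input layout is (C, H, W); the kernel is
# (M output channels, C input channels, KH, KW). Stride 1, no padding.
def direct_conv2d(inp, kernel):
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, KH, KW = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):                # output channels
        for oh in range(OH):          # output rows
            for ow in range(OW):      # output columns
                acc = 0.0
                for c in range(C):            # input channels
                    for kh in range(KH):      # kernel rows
                        for kw in range(KW):  # kernel columns
                            acc += inp[c][oh + kh][ow + kw] * kernel[m][c][kh][kw]
                out[m][oh][ow] = acc
    return out
```

The order of these loops is precisely one of the parameters the paper tunes: reordering them changes memory-access locality and which loop a compiler can SIMD-vectorize, without changing the result.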
| Main Authors: | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2023-01-01 |
| Series: | IEEE Access |
| Subjects: | Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
| Online Access: | https://ieeexplore.ieee.org/document/10144741/ |
_version_ | 1797743005881335808 |
author | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini
author_facet | Mirco Mannino, Biagio Peccerillo, Andrea Mondelli, Sandro Bartolini
author_sort | Mirco Mannino |
collection | DOAJ |
description | Nowadays, convolutional neural networks are among the most widely used types of deep learning networks, thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach uses extra time and memory to transform and store the tensors involved. For this reason, <italic>direct convolution</italic> is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it requires no extra memory. In this work, we evaluate, on various multi-core CPUs, the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to <inline-formula> <tex-math notation="LaTeX">$1.67\times $ </tex-math></inline-formula> speedup), matrix-matrix multiplication-based convolution on a multi-core system. |
first_indexed | 2024-03-12T14:48:22Z |
format | Article |
id | doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T14:48:22Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-f70d874594ac4a1d8f71d34fb6d3d5c3; 2023-08-15T23:00:21Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 57514-57528; 10.1109/ACCESS.2023.3283312; 10144741; Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors; Mirco Mannino (https://orcid.org/0000-0003-1660-3984), Biagio Peccerillo (https://orcid.org/0000-0002-4998-0092), Andrea Mondelli, Sandro Bartolini (https://orcid.org/0000-0002-7975-3632); Department of Information Engineering and Mathematics, University of Siena, Siena, Italy (Mannino, Peccerillo, Bartolini); Huawei Technologies Company Ltd., Shenzhen, China (Mondelli); abstract as in the description field; https://ieeexplore.ieee.org/document/10144741/; Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
spellingShingle | Mirco Mannino; Biagio Peccerillo; Andrea Mondelli; Sandro Bartolini; Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors; IEEE Access; Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
title | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_full | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_fullStr | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_full_unstemmed | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_short | Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors |
title_sort | analysis and optimization of direct convolution execution on multi core processors |
topic | Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation |
url | https://ieeexplore.ieee.org/document/10144741/ |
work_keys_str_mv | AT mircomannino analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT biagiopeccerillo analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT andreamondelli analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors AT sandrobartolini analysisandoptimizationofdirectconvolutionexecutiononmulticoreprocessors |
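For contrast with the direct method, the matrix-multiplication-based approach that the description field mentions (flatten the tensors into 2D matrices, then perform one matrix-matrix product) can be sketched as follows. This is an illustration under common "im2col" assumptions (stride 1, no padding, channels-first layout), not code from the paper; the plain-Python triple loop stands in for the optimized GEMM routine a real library would call:

```python
# im2col-style convolution: copy every receptive field into a column of a
# (C*KH*KW) x (OH*OW) patch matrix, flatten the kernel to M x (C*KH*KW),
# and compute all outputs with one matrix-matrix multiplication. Building
# the patch matrix is the "extra time and memory" cost the abstract notes.
def im2col_conv2d(inp, kernel):
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, KH, KW = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    # Patch matrix: rows indexed by (c, kh, kw), columns by (oh, ow).
    cols = [[inp[c][oh + kh][ow + kw]
             for oh in range(OH) for ow in range(OW)]
            for c in range(C) for kh in range(KH) for kw in range(KW)]
    # Kernel flattened to M x (C*KH*KW), matching the row order of cols.
    kmat = [[kernel[m][c][kh][kw]
             for c in range(C) for kh in range(KH) for kw in range(KW)]
            for m in range(M)]
    # Plain matrix product; real implementations call a tuned GEMM here.
    flat = [[sum(kmat[m][k] * cols[k][p] for k in range(len(cols)))
             for p in range(OH * OW)] for m in range(M)]
    # Reshape M x (OH*OW) back to M x OH x OW.
    return [[[flat[m][oh * OW + ow] for ow in range(OW)] for oh in range(OH)]
            for m in range(M)]
```

Both approaches compute the same outputs; the trade-off the article studies is that the direct loop nest avoids the patch-matrix copy at the price of needing careful loop ordering and vectorization to match GEMM's cache behavior.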