Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study †

Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critica...

Bibliographic Details
Main Authors:	Abdullah Al Hasib, Lasse Natvig, Per Gunnar Kjeldsberg, Juan M. Cebrián
Format:	Article
Language:	English
Published:	MDPI AG 2017-02-01
Series:	Journal of Low Power Electronics and Applications
Subjects:	performance; energy efficiency; data reuse transformation methodology; vectorization; parallel programming; Xeon Phi processor; KNL; SIMD
Online Access:	http://www.mdpi.com/2079-9268/7/1/5

collection	DOAJ
description	Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel® Core™ i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel® Xeon Phi™ many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction sets. Results using a single-threaded execution on the Haswell and Sandy Bridge systems show that performance and EDP (Energy Delay Product) can be improved through data reuse transformations on the scalar code by a factor of ≈3× and ≈6×, respectively. Compared to scalar code without data reuse optimization, the SSE/AVX2 version achieves ≈10×/17× better performance and ≈92×/307× better EDP, respectively. These results can be improved by 10% to 15% using data reuse techniques. Finally, the most optimized version using data reuse and AVX512 achieves a speedup of ≈35× and an EDP improvement of ≈1192× on the Xeon Phi system. While single-threaded execution serves as a common reference point for all architectures to analyze the effects of data reuse on both scalar and vector codes, scalability with thread count is also discussed in the paper.
issn	2079-9268
doi	10.3390/jlpea7010005
affiliation	Abdullah Al Hasib, Lasse Natvig: Department of Computer Science, Norwegian University of Science and Technology (NTNU), Trondheim NO-7491, Norway
affiliation	Per Gunnar Kjeldsberg: Department of Electronic Systems, Norwegian University of Science and Technology (NTNU), Trondheim NO-7491, Norway
affiliation	Juan M. Cebrián: Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain

Similar Items