Vectorizing unstructured mesh computations for many-core architectures

Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular ap...


Bibliographic Details
Main Authors:	Giles, M; Mudalige, G
Format:	Journal article
Published:	John Wiley and Sons, Ltd 2015

_version_	1797103428828135424
author	Giles, M; Mudalige, G
author_facet	Giles, M; Mudalige, G
author_sort	Giles, M
collection	OXFORD
description	Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon-Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon-Phi does not provide good performance for these applications but is still comparable with a pair of mid-range Xeon chips.
first_indexed	2024-03-07T06:19:58Z
format	Journal article
id	oxford-uuid:f2666c39-71fa-4b66-bac7-46e1414bd572
institution	University of Oxford
last_indexed	2024-03-07T06:19:58Z
publishDate	2015
publisher	John Wiley and Sons, Ltd
record_format	dspace
spelling	oxford-uuid:f2666c39-71fa-4b66-bac7-46e1414bd572; 2022-03-27T12:03:21Z; Vectorizing unstructured mesh computations for many-core architectures; Journal article; http://purl.org/coar/resource_type/c_dcae04bc; uuid:f2666c39-71fa-4b66-bac7-46e1414bd572; Symplectic Elements at Oxford; John Wiley and Sons, Ltd; 2015; Giles, M; Mudalige, G
spellingShingle	Giles, M Mudalige, G Vectorizing unstructured mesh computations for many-core architectures
title	Vectorizing unstructured mesh computations for many-core architectures
title_sort	vectorizing unstructured mesh computations for many core architectures
work_keys_str_mv	AT gilesm vectorizingunstructuredmeshcomputationsformanycorearchitectures AT mudaligeg vectorizingunstructuredmeshcomputationsformanycorearchitectures
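
The description above mentions that irregular memory accesses and race conditions have to be organized carefully when an unstructured mesh loop is vectorized. The following is a minimal sketch of one common way to do that (edge coloring followed by a gather, compute, scatter pattern), written in plain Python for readability. It is not the OP2 API and is not taken from the paper; the function names, the `width` parameter standing in for the vector length, and the simple difference "flux" are all assumptions made for illustration.

```python
def color_edges(edges):
    """Greedy coloring: two edges that share a node never get the same color.

    Illustrative helper (hypothetical name); assumes no edge joins a node to itself.
    """
    node_colors = {}  # node index -> set of colors already used by its edges
    colors = []
    for a, b in edges:
        used = node_colors.setdefault(a, set()) | node_colors.setdefault(b, set())
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        node_colors[a].add(c)
        node_colors[b].add(c)
    return colors


def edge_loop_serial(edges, node_val):
    """Reference loop: each edge increments its two end nodes indirectly."""
    res = [0.0] * len(node_val)
    for a, b in edges:
        flux = node_val[b] - node_val[a]  # placeholder kernel, an assumption
        res[a] += flux
        res[b] -= flux
    return res


def edge_loop_colored(edges, node_val, width=4):
    """Same loop, processed in lane groups of `width` edges per color.

    Within one color no two edges touch the same node, so the scatter step
    could be performed by independent vector lanes without a race.
    """
    colors = color_edges(edges)
    res = [0.0] * len(node_val)
    for c in range(max(colors, default=-1) + 1):
        batch = [e for e, col in zip(edges, colors) if col == c]
        for start in range(0, len(batch), width):
            lanes = batch[start:start + width]
            # Gather: indirect reads into contiguous "vector registers".
            left = [node_val[a] for a, _ in lanes]
            right = [node_val[b] for _, b in lanes]
            # Compute: the same operation applied to every lane.
            flux = [r - l for l, r in zip(left, right)]
            # Scatter: indirect increments, conflict-free within a color.
            for (a, b), f in zip(lanes, flux):
                res[a] += f
                res[b] -= f
    return res
```

For example, on a mesh of two triangles sharing an edge, `edge_loop_colored([(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)], [1.0, 2.0, 4.0, 8.0])` should produce the same node increments as `edge_loop_serial` with the same arguments. In a real SIMD implementation the gather, compute, and scatter steps would map to vector instructions rather than Python lists.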


Similar Items