Loop tiling in large-scale stencil codes at run-time with OPS

The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers...

Bibliographic Details
Main Authors:	Reguly, I; Mudalige, G; Giles, M
Format:	Journal article
Published:	Institute of Electrical and Electronics Engineers 2017

author	Reguly, I; Mudalige, G; Giles, M
author_sort	Reguly, I
collection	OXFORD
description	The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data-locality-improving optimisation called iteration space slicing for use in large OPS applications on both shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of $2\times$ on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, $3.5\times$ on the linear solver TeaLeaf, and $1.7\times$ on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16 GB, and we perform scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
first_indexed	2024-03-07T02:00:01Z
format	Journal article
id	oxford-uuid:9d0becfd-1de2-4713-b7e0-2ab70d01126f
institution	University of Oxford
last_indexed	2024-03-07T02:00:01Z
publishDate	2017
publisher	Institute of Electrical and Electronics Engineers
record_format	dspace
title	Loop tiling in large-scale stencil codes at run-time with OPS

Similar Items