Software prefetching for unstructured mesh applications

Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access pat...

Bibliographic details
Main authors:	Hadade, I; Jones, T; Wang, F; Di Mare, L
Format:	Conference item
Published:	Association for Computing Machinery 2019

_version_	1826278690092220416
author	Hadade, I; Jones, T; Wang, F; Di Mare, L
author_sort	Hadade, I
collection	OXFORD
description	Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access patterns that are often more difficult to identify in hardware. An alternative for such workloads is software prefetching, where special non-blocking instructions load data into the cache hierarchy. However, there are currently few examples in the literature of how to incorporate such software prefetches into existing applications with positive results. This paper addresses these issues by demonstrating the utility and implementation of software prefetching in an unstructured finite volume CFD code whose size and complexity are representative of an industrial application, across a number of processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect-access patterns and show how to integrate them on top of existing optimisations such as vectorisation. Together, these techniques yield significant full-application speed-ups on a range of processors: 15% on the Intel Xeon Skylake CPU, 1.99× on the in-order Intel Xeon Phi Knights Corner architecture, and 33% on the out-of-order Knights Landing many-core processor.
first_indexed	2024-03-06T23:47:42Z
format	Conference item
id	oxford-uuid:71833d85-072f-47ae-a2bc-822aa4aae83b
institution	University of Oxford
last_indexed	2024-03-06T23:47:42Z
publishDate	2019
publisher	Association for Computing Machinery
record_format	dspace
title	Software prefetching for unstructured mesh applications
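
Illustrative sketch (not taken from the paper): the description in this record refers to software prefetching in kernels with indirect-access patterns and to a tunable prefetch distance. The C fragment below shows what that idea can look like in a simple gather loop. It assumes a GCC- or Clang-compatible compiler, which provides the __builtin_prefetch builtin. The function name gather_sum and the macro PREFETCH_DISTANCE are hypothetical names chosen for this example, and the default distance of 8 is an arbitrary placeholder rather than a tuned or published value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
   The value 8 is a placeholder, not a measured optimum. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 8
#endif

/* Sum values gathered through an index array. This mimics the indirect
   access values[index[i]] found in loops over unstructured mesh
   connectivity, where the next address is not a fixed stride away. */
static double gather_sum(const double *values, const int *index, size_t n)
{
    double sum = 0.0;

    assert(values != NULL && index != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Stay inside the index array when looking ahead. */
        if (i + PREFETCH_DISTANCE < n) {
            /* Non-blocking hint for the element needed PREFETCH_DISTANCE
               iterations from now. Second argument 0 = prefetch for read;
               third argument 3 = highest temporal-locality hint. Both
               must be compile-time constants. */
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += values[index[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result, so the loop returns the same sum with or without it. Because the distance is a preprocessor macro, a tuning script could rebuild the kernel with different values (for example by passing -DPREFETCH_DISTANCE=16 to the compiler) and time each variant, which is one simple way to approximate the auto-tuning the description mentions. The third argument of __builtin_prefetch is a locality hint from 0 to 3; compilers typically map it to different prefetch instruction variants, which is loosely related to the choice of prefetch destination across cache levels.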
