Software prefetching for unstructured mesh applications

Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access pat...
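As an illustration of the kind of software prefetch the title refers to, the following is a minimal sketch and not the authors' code. It assumes a GCC- or Clang-compatible compiler, which provides `__builtin_prefetch`. The kernel `gather_sum`, its arrays `node_val` and `edge_node`, and the `dist` parameter (how many iterations ahead to prefetch) are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical edge-based kernel: sums node values gathered through an
   index array, the sort of indirect access an unstructured mesh produces.
   `dist` is the prefetch distance in loop iterations (an assumption of
   this sketch; a suitable value would have to be measured). */
double gather_sum(const double *node_val, const int *edge_node,
                  size_t n_edges, size_t dist)
{
    double sum = 0.0;
    for (size_t e = 0; e < n_edges; ++e) {
        if (e + dist < n_edges) {
            /* Non-blocking hint: second argument 0 = read access,
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&node_val[edge_node[e + dist]], 0, 3);
        }
        sum += node_val[edge_node[e]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is identical for any `dist`; only the timing changes. The third argument of `__builtin_prefetch` must be a compile-time constant from 0 to 3, and how it maps to a particular cache level depends on the compiler and target.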

Bibliographic details
Main authors: Hadade, I; Jones, T; Wang, F; Di Mare, L
Format: Conference item
Published: Association for Computing Machinery, 2019
author Hadade, I
Jones, T
Wang, F
Di Mare, L
collection OXFORD
description Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access patterns that are often more difficult to identify in hardware. An alternative for such workloads is software prefetching, where special non-blocking instructions load data into the cache hierarchy. However, there are currently few examples in the literature on how to incorporate such software prefetches into existing applications with positive results. This paper addresses these issues by demonstrating the utility and implementation of software prefetching in an unstructured finite volume CFD code of representative size and complexity to an industrial application and across a number of processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect-access patterns and show how to integrate them on top of existing optimisations such as vectorisation. Through this we show significant full application speed-ups on a range of processors, such as the Intel Xeon Skylake CPU (15%) as well as on the in-order Intel Xeon Phi Knights Corner (1.99×) architecture and the out-of-order Knights Landing (33%) many-core processor.
first_indexed 2024-03-06T23:47:42Z
format Conference item
id oxford-uuid:71833d85-072f-47ae-a2bc-822aa4aae83b
institution University of Oxford
last_indexed 2024-03-06T23:47:42Z
publishDate 2019
publisher Association for Computing Machinery
record_format dspace
spelling oxford-uuid:71833d85-072f-47ae-a2bc-822aa4aae83b; 2022-03-26T19:44:07Z; Software prefetching for unstructured mesh applications; Conference item; http://purl.org/coar/resource_type/c_5794; uuid:71833d85-072f-47ae-a2bc-822aa4aae83b; Symplectic Elements at Oxford; Association for Computing Machinery; 2019; Hadade, I; Jones, T; Wang, F; Di Mare, L
title Software prefetching for unstructured mesh applications