Software prefetching for unstructured mesh applications
Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access pat...
Main authors: | Hadade, I; Jones, T; Wang, F; Di Mare, L |
---|---|
Format: | Conference item |
Published: | Association for Computing Machinery, 2019 |
_version_ | 1826278690092220416 |
---|---|
author | Hadade, I; Jones, T; Wang, F; Di Mare, L |
author_facet | Hadade, I; Jones, T; Wang, F; Di Mare, L |
author_sort | Hadade, I |
collection | OXFORD |
description | Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access patterns that are often more difficult to identify in hardware. An alternative for such workloads is software prefetching, where special non-blocking instructions load data into the cache hierarchy. However, there are currently few examples in the literature on how to incorporate such software prefetches into existing applications with positive results. This paper addresses these issues by demonstrating the utility and implementation of software prefetching in an unstructured finite volume CFD code of representative size and complexity to an industrial application and across a number of processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect-access patterns and show how to integrate them on top of existing optimisations such as vectorisation. Through this we show significant full application speed-ups on a range of processors, such as the Intel Xeon Skylake CPU (15%) as well as on the in-order Intel Xeon Phi Knights Corner (1.99×) architecture and the out-of-order Knights Landing (33%) many-core processor. |
first_indexed | 2024-03-06T23:47:42Z |
format | Conference item |
id | oxford-uuid:71833d85-072f-47ae-a2bc-822aa4aae83b |
institution | University of Oxford |
last_indexed | 2024-03-06T23:47:42Z |
publishDate | 2019 |
publisher | Association for Computing Machinery |
record_format | dspace |
spelling | oxford-uuid:71833d85-072f-47ae-a2bc-822aa4aae83b; 2022-03-26T19:44:07Z; Software prefetching for unstructured mesh applications; Conference item; http://purl.org/coar/resource_type/c_5794; uuid:71833d85-072f-47ae-a2bc-822aa4aae83b; Symplectic Elements at Oxford; Association for Computing Machinery; 2019; Hadade, I; Jones, T; Wang, F; Di Mare, L |
title | Software prefetching for unstructured mesh applications |
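
The description in this record refers to software prefetching in kernels with indirect access patterns and to a tunable prefetch distance. As a rough illustration only, and not the authors' implementation, the idea can be sketched in C as follows. The function name `gather_sum`, the array names, and the `PREFETCH_DISTANCE` value of 16 are hypothetical; `__builtin_prefetch` is the GCC/Clang builtin, and its third argument is a locality hint that loosely corresponds to choosing how close to the core the data should be kept.

```c
#include <stddef.h>

/* Hypothetical tuning parameter: how many iterations ahead to prefetch.
 * The abstract suggests the best value differs per kernel and per
 * architecture, so it would normally be found by auto-tuning. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 16
#endif

/* Sum node values through an edge-to-node index array.
 * values[edge_to_node[e]] is an indirect access that a hardware
 * prefetcher may fail to predict. */
double gather_sum(const double *values, const int *edge_to_node,
                  size_t n_edges)
{
    double sum = 0.0;
    for (size_t e = 0; e < n_edges; ++e) {
        if (e + PREFETCH_DISTANCE < n_edges) {
            /* Non-blocking hint: request the value needed
             * PREFETCH_DISTANCE iterations from now.
             * Second argument 0 = prefetch for reading.
             * Third argument 3 = high temporal locality (keep the
             * line in all cache levels where possible). */
            __builtin_prefetch(
                &values[edge_to_node[e + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += values[edge_to_node[e]];
    }
    return sum;
}
```

The index array itself is read sequentially, so it is left to the hardware prefetcher here; only the indirectly addressed `values` entries receive an explicit hint. The result is identical with or without the prefetch, since the hint affects timing only.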