Improving the Performance of Parallel Loops in OpenCilk
For good performance, parallel loop scheduling must achieve low scheduling overheads and multidimensional locality in nested loops. This thesis explores both challenges and contributes an extension to randomized work-stealing for first-class loop support that reduces scheduling overheads.
Main Author: | Govedic, Luka |
---|---|
Other Authors: | Schardl, Tao B. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/151273 |
author | Govedic, Luka |
author2 | Schardl, Tao B. |
collection | MIT |
description | For good performance, parallel loop scheduling must achieve low scheduling overheads and multidimensional locality in nested loops. This thesis explores both challenges and contributes an extension to randomized work-stealing for first-class loop support that reduces scheduling overheads.
Randomized work-stealing schedulers traditionally execute parallel-for loops using parallel divide-and-conquer recursion, which is theoretically efficient and scalable but can incur substantial overheads in practice. This thesis extends randomized work-stealing with a custom work-stealing protocol called on-the-fly loop splitting. I introduce loop frames to make work stealing on parallel-for loops more efficient and flexible.
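The traditional divide-and-conquer scheme can be sketched as follows. This is an illustrative simplification, not OpenCilk's implementation: the recursion is shown as plain C, with the points where a Cilk program would use `cilk_spawn`/`cilk_sync` marked in comments, and the names `divide_and_conquer_for` and `loop_body_fn` are mine.

```c
#include <stddef.h>

/* Illustrative sketch (not OpenCilk's code) of how work-stealing runtimes
 * traditionally execute a parallel-for: recursively halve the iteration
 * range until it falls below a grain size, then run iterations serially.
 * The recursive calls are where a Cilk program would cilk_spawn. */
typedef void (*loop_body_fn)(size_t i, void *ctx);

static void divide_and_conquer_for(size_t lo, size_t hi, size_t grain,
                                   loop_body_fn body, void *ctx) {
    if (hi - lo <= grain) {
        for (size_t i = lo; i < hi; i++)
            body(i, ctx);                 /* base case: run serially */
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    /* cilk_spawn */ divide_and_conquer_for(lo, mid, grain, body, ctx);
    divide_and_conquer_for(mid, hi, grain, body, ctx);
    /* cilk_sync */
}

/* example body: increment element i of an int array */
static void add_one(size_t i, void *ctx) { ((int *)ctx)[i] += 1; }
```

Note that the partition of iterations is fixed by the recursion tree: every steal can only take one of these pre-determined halves, which is the rigidity that the overhead problem stems from.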
Loop frames make two key changes to work stealing for parallel-for loops. First, loop frames extend work stealing by directly encoding information about intervals of loop iterations in the runtime. Loop frames add first-class support to work stealing for parallel-for loops that composes with classical randomized work stealing. Second, loop frames allow intervals of loop iterations to be split on-the-fly, such that worker threads attempt to steal half of the unexecuted loop iterations rather than a deterministically constructed partition of loop iterations. On-the-fly loop splitting allows for more flexible dynamic load balancing of loop iterations while keeping the work overheads low and maintaining the theoretical efficiency of divide-and-conquer.
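A minimal sketch of the steal-half idea follows. The names `loop_frame` and `try_steal_half` are illustrative, not OpenCilk's API, and the victim/thief synchronization that the real runtime performs is omitted here.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical sketch of on-the-fly loop splitting.  A loop frame tracks
 * the remaining iterations [next, end); a thief claims the upper half of
 * whatever is still unexecuted, rather than a pre-built recursive
 * partition.  Synchronization between victim and thief is omitted. */
typedef struct {
    size_t next;  /* first unexecuted iteration (victim advances this) */
    size_t end;   /* one past the last iteration owned by this frame */
} loop_frame;

/* Thief side: split the remaining range in half, taking [mid, end). */
static bool try_steal_half(loop_frame *victim, loop_frame *thief) {
    size_t remaining = victim->end - victim->next;
    if (remaining < 2)
        return false;                 /* too little work left to split */
    size_t mid = victim->next + remaining / 2;
    thief->next = mid;
    thief->end  = victim->end;
    victim->end = mid;                /* victim keeps the lower half */
    return true;
}
```

Because the split point depends on how many iterations the victim has already executed, the resulting partition adapts to the actual load imbalance at the moment of the steal.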
I evaluate loop frames in practice by implementing loop frames in the OpenCilk runtime system. In particular, loop frames augment the THE protocol from Cilk to coordinate updates to loop frames. I observe that loop frames and on-the-fly loop splitting incur substantially less overhead than the divide-and-conquer algorithm without sacrificing parallel scalability.
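For context, the THE protocol resolves victim/thief races on a worker's deque with optimistic updates to the tail and head indices. A heavily simplified, single-threaded sketch of the worker's pop path is shown below; the lock-based conflict recovery that real Cilk runtimes use is omitted, and this is not OpenCilk's actual code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Heavily simplified sketch of the worker-side pop in a THE-style deque
 * protocol.  Entries live at indices [head, tail); the mutex fallback
 * taken on a real conflict is omitted. */
typedef struct {
    _Atomic long head;  /* thieves steal from here */
    _Atomic long tail;  /* the owning worker pushes and pops here */
} deque;

static bool worker_pop(deque *d) {
    long t = atomic_fetch_sub(&d->tail, 1) - 1;   /* optimistic decrement */
    if (atomic_load(&d->head) <= t)
        return true;              /* no conflict: popped the entry at t */
    /* possible race with a thief: undo; real code retries under a lock */
    atomic_store(&d->tail, t + 1);
    return false;
}
```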
Finally, I study the impact of increasing locality in more than one dimension in nested-loop applications. Results show that both cache-aware and cache-oblivious reordering of nested loop iterations can yield performance improvements of up to a factor of 1.7×. |
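As a concrete illustration of the kind of cache-aware reordering of nested loop iterations referred to above (the example and tile size are mine, not the thesis's benchmarks), a blocked matrix transpose visits B×B tiles so that both arrays are accessed with spatial locality:

```c
#include <stddef.h>

/* Cache-aware reordering of a doubly nested loop: a blocked (tiled)
 * transpose.  The row-by-row version walks `out` with stride n on every
 * iteration; the tiled version confines the working set to one BxB tile
 * of each array at a time.  B = 32 is an illustrative choice. */
enum { B = 32 };

static void transpose_tiled(const double *in, double *out, size_t n) {
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    out[j * n + i] = in[i * n + j];
}
```

A cache-oblivious variant would instead recursively quarter the index space until the subproblem fits in cache, achieving a similar effect without a machine-specific tile size.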
id | mit-1721.1/151273 |
department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
degree | M.Eng. |
date issued | 2023-06 |
rights | In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/ |
title | Improving the Performance of Parallel Loops in OpenCilk |