Improving the Performance of Parallel Loops in OpenCilk

For good performance, parallel loop scheduling must achieve both low scheduling overhead and multidimensional locality in nested loops. This thesis explores both challenges and contributes an extension to randomized work stealing that provides first-class support for loops and reduces scheduling overhead.

Randomized work-stealing schedulers traditionally execute parallel-for loops using parallel divide-and-conquer recursion, which is theoretically efficient and scalable but can incur substantial overhead in practice.
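As context for that overhead, here is a minimal sketch of how a parallel-for loop is conventionally lowered into divide-and-conquer recursion under a work-stealing scheduler in the Cilk style; the helper name and grain size are illustrative choices of mine, not taken from the thesis or from OpenCilk's actual lowering.

```c
#include <cilk/cilk.h>

// Grain size below which iterations run serially (illustrative value).
#define GRAIN 2048

// Divide-and-conquer execution of the iteration range [lo, hi): spawn
// the left half (which an idle worker may steal) and continue with the
// right half, until the range is small enough to run as a plain loop.
static void loop_dac(long lo, long hi, void (*body)(long)) {
    while (hi - lo > GRAIN) {
        long mid = lo + (hi - lo) / 2;
        cilk_spawn loop_dac(lo, mid, body);
        lo = mid;
    }
    for (long i = lo; i < hi; i++)
        body(i);
    cilk_sync;  // wait for all spawned halves before returning
}
```

Each spawn in this recursion carries bookkeeping cost even when no steal occurs, which is the practical overhead the thesis targets.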

This thesis extends randomized work stealing with a custom work-stealing protocol called on-the-fly loop splitting, built around a new runtime structure, the loop frame, that makes work stealing on parallel-for loops more efficient and flexible. Loop frames make two key changes to work stealing for parallel-for loops. First, they encode information about intervals of loop iterations directly in the runtime, adding first-class support for parallel-for loops that composes with classical randomized work stealing. Second, they allow intervals of loop iterations to be split on the fly, so that a thief attempts to steal half of the unexecuted iterations rather than a deterministically constructed partition of the iteration space. On-the-fly loop splitting enables more flexible dynamic load balancing of loop iterations while keeping work overhead low and maintaining the theoretical efficiency of divide-and-conquer.
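The following is a simplified sketch, under assumptions of my own, of the general idea of splitting the unexecuted iterations of a loop in half when a steal occurs. The LoopRange structure, its fields, and the mutex-based protocol are invented here for exposition; they are not the thesis's loop frames and do not reproduce the THE-based protocol used in OpenCilk.

```c
#include <stdbool.h>
#include <pthread.h>

// Hypothetical shared loop descriptor: the owning worker executes
// iterations [next, end) one at a time, and a thief may split off the
// upper half of whatever has not been executed yet.
typedef struct {
    long next;             // next iteration the owning worker will run
    long end;              // one past the last iteration of this range
    pthread_mutex_t lock;  // stands in for the runtime's steal protocol
} LoopRange;

// Owner side: claim one iteration at a time.
static bool claim_iteration(LoopRange *r, long *i) {
    pthread_mutex_lock(&r->lock);
    bool ok = r->next < r->end;
    if (ok)
        *i = r->next++;
    pthread_mutex_unlock(&r->lock);
    return ok;
}

// Thief side: split off the upper half of the unexecuted iterations.
static bool steal_half(LoopRange *victim, long *lo, long *hi) {
    pthread_mutex_lock(&victim->lock);
    long remaining = victim->end - victim->next;
    bool ok = remaining > 1;
    if (ok) {
        long mid = victim->next + remaining / 2;
        *lo = mid;          // thief takes [mid, end)
        *hi = victim->end;
        victim->end = mid;  // victim keeps [next, mid)
    }
    pthread_mutex_unlock(&victim->lock);
    return ok;
}
```

In the thesis, updates to loop frames are instead coordinated by augmenting Cilk's THE protocol rather than by a per-range lock as above.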

I evaluate loop frames in practice by implementing them in the OpenCilk runtime system; in particular, loop frames augment the THE protocol from Cilk to coordinate updates to loop frames. I observe that loop frames and on-the-fly loop splitting incur substantially less overhead than the divide-and-conquer algorithm without sacrificing parallel scalability. Finally, I study the impact of increased locality in more than one dimension in nested-loop applications and find that both cache-aware and cache-oblivious reordering of nested loop iterations can yield performance benefits of up to 1.7×.
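To make the locality study concrete, here is a minimal sketch, again under assumptions of my own, of a cache-oblivious reordering of a doubly nested loop: the two-dimensional iteration space is traversed by recursively halving its larger dimension, so iterations that are close in (i, j) are also close in time at every scale. The visit body and the base-case size are placeholders, not benchmarks or parameters from the thesis.

```c
// Stand-in loop body; a real application would touch array data here.
static void visit(long i, long j) {
    (void)i; (void)j;
}

// Block size at which recursion stops (illustrative value).
#define BASE 64

// Cache-oblivious traversal of the iteration space [i0,i1) x [j0,j1).
// walk2d(0, m, 0, n) visits the same (i, j) pairs as the row-by-row
// doubly nested loop, but in a recursively blocked order.
static void walk2d(long i0, long i1, long j0, long j1) {
    long di = i1 - i0, dj = j1 - j0;
    if (di <= BASE && dj <= BASE) {
        for (long i = i0; i < i1; i++)   // small block: plain loops
            for (long j = j0; j < j1; j++)
                visit(i, j);
    } else if (di >= dj) {
        long mi = i0 + di / 2;           // split the i dimension
        walk2d(i0, mi, j0, j1);
        walk2d(mi, i1, j0, j1);
    } else {
        long mj = j0 + dj / 2;           // split the j dimension
        walk2d(i0, i1, j0, mj);
        walk2d(i0, i1, mj, j1);
    }
}
```
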
Bibliographic Details
Main Author: Govedic, Luka
Other Authors: Schardl, Tao B.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/151273