Improving the Performance of Parallel Loops in OpenCilk

For good performance, parallel loop scheduling must achieve both low scheduling overhead and multidimensional locality in nested loops. This thesis explores both challenges and contributes an extension to randomized work stealing that provides first-class support for loops and reduces scheduling overhead.

Randomized work-stealing schedulers traditionally execute parallel-for loops using parallel divide-and-conquer recursion, which is theoretically efficient and scalable but can incur substantial overhead in practice.
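As context for that overhead, here is a minimal sketch of how a parallel-for loop is conventionally lowered into divide-and-conquer recursion under a work-stealing scheduler in the Cilk style; the helper name and grain size are illustrative choices of mine, not taken from the thesis or from OpenCilk's actual lowering.

```c
#include <cilk/cilk.h>

// Grain size below which iterations run serially (illustrative value).
#define GRAIN 2048

// Divide-and-conquer execution of the iteration range [lo, hi): spawn
// the left half (which an idle worker may steal) and continue with the
// right half, until the range is small enough to run as a plain loop.
static void loop_dac(long lo, long hi, void (*body)(long)) {
    while (hi - lo > GRAIN) {
        long mid = lo + (hi - lo) / 2;
        cilk_spawn loop_dac(lo, mid, body);
        lo = mid;
    }
    for (long i = lo; i < hi; i++)
        body(i);
    cilk_sync;  // wait for all spawned halves before returning
}
```

Each spawn in this recursion carries bookkeeping cost even when no steal occurs, which is the practical overhead the thesis targets.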

This thesis extends randomized work stealing with a custom work-stealing protocol called on-the-fly loop splitting, built around a new runtime structure, the loop frame, that makes work stealing on parallel-for loops more efficient and flexible. Loop frames make two key changes to work stealing for parallel-for loops. First, they encode information about intervals of loop iterations directly in the runtime, adding first-class support for parallel-for loops that composes with classical randomized work stealing. Second, they allow intervals of loop iterations to be split on the fly, so that a thief attempts to steal half of the unexecuted iterations rather than a deterministically constructed partition of the iteration space. On-the-fly loop splitting enables more flexible dynamic load balancing of loop iterations while keeping work overhead low and maintaining the theoretical efficiency of divide-and-conquer.
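The following is a simplified sketch, under assumptions of my own, of the general idea of splitting the unexecuted iterations of a loop in half when a steal occurs. The LoopRange structure, its fields, and the mutex-based protocol are invented here for exposition; they are not the thesis's loop frames and do not reproduce the THE-based protocol used in OpenCilk.

```c
#include <stdbool.h>
#include <pthread.h>

// Hypothetical shared loop descriptor: the owning worker executes
// iterations [next, end) one at a time, and a thief may split off the
// upper half of whatever has not been executed yet.
typedef struct {
    long next;             // next iteration the owning worker will run
    long end;              // one past the last iteration of this range
    pthread_mutex_t lock;  // stands in for the runtime's steal protocol
} LoopRange;

// Owner side: claim one iteration at a time.
static bool claim_iteration(LoopRange *r, long *i) {
    pthread_mutex_lock(&r->lock);
    bool ok = r->next < r->end;
    if (ok)
        *i = r->next++;
    pthread_mutex_unlock(&r->lock);
    return ok;
}

// Thief side: split off the upper half of the unexecuted iterations.
static bool steal_half(LoopRange *victim, long *lo, long *hi) {
    pthread_mutex_lock(&victim->lock);
    long remaining = victim->end - victim->next;
    bool ok = remaining > 1;
    if (ok) {
        long mid = victim->next + remaining / 2;
        *lo = mid;          // thief takes [mid, end)
        *hi = victim->end;
        victim->end = mid;  // victim keeps [next, mid)
    }
    pthread_mutex_unlock(&victim->lock);
    return ok;
}
```

In the thesis, updates to loop frames are instead coordinated by augmenting Cilk's THE protocol rather than by a per-range lock as above.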

I evaluate loop frames in practice by implementing them in the OpenCilk runtime system; in particular, loop frames augment the THE protocol from Cilk to coordinate updates to loop frames. I observe that loop frames and on-the-fly loop splitting incur substantially less overhead than the divide-and-conquer algorithm without sacrificing parallel scalability. Finally, I study the impact of increased locality in more than one dimension in nested-loop applications and find that both cache-aware and cache-oblivious reordering of nested loop iterations can yield performance benefits of up to 1.7×.
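To make the locality study concrete, here is a minimal sketch, again under assumptions of my own, of a cache-oblivious reordering of a doubly nested loop: the two-dimensional iteration space is traversed by recursively halving its larger dimension, so iterations that are close in (i, j) are also close in time at every scale. The visit body and the base-case size are placeholders, not benchmarks or parameters from the thesis.

```c
// Stand-in loop body; a real application would touch array data here.
static void visit(long i, long j) {
    (void)i; (void)j;
}

// Block size at which recursion stops (illustrative value).
#define BASE 64

// Cache-oblivious traversal of the iteration space [i0,i1) x [j0,j1).
// walk2d(0, m, 0, n) visits the same (i, j) pairs as the row-by-row
// doubly nested loop, but in a recursively blocked order.
static void walk2d(long i0, long i1, long j0, long j1) {
    long di = i1 - i0, dj = j1 - j0;
    if (di <= BASE && dj <= BASE) {
        for (long i = i0; i < i1; i++)   // small block: plain loops
            for (long j = j0; j < j1; j++)
                visit(i, j);
    } else if (di >= dj) {
        long mi = i0 + di / 2;           // split the i dimension
        walk2d(i0, mi, j0, j1);
        walk2d(mi, i1, j0, j1);
    } else {
        long mj = j0 + dj / 2;           // split the j dimension
        walk2d(i0, i1, j0, mj);
        walk2d(i0, i1, mj, j1);
    }
}
```
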
Bibliographic Details
Main Author: Govedic, Luka
Other Authors: Schardl, Tao B.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/151273