Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism

With the slowing down of Moore’s Law, silicon fabrication technology is not yielding the performance improvements it once did. Hardware accelerators, which tailor their architecture to a specific application or domain, have emerged as an attractive approach to improve performance. Unfortunately, cur...

Full description

Bibliographic Details
Main Author: Abeydeera, Maleen
Other Authors: Sanchez, Daniel
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access:https://hdl.handle.net/1721.1/140000
_version_ 1811091963718402048
author Abeydeera, Maleen
author2 Sanchez, Daniel
author_facet Sanchez, Daniel
Abeydeera, Maleen
author_sort Abeydeera, Maleen
collection MIT
description With the slowing down of Moore’s Law, silicon fabrication technology is not yielding the performance improvements it once did. Hardware accelerators, which tailor their architecture to a specific application or domain, have emerged as an attractive approach to improve performance. Unfortunately, current accelerators have been limited to domains such as deep learning, where parallelism is easy to exploit. Many applications do not have such easy-to-extract parallelism and have remained off-limits to accelerators. This thesis presents techniques to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. In speculative execution, tasks are executed in parallel assuming they are independent. A runtime system monitors their execution to see if they are. If a task produces a conflict during execution, i.e., if it may violate a data dependence, then it is aborted and re-executed. This thesis proposes Chronos, a framework-based approach for building accelerators that use speculation to extract parallelism. Under Chronos, accelerator designers express the algorithm as a set of ordered tasks, and then design processing elements (PEs) to execute each of these tasks. The framework provides reusable components for task management and speculative execution, saving most of the developer effort in creating accelerators for new applications. Prior general-purpose architectures have leveraged already existing techniques, like cache-coherence protocols, for conflict detection, but implementing coherence would add complexity, latency and significant on-chip storage requirement, making these techniques expensive on accelerators. To tackle this challenge, we first propose a new execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. We then use SLOT to implement the Chronos framework. This implementation avoids the need for cache coherence and makes speculative execution cheap and distributed. This reduces overheads and improves performance by up to 2× over prior conflict detection techniques. While SLOT achieves excellent performance on many algorithms, it is sometimes desirable to allow a single task to access multiple objects. Thus, we extend Chronos to support the more general Swarm execution model, which allows this and is also easier to program. This Chronos-Swarm implementation improves performance when Swarm’s features are needed, but it hurts performance when they are not, as the Swarm execution model requires more expensive conflict checks on each memory access. To bridge this gap, we introduce a hybrid SLOT/Swarm execution model that combines the generality and ease-of-programming of Swarm with the performance of SLOT. We develop FPGA implementations of Chronos and use them to build accelerators for several challenging applications. When run on cloud FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.
first_indexed 2024-09-23T15:10:25Z
format Thesis
id mit-1721.1/140000
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T15:10:25Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/1400002022-02-08T03:12:39Z Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism Abeydeera, Maleen Sanchez, Daniel Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science With the slowing down of Moore’s Law, silicon fabrication technology is not yielding the performance improvements it once did. Hardware accelerators, which tailor their architecture to a specific application or domain, have emerged as an attractive approach to improve performance. Unfortunately, current accelerators have been limited to domains such as deep learning, where parallelism is easy to exploit. Many applications do not have such easy-to-extract parallelism and have remained off-limits to accelerators. This thesis presents techniques to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. In speculative execution, tasks are executed in parallel assuming they are independent. A runtime system monitors their execution to see if they are. If a task produces a conflict during execution, i.e., if it may violate a data dependence, then it is aborted and re-executed. This thesis proposes Chronos, a framework-based approach for building accelerators that use speculation to extract parallelism. Under Chronos, accelerator designers express the algorithm as a set of ordered tasks, and then design processing elements (PEs) to execute each of these tasks. The framework provides reusable components for task management and speculative execution, saving most of the developer effort in creating accelerators for new applications. Prior general-purpose architectures have leveraged already existing techniques, like cache-coherence protocols, for conflict detection, but implementing coherence would add complexity, latency and significant on-chip storage requirement, making these techniques expensive on accelerators. To tackle this challenge, we first propose a new execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. We then use SLOT to implement the Chronos framework. This implementation avoids the need for cache coherence and makes speculative execution cheap and distributed. This reduces overheads and improves performance by up to 2× over prior conflict detection techniques. While SLOT achieves excellent performance on many algorithms, it is sometimes desirable to allow a single task to access multiple objects. Thus, we extend Chronos to support the more general Swarm execution model, which allows this and is also easier to program. This Chronos-Swarm implementation improves performance when Swarm’s features are needed, but it hurts performance when they are not, as the Swarm execution model requires more expensive conflict checks on each memory access. To bridge this gap, we introduce a hybrid SLOT/Swarm execution model that combines the generality and ease-of-programming of Swarm with the performance of SLOT. We develop FPGA implementations of Chronos and use them to build accelerators for several challenging applications. When run on cloud FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×. Ph.D. 2022-02-07T15:18:14Z 2022-02-07T15:18:14Z 2021-09 2021-09-21T19:31:10.009Z Thesis https://hdl.handle.net/1721.1/140000 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Abeydeera, Maleen
Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title_full Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title_fullStr Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title_full_unstemmed Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title_short Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism
title_sort scalable and broad hardware acceleration through practical speculative parallelism
url https://hdl.handle.net/1721.1/140000
work_keys_str_mv AT abeydeeramaleen scalableandbroadhardwareaccelerationthroughpracticalspeculativeparallelism