ExoBLAS: Meta-Programming a High-Performance BLAS via Scheduling Automations

Kernel libraries are designed to support numerical computations and provide efficient implementations of them. The goal of these libraries is to provide many optimized functionalities, which is a challenge when the implementations of those programs are often written in C or assembly. BLAS (Basic Lin...

Full description

Bibliographic Details
Main Author: Droubi, Samir
Other Authors: Ragan-Kelley, Jonathan
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access:https://hdl.handle.net/1721.1/156752
Description
Summary:Kernel libraries are designed to support numerical computations and provide efficient implementations of them. The goal of these libraries is to provide many optimized functionalities, which is a challenge when the implementations of those programs are often written in C or assembly. BLAS (Basic Linear Algebra Subprograms) is a famous example of such libraries where the dimensionality of the interface imposes a huge space of functions to implement, which makes it particularly challenging to support. Our work tackles the problem of implementing BLAS in the context of meta-programming, particularly user-scheduling in the Exo programming language. We base our solution on three key ideas to achieve reuse at the level of the meta-program. First, there are similarities in the individual optimizations that are performed on these kernels, which we capture as scheduling operations with which we extend the Exo programming language. Secondly, the end-to-end optimization strategies (or schedules) for groups of these kernels are the same, and we capture them as scheduling automations. Lastly, more complex BLAS operations from higher levels can be transformed into less complex BLAS-like operations similar to operations from lower levels, so we can use the automation of a lower level to build the automation of a higher level. We evaluated our results against industry and open source implementations of BLAS and show that we achieve competitive performance with a small implementation in terms of lines of code.