ProgGen: Automatic Dataset Generation for the Halide Domain Specific Language

Compilers use cost models to choose between different optimization opportunities, and increasingly these cost models are developed using data-driven techniques. Compilers for general-purpose languages rely on large real-world program datasets to train their cost models. However, cost models for doma...

Full description

Bibliographic Details
Main Author: Holbrook, Zachary
Other Authors: Amarasinghe, Saman
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access:https://hdl.handle.net/1721.1/139199
Description
Summary:Compilers use cost models to choose between different optimization opportunities, and increasingly these cost models are developed using data-driven techniques. Compilers for general-purpose languages rely on large real-world program datasets to train their cost models. However, cost models for domain-specific languages often have to use program generators due to a lack of large datasets of real-world programs. Program dataset generators are typically manually constructed or handwritten to generate programs in a randomly guided way. However, writing a program generator is time-consuming and requires considerable tuning to produce programs with realistic computation patterns in the desired domain. This thesis presents ProgGen, a program generator inspired by genetic programming for automatically generating program datasets used in training compiler cost models. ProgGen automatically produces program datasets in different domains by starting with a small initial set of programs in the desired domain. I compare ProgGen with the random program generator used in the Halide Autoscheduler [1]. While the Halide random program generator performs better in the image processing and neural network domains it was designed for, ProgGen is competitive in video processing and linear algebra domains. Due to the automatic nature of ProgGen, ProgGen can also generate programs in new domains with far less engineering time.