Bottom-Up Standardization For Data Preparation

Bibliographic Details
Main Author: Lai, Eugenie Y.
Other Authors: Cafarella, Michael J.
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/153866
https://orcid.org/0009-0005-1349-1376
Description
Summary: Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project; it transforms raw data into a format that enables further innovative work. Because data preparation scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and check. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices. Ideally, data preparation scripts are “admirably boring”: they should serve the project but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized, more boring version of itself. Our framework treats the user’s input script not as an unchangeable definition of correctness, but as a semantic sketch of the user’s overall intent. We present an algorithmic framework and implement a prototype system. We evaluate our approach against state-of-the-art methods, including GPT-4, on six real-world datasets. Our approach improves script standardization by 39.5% without meaningfully changing the user’s intent, whereas GPT-4 achieves 2.9%.