Bottom-Up Standardization For Data Preparation
Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project — it transforms raw data into a format that enables further innovative w...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2024
|
Online Access: | https://hdl.handle.net/1721.1/153866 https://orcid.org/0009-0005-1349-1376 |
Summary: | Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project — it transforms raw data into a format that enables further innovative work. Because data preparation scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and check. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices.
Ideally, data preparation scripts are “admirably boring” — they should serve the project, but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized, more boring version of itself. Our framework takes the user’s input script not as an unchangeable definition of correctness, but as a semantic sketch of the user’s overall intent. We present an algorithmic framework and implemented a prototype system. We evaluate our approach against state-of-the-art methods, including GPT-4, on six real-world datasets. Our approach improves script standardization by 39.5% while not meaningfully changing the user’s intent, while GPT-4 achieves 2.9%. |
---|