Bottom-Up Standardization For Data Preparation

Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project — it transforms raw data into a format that enables further innovative w...

Full description

Bibliographic Details
Main Author:	Lai, Eugenie Y.
Other Authors:	Cafarella, Michael J.
Format:	Thesis
Published:	Massachusetts Institute of Technology 2024
Online Access:	https://hdl.handle.net/1721.1/153866 https://orcid.org/0009-0005-1349-1376

Description
Summary:	Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project — it transforms raw data into a format that enables further innovative work. Because data preparation scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and check. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices. Ideally, data preparation scripts are “admirably boring” — they should serve the project, but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized, more boring version of itself. Our framework takes the user’s input script not as an unchangeable definition of correctness, but as a semantic sketch of the user’s overall intent. We present an algorithmic framework and implemented a prototype system. We evaluate our approach against state-of-the-art methods, including GPT-4, on six real-world datasets. Our approach improves script standardization by 39.5% while not meaningfully changing the user’s intent, while GPT-4 achieves 2.9%.

Bottom-Up Standardization For Data Preparation

Similar Items