Query Optimization for Dynamic Imputation

Bibliographic Details
Main Authors: Cambronero, José, Feser, John K., Smith, Micah J., Madden, Samuel
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: VLDB Endowment 2021
Online Access: https://hdl.handle.net/1721.1/137765
Description
Summary: © 2017 VLDB. Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our system, ImputeDB, incorporates imputation into a cost-based query optimizer, performing necessary imputations on-the-fly for each query. This allows users to immediately explore their data, while the system picks the optimal placement of imputation operations. We evaluate this approach on three real-world survey-based datasets. Our experiments show that our query plans execute between 10 and 140 times faster than first imputing the base tables. Furthermore, we show that the query results from on-the-fly imputation differ from the traditional base-table imputation approach by 0-8%. Finally, we show that while dropping tuples with missing values that fail query constraints discards 6-78% of the data, on-the-fly imputation loses only 0-21%.