Democratizing Data Science through Interactive Curation of ML Pipelines

© 2019 Association for Computing Machinery. Statistical knowledge and domain expertise are key to extract actionable insights out of data, yet such skills rarely coexist together. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and...

Bibliographic Details
Main Authors: Shang, Zeyuan, Zgraggen, Emanuel, Buratti, Benedetto, Kossmann, Ferdinand, Eichmann, Philipp, Chung, Yeounoh, Binnig, Carsten, Upfal, Eli, Kraska, Tim
Format: Article
Language: English
Published: Association for Computing Machinery (ACM) 2021
Online Access: https://hdl.handle.net/1721.1/132275
author Shang, Zeyuan
Zgraggen, Emanuel
Buratti, Benedetto
Kossmann, Ferdinand
Eichmann, Philipp
Chung, Yeounoh
Binnig, Carsten
Upfal, Eli
Kraska, Tim
author_sort Shang, Zeyuan
collection MIT
description © 2019 Association for Computing Machinery. Statistical knowledge and domain expertise are key to extract actionable insights out of data, yet such skills rarely coexist together. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, de-facto inhibiting a wider adoption of ML techniques in other fields. Existing libraries that claim to solve this problem, still require well-trained practitioners. Those frameworks involve heavy data preparation steps and are often too slow for interactive feedback from the user, severely limiting the scope of such systems. In this paper we present Alpine Meadow, a first Interactive Automated Machine Learning tool. What makes our system unique is not only the focus on interactivity, but also the combined systemic and algorithmic design approach; on one hand we leverage ideas from query optimization, on the other we devise novel selection and pruning strategies combining cost-based Multi-Armed Bandits and Bayesian Optimization. We evaluate our system on over 300 datasets and compare against other AutoML tools, including the current NIPS winner, as well as expert solutions. Not only is Alpine Meadow able to significantly outperform the other AutoML systems while - in contrast to the other systems - providing interactive latencies, but also outperforms in 80% of the cases expert solutions over data sets we have never seen before.
first_indexed 2024-09-23T08:11:20Z
format Article
id mit-1721.1/132275
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T08:11:20Z
publishDate 2021
publisher Association for Computing Machinery (ACM)
record_format dspace
spelling mit-1721.1/1322752021-09-21T04:07:11Z Democratizing Data Science through Interactive Curation of ML Pipelines Shang, Zeyuan Zgraggen, Emanuel Buratti, Benedetto Kossmann, Ferdinand Eichmann, Philipp Chung, Yeounoh Binnig, Carsten Upfal, Eli Kraska, Tim © 2019 Association for Computing Machinery. Statistical knowledge and domain expertise are key to extract actionable insights out of data, yet such skills rarely coexist together. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, de-facto inhibiting a wider adoption of ML techniques in other fields. Existing libraries that claim to solve this problem, still require well-trained practitioners. Those frameworks involve heavy data preparation steps and are often too slow for interactive feedback from the user, severely limiting the scope of such systems. In this paper we present Alpine Meadow, a first Interactive Automated Machine Learning tool. What makes our system unique is not only the focus on interactivity, but also the combined systemic and algorithmic design approach; on one hand we leverage ideas from query optimization, on the other we devise novel selection and pruning strategies combining cost-based Multi-Armed Bandits and Bayesian Optimization. We evaluate our system on over 300 datasets and compare against other AutoML tools, including the current NIPS winner, as well as expert solutions. Not only is Alpine Meadow able to significantly outperform the other AutoML systems while - in contrast to the other systems - providing interactive latencies, but also outperforms in 80% of the cases expert solutions over data sets we have never seen before.
2021-09-20T18:21:37Z 2021-09-20T18:21:37Z 2021-01-11T15:14:35Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/132275 en 10.1145/3299869.3319863 Proceedings of the ACM SIGMOD International Conference on Management of Data Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Association for Computing Machinery (ACM) Other repository
title Democratizing Data Science through Interactive Curation of ML Pipelines
url https://hdl.handle.net/1721.1/132275