LANCET: labeling complex data at scale

Cutting-edge machine learning techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges i...

Bibliographic Details
Main Authors:	Zhang, Huayi; Cao, Lei; Madden, Samuel; Rundensteiner, Elke
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Language:	English
Published:	VLDB Endowment 2022
Online Access:	https://hdl.handle.net/1721.1/143771

_version_	1826189696502333440
author	Zhang, Huayi; Cao, Lei; Madden, Samuel; Rundensteiner, Elke
author2	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Zhang, Huayi; Cao, Lei; Madden, Samuel; Rundensteiner, Elke
author_sort	Zhang, Huayi
collection	MIT
description	Cutting-edge machine learning techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only each challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques provide at best isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label as its nearby labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aiming to ensure that unlabeled objects always have some sufficiently close labeled neighbors. These two strategies jointly maximize the accuracy of the automatically produced labels and the prediction accuracy of the machine learning models trained on these labels. Lastly, LANCET uses a distribution matching network to verify whether both the Covariate-shift and Continuity conditions hold, in which case it is safe to terminate the labeling process. Our experiments on diverse public data sets demonstrate that LANCET consistently outperforms state-of-the-art methods, from Snuba to GOGGLES, and other baselines by a large margin, with accuracy gains of up to 30 percentage points.
first_indexed	2024-09-23T08:20:01Z
format	Article
id	mit-1721.1/143771
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T08:20:01Z
publishDate	2022
publisher	VLDB Endowment
record_format	dspace
spelling	mit-1721.1/143771 2023-03-29T19:25:18Z LANCET: labeling complex data at scale Zhang, Huayi Cao, Lei Madden, Samuel Rundensteiner, Elke Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science 2022-07-15T16:19:12Z 2022-07-15T16:19:12Z 2021 2022-07-15T16:03:56Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/143771 Zhang, Huayi, Cao, Lei, Madden, Samuel and Rundensteiner, Elke. 2021. "LANCET: labeling complex data at scale." Proceedings of the VLDB Endowment, 14 (11). en 10.14778/3476249.3476269 Proceedings of the VLDB Endowment Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/ application/pdf VLDB Endowment VLDB Endowment
spellingShingle	Zhang, Huayi Cao, Lei Madden, Samuel Rundensteiner, Elke LANCET: labeling complex data at scale
title	LANCET: labeling complex data at scale
title_full	LANCET: labeling complex data at scale
title_fullStr	LANCET: labeling complex data at scale
title_full_unstemmed	LANCET: labeling complex data at scale
title_short	LANCET: labeling complex data at scale
title_sort	lancet labeling complex data at scale
url	https://hdl.handle.net/1721.1/143771
work_keys_str_mv	AT zhanghuayi lancetlabelingcomplexdataatscale AT caolei lancetlabelingcomplexdataatscale AT maddensamuel lancetlabelingcomplexdataatscale AT rundensteinerelke lancetlabelingcomplexdataatscale

Similar Items