Coresets for scalable Bayesian logistic regression

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large da...


Bibliographic Details
Main Authors: Huggins, Jonathan H., Campbell, Trevor David, Broderick, Tamara A
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: Curran 2021
Online Access: https://hdl.handle.net/1721.1/129582
_version_ 1811077363110248448
author Huggins, Jonathan H.
Campbell, Trevor David
Broderick, Tamara A
author2 Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Huggins, Jonathan H.
Campbell, Trevor David
Broderick, Tamara A
author_sort Huggins, Jonathan H.
collection MIT
description The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large datasets difficult or infeasible. Recent work on scaling Bayesian inference has focused on modifying the underlying algorithms to, for example, use only a random data subsample at each iteration. We leverage the insight that data is often redundant to instead obtain a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We can then use this small coreset in any number of existing posterior inference algorithms without modification. In this paper, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset - both for fixed, known datasets, and in expectation for a wide class of data generative models. Crucially, the proposed approach also permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size. Furthermore, constructing the coreset takes a negligible amount of time compared to that required to run MCMC on it.
first_indexed 2024-09-23T10:41:45Z
format Article
id mit-1721.1/129582
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T10:41:45Z
publishDate 2021
publisher Curran
record_format dspace
spelling mit-1721.1/129582 2022-09-27T14:21:47Z Coresets for scalable Bayesian logistic regression Huggins, Jonathan H. Campbell, Trevor David Broderick, Tamara A Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science United States. Office of Naval Research. Multidisciplinary University Research Initiative (Grant N000141110688) 2021-01-27T18:50:42Z 2021-01-27T18:50:42Z 2016-12 2020-12-03T17:45:31Z Article http://purl.org/eprint/type/ConferencePaper 1049-5258 https://hdl.handle.net/1721.1/129582 Huggins, Jonathan H. et al. “Coresets for scalable Bayesian logistic regression.” Paper presented at the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 5-10 2016, Curran © 2016 The Author(s) en https://papers.nips.cc/paper/2016/hash/2b0f658cbffd284984fb11d90254081f-Abstract.html 30th Conference on Neural Information Processing Systems (NIPS 2016) Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. application/pdf Curran Neural Information Processing Systems (NIPS)
spellingShingle Huggins, Jonathan H.
Campbell, Trevor David
Broderick, Tamara A
Coresets for scalable Bayesian logistic regression
title Coresets for scalable Bayesian logistic regression
title_full Coresets for scalable Bayesian logistic regression
title_fullStr Coresets for scalable Bayesian logistic regression
title_full_unstemmed Coresets for scalable Bayesian logistic regression
title_short Coresets for scalable Bayesian logistic regression
title_sort coresets for scalable bayesian logistic regression
url https://hdl.handle.net/1721.1/129582
work_keys_str_mv AT hugginsjonathanh coresetsforscalablebayesianlogisticregression
AT campbelltrevordavid coresetsforscalablebayesianlogisticregression
AT brodericktamaraa coresetsforscalablebayesianlogisticregression