Super-sampling with a reservoir
We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall ap...
Asıl Yazarlar: | , , |
---|---|
Materyal Türü: | Conference item |
Baskı/Yayın Bilgisi: |
AUAI Press
2016
|
_version_ | 1826275714638282752 |
---|---|
author | Paige, B Sejdinovic, D Wood, F |
author_facet | Paige, B Sejdinovic, D Wood, F |
author_sort | Paige, B |
collection | OXFORD |
description | We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets which minimize the worst-case relative error when computing expectations of functions in a specified function class, using just the samples from the subset. Kernel functions are approximated using random Fourier features, and the subset of samples itself is stored in a random projection tree. The resulting algorithm runs in a single pass through the whole data set, and has a per-iteration computational complexity logarithmic in the size of the subset. These “supersamples” subsampled from the full data provide a concise summary, as demonstrated empirically on mixture models and the MNIST dataset. |
first_indexed | 2024-03-06T23:02:59Z |
format | Conference item |
id | oxford-uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b9 |
institution | University of Oxford |
last_indexed | 2024-03-06T23:02:59Z |
publishDate | 2016 |
publisher | AUAI Press |
record_format | dspace |
spelling | oxford-uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b92022-03-26T18:08:40ZSuper-sampling with a reservoirConference itemhttp://purl.org/coar/resource_type/c_5794uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b9Symplectic Elements at OxfordAUAI Press2016Paige, BSejdinovic, DWood, FWe introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets which minimize the worst-case relative error when computing expectations of functions in a specified function class, using just the samples from the subset. Kernel functions are approximated using random Fourier features, and the subset of samples itself is stored in a random projection tree. The resulting algorithm runs in a single pass through the whole data set, and has a per-iteration computational complexity logarithmic in the size of the subset. These “supersamples” subsampled from the full data provide a concise summary, as demonstrated empirically on mixture models and the MNIST dataset. |
spellingShingle | Paige, B Sejdinovic, D Wood, F Super-sampling with a reservoir |
title | Super-sampling with a reservoir |
title_full | Super-sampling with a reservoir |
title_fullStr | Super-sampling with a reservoir |
title_full_unstemmed | Super-sampling with a reservoir |
title_short | Super-sampling with a reservoir |
title_sort | super sampling with a reservoir |
work_keys_str_mv | AT paigeb supersamplingwithareservoir AT sejdinovicd supersamplingwithareservoir AT woodf supersamplingwithareservoir |