Super-sampling with a reservoir

We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall ap...

Ful tanımlama

Detaylı Bibliyografya
Asıl Yazarlar: Paige, B, Sejdinovic, D, Wood, F
Materyal Türü: Conference item
Baskı/Yayın Bilgisi: AUAI Press 2016
_version_ 1826275714638282752
author Paige, B
Sejdinovic, D
Wood, F
author_facet Paige, B
Sejdinovic, D
Wood, F
author_sort Paige, B
collection OXFORD
description We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets which minimize the worst-case relative error when computing expectations of functions in a specified function class, using just the samples from the subset. Kernel functions are approximated using random Fourier features, and the subset of samples itself is stored in a random projection tree. The resulting algorithm runs in a single pass through the whole data set, and has a per-iteration computational complexity logarithmic in the size of the subset. These “supersamples” subsampled from the full data provide a concise summary, as demonstrated empirically on mixture models and the MNIST dataset.
first_indexed 2024-03-06T23:02:59Z
format Conference item
id oxford-uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b9
institution University of Oxford
last_indexed 2024-03-06T23:02:59Z
publishDate 2016
publisher AUAI Press
record_format dspace
spelling oxford-uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b92022-03-26T18:08:40ZSuper-sampling with a reservoirConference itemhttp://purl.org/coar/resource_type/c_5794uuid:62cf0b18-df18-4729-bd1f-21a9e834e2b9Symplectic Elements at OxfordAUAI Press2016Paige, BSejdinovic, DWood, FWe introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets which minimize the worst-case relative error when computing expectations of functions in a specified function class, using just the samples from the subset. Kernel functions are approximated using random Fourier features, and the subset of samples itself is stored in a random projection tree. The resulting algorithm runs in a single pass through the whole data set, and has a per-iteration computational complexity logarithmic in the size of the subset. These “supersamples” subsampled from the full data provide a concise summary, as demonstrated empirically on mixture models and the MNIST dataset.
spellingShingle Paige, B
Sejdinovic, D
Wood, F
Super-sampling with a reservoir
title Super-sampling with a reservoir
title_full Super-sampling with a reservoir
title_fullStr Super-sampling with a reservoir
title_full_unstemmed Super-sampling with a reservoir
title_short Super-sampling with a reservoir
title_sort super sampling with a reservoir
work_keys_str_mv AT paigeb supersamplingwithareservoir
AT sejdinovicd supersamplingwithareservoir
AT woodf supersamplingwithareservoir