Streaming coreset constructions for M-estimators

© Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X...

Full description

Bibliographic Details
Format: Article
Language:English
Published: 2021
Online Access:https://hdl.handle.net/1721.1/137648
_version_ 1811088137573629952
collection MIT
description © Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑p∈P w(p) minc∈C D(p, c). A (k, ϵ)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ϵ)COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ϵ)COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ϵ−2k log k log n) points of storage. The previous state-of-the-art required storing at least O(ϵ−2k log k log4 n) points.
first_indexed 2024-09-23T13:56:51Z
format Article
id mit-1721.1/137648
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T13:56:51Z
publishDate 2021
record_format dspace
spelling mit-1721.1/1376482022-04-01T17:27:46Z Streaming coreset constructions for M-estimators © Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑p∈P w(p) minc∈C D(p, c). A (k, ϵ)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ϵ)COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ϵ)COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ϵ−2k log k log n) points of storage. The previous state-of-the-art required storing at least O(ϵ−2k log k log4 n) points. 2021-11-08T13:28:36Z 2021-11-08T13:28:36Z 2019-09 2021-03-24T17:36:28Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/137648 2019. "Streaming coreset constructions for M-estimators." Leibniz International Proceedings in Informatics, LIPIcs, 145. en 10.4230/LIPIcs.APPROX-RANDOM.2019.62 Leibniz International Proceedings in Informatics, LIPIcs Creative Commons Attribution 4.0 International license https://creativecommons.org/licenses/by/4.0/ application/pdf DROPS
spellingShingle Streaming coreset constructions for M-estimators
title Streaming coreset constructions for M-estimators
title_full Streaming coreset constructions for M-estimators
title_fullStr Streaming coreset constructions for M-estimators
title_full_unstemmed Streaming coreset constructions for M-estimators
title_short Streaming coreset constructions for M-estimators
title_sort streaming coreset constructions for m estimators
url https://hdl.handle.net/1721.1/137648