Deterministic Coresets for <em>k</em>-Means of Big Sparse Data

Let <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </in...

Full description

Bibliographic Details
Main Authors: Artem Barger, Dan Feldman
Format: Article
Language:English
Published: MDPI AG 2020-04-01
Series:Algorithms
Subjects:
Online Access:https://www.mdpi.com/1999-4893/13/4/92
_version_ 1827718701803110400
author Artem Barger
Dan Feldman
author_facet Artem Barger
Dan Feldman
author_sort Artem Barger
collection DOAJ
description Let <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.
first_indexed 2024-03-10T20:28:08Z
format Article
id doaj.art-6f6f610648e947ec808e401b0711cec8
institution Directory Open Access Journal
issn 1999-4893
language English
last_indexed 2024-03-10T20:28:08Z
publishDate 2020-04-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj.art-6f6f610648e947ec808e401b0711cec82023-11-19T21:36:03ZengMDPI AGAlgorithms1999-48932020-04-011349210.3390/a13040092Deterministic Coresets for <em>k</em>-Means of Big Sparse DataArtem Barger0Dan Feldman1Department of Computer, Science University of Haifa, Haifa 32000, IsraelDepartment of Computer, Science University of Haifa, Haifa 32000, IsraelLet <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.https://www.mdpi.com/1999-4893/13/4/92coresetclusteringKMeansbig datastreaming
spellingShingle Artem Barger
Dan Feldman
Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
Algorithms
coreset
clustering
KMeans
big data
streaming
title Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_full Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_fullStr Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_full_unstemmed Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_short Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_sort deterministic coresets for em k em means of big sparse data
topic coreset
clustering
KMeans
big data
streaming
url https://www.mdpi.com/1999-4893/13/4/92
work_keys_str_mv AT artembarger deterministiccoresetsforemkemmeansofbigsparsedata
AT danfeldman deterministiccoresetsforemkemmeansofbigsparsedata