Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
Let <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </in...
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: |
MDPI AG
2020-04-01
|
Series: | Algorithms |
Subjects: | |
Online Access: | https://www.mdpi.com/1999-4893/13/4/92 |
_version_ | 1827718701803110400 |
---|---|
author | Artem Barger Dan Feldman |
author_facet | Artem Barger Dan Feldman |
author_sort | Artem Barger |
collection | DOAJ |
description | Let <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud. |
first_indexed | 2024-03-10T20:28:08Z |
format | Article |
id | doaj.art-6f6f610648e947ec808e401b0711cec8 |
institution | Directory Open Access Journal |
issn | 1999-4893 |
language | English |
last_indexed | 2024-03-10T20:28:08Z |
publishDate | 2020-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Algorithms |
spelling | doaj.art-6f6f610648e947ec808e401b0711cec82023-11-19T21:36:03ZengMDPI AGAlgorithms1999-48932020-04-011349210.3390/a13040092Deterministic Coresets for <em>k</em>-Means of Big Sparse DataArtem Barger0Dan Feldman1Department of Computer, Science University of Haifa, Haifa 32000, IsraelDepartment of Computer, Science University of Haifa, Haifa 32000, IsraelLet <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.https://www.mdpi.com/1999-4893/13/4/92coresetclusteringKMeansbig datastreaming |
spellingShingle | Artem Barger Dan Feldman Deterministic Coresets for <em>k</em>-Means of Big Sparse Data Algorithms coreset clustering KMeans big data streaming |
title | Deterministic Coresets for <em>k</em>-Means of Big Sparse Data |
title_full | Deterministic Coresets for <em>k</em>-Means of Big Sparse Data |
title_fullStr | Deterministic Coresets for <em>k</em>-Means of Big Sparse Data |
title_full_unstemmed | Deterministic Coresets for <em>k</em>-Means of Big Sparse Data |
title_short | Deterministic Coresets for <em>k</em>-Means of Big Sparse Data |
title_sort | deterministic coresets for em k em means of big sparse data |
topic | coreset clustering KMeans big data streaming |
url | https://www.mdpi.com/1999-4893/13/4/92 |
work_keys_str_mv | AT artembarger deterministiccoresetsforemkemmeansofbigsparsedata AT danfeldman deterministiccoresetsforemkemmeansofbigsparsedata |