Deterministic Coresets for k-Means of Big Sparse Data

Let P be a set of n points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </in...

Full description

Bibliographic Details
Main Authors:	Artem Barger, Dan Feldman
Format:	Article
Language:	English
Published:	MDPI AG 2020-04-01
Series:	Algorithms
Subjects:	coreset clustering KMeans big data streaming
Online Access:	https://www.mdpi.com/1999-4893/13/4/92

_version_	1827718701803110400
author	Artem Barger Dan Feldman
author_facet	Artem Barger Dan Feldman
author_sort	Artem Barger
collection	DOAJ
description	Let <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.
first_indexed	2024-03-10T20:28:08Z
format	Article
id	doaj.art-6f6f610648e947ec808e401b0711cec8
institution	Directory Open Access Journal
issn	1999-4893
language	English
last_indexed	2024-03-10T20:28:08Z
publishDate	2020-04-01
publisher	MDPI AG
record_format	Article
series	Algorithms
spelling	doaj.art-6f6f610648e947ec808e401b0711cec82023-11-19T21:36:03ZengMDPI AGAlgorithms1999-48932020-04-011349210.3390/a13040092Deterministic Coresets for <em>k</em>-Means of Big Sparse DataArtem Barger0Dan Feldman1Department of Computer, Science University of Haifa, Haifa 32000, IsraelDepartment of Computer, Science University of Haifa, Haifa 32000, IsraelLet <i>P</i> be a set of <i>n</i> points in <inline-formula> <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </semantics> </math> </inline-formula>, <inline-formula> <math display="inline"> <semantics> <mrow> <mi>k</mi> <mo>≥</mo> <mn>1</mn> </mrow> </semantics> </math> </inline-formula> be an integer and <inline-formula> <math display="inline"> <semantics> <mrow> <mi>ε</mi> <mo>∈</mo> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>)</mo> </mrow> </semantics> </math> </inline-formula> be a constant. An <i>ε-coreset</i> is a subset <inline-formula> <math display="inline"> <semantics> <mrow> <mi>C</mi> <mo>⊆</mo> <mi>P</mi> </mrow> </semantics> </math> </inline-formula> with appropriate non-negative weights (scalars), that approximates <i>any</i> given set <inline-formula> <math display="inline"> <semantics> <mrow> <mi>Q</mi> <mo>⊆</mo> <msup> <mi mathvariant="double-struck">R</mi> <mi>d</mi> </msup> </mrow> </semantics> </math> </inline-formula> of <i>k</i> centers. That is, the sum of squared distances over every point in <i>P</i> to its closest point in <i>Q</i> is the same, up to a factor of <inline-formula> <math display="inline"> <semantics> <mrow> <mn>1</mn> <mo>±</mo> <mi>ε</mi> </mrow> </semantics> </math> </inline-formula> to the weighted sum of <i>C</i> to the same <i>k</i> centers. If the coreset is small, we can solve problems such as <i>k</i>-means clustering or its variants (e.g., discrete <i>k</i>-means, where the centers are restricted to be in <i>P</i>, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension <i>d</i>. We suggest the first such coreset of size <i>independent</i> of <i>d</i>. This is also the first deterministic coreset construction whose resulting size is not exponential in <i>d</i>. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.https://www.mdpi.com/1999-4893/13/4/92coresetclusteringKMeansbig datastreaming
spellingShingle	Artem Barger Dan Feldman Deterministic Coresets for <em>k</em>-Means of Big Sparse Data Algorithms coreset clustering KMeans big data streaming
title	Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_full	Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_fullStr	Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_full_unstemmed	Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_short	Deterministic Coresets for <em>k</em>-Means of Big Sparse Data
title_sort	deterministic coresets for em k em means of big sparse data
topic	coreset clustering KMeans big data streaming
url	https://www.mdpi.com/1999-4893/13/4/92
work_keys_str_mv	AT artembarger deterministiccoresetsforemkemmeansofbigsparsedata AT danfeldman deterministiccoresetsforemkemmeansofbigsparsedata

Deterministic Coresets for <em>k</em>-Means of Big Sparse Data

Similar Items