What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately...

Full description

Bibliographic Details
Main Authors:	Raykov, Yordan P., Boukouvalas, Alexis, Baig, Fahd, Little, Max
Other Authors:	Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Format:	Article
Language:	en_US
Published:	Public Library of Science 2017
Online Access:	http://hdl.handle.net/1721.1/109129

_version_	1811082286661107712
author	Raykov, Yordan P. Boukouvalas, Alexis Baig, Fahd Little, Max
author2	Program in Media Arts and Sciences (Massachusetts Institute of Technology)
author_facet	Program in Media Arts and Sciences (Massachusetts Institute of Technology) Raykov, Yordan P. Boukouvalas, Alexis Baig, Fahd Little, Max
author_sort	Raykov, Yordan P.
collection	MIT
description	The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
first_indexed	2024-09-23T12:00:42Z
format	Article
id	mit-1721.1/109129
institution	Massachusetts Institute of Technology
language	en_US
last_indexed	2024-09-23T12:00:42Z
publishDate	2017
publisher	Public Library of Science
record_format	dspace
spelling	mit-1721.1/1091292022-10-01T07:36:26Z What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm Raykov, Yordan P. Boukouvalas, Alexis Baig, Fahd Little, Max Program in Media Arts and Sciences (Massachusetts Institute of Technology) Little, Max The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism. 2017-05-16T18:43:38Z 2017-05-16T18:43:38Z 2016-09 2016-01 Article http://purl.org/eprint/type/JournalArticle 1932-6203 http://hdl.handle.net/1721.1/109129 Raykov, Yordan P.; Boukouvalas, Alexis; Baig, Fahd and Little, Max A. “What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm.” Edited by Byung-Jun Yoon. PLOS ONE 11, no. 9 (September 2016): e0162259. © 2016 Raykov et al en_US http://dx.doi.org/10.1371/journal.pone.0162259 PLOS ONE Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/ application/pdf Public Library of Science PLoS
spellingShingle	Raykov, Yordan P. Boukouvalas, Alexis Baig, Fahd Little, Max What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_full	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_fullStr	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_full_unstemmed	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_short	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_sort	what to do when k means clustering fails a simple yet principled alternative algorithm
url	http://hdl.handle.net/1721.1/109129
work_keys_str_mv	AT raykovyordanp whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm AT boukouvalasalexis whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm AT baigfahd whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm AT littlemax whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm

What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

Similar Items