Shape complexity in cluster analysis.

In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the...


Bibliographic Details
Main Authors:	Eduardo J Aguilar, Valmir C Barbosa
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2023-01-01
Series:	PLoS ONE
Online Access:	https://doi.org/10.1371/journal.pone.0286312
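
The abstract describes division by the standard deviation along each dimension as the usual scaling step before a distance-based method such as k-means. The following is a minimal sketch of that baseline only, not of the article's shape-complexity method. The toy data, the function names `scale_by_std` and `kmeans`, the use of the population standard deviation, and the farthest-point initialization are all assumptions made for illustration.

```python
import math
from statistics import pstdev


def scale_by_std(data):
    """Divide each column by its standard deviation.

    Assumption: population standard deviation; constant columns are left unscaled.
    """
    columns = list(zip(*data))
    sds = [pstdev(col) or 1.0 for col in columns]
    scaled = [[x / s for x, s in zip(row, sds)] for row in data]
    return scaled, sds


def kmeans(data, k, iters=100):
    """Plain Lloyd iterations with a deterministic farthest-point start (illustrative choice)."""
    centers = [list(data[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        far = max(data, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data: two groups separated along x (small units),
# with a nuisance y coordinate measured in much larger units.
raw = [
    [0.0, 100.0], [0.1, 0.0], [0.1, 200.0],   # group A
    [1.0, 100.0], [1.1, 0.0], [1.1, 200.0],   # group B
]

raw_labels, _ = kmeans(raw, k=2)
scaled, sds = scale_by_std(raw)
scaled_labels, _ = kmeans(scaled, k=2)

print("unscaled labels:", raw_labels)
print("scaled labels:  ", scaled_labels)
```

On this toy set the unscaled distances are dominated by the large-unit coordinate, which is the kind of distortion that scaling factors are meant to correct. The article's contribution, per the abstract, is a different way of choosing those factors.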

author	Eduardo J Aguilar, Valmir C Barbosa
collection	DOAJ
description	In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
first_indexed	2024-03-13T04:25:31Z
format	Article
id	doaj.art-140d489a278d46e19ae3894a0953ce11
institution	Directory Open Access Journal
issn	1932-6203
language	English
last_indexed	2024-03-13T04:25:31Z
publishDate	2023-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
spelling	doaj.art-140d489a278d46e19ae3894a0953ce11; 2023-06-20T05:31:19Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2023-01-01; 18; 5; e0286312; 10.1371/journal.pone.0286312; Shape complexity in cluster analysis.; Eduardo J Aguilar; Valmir C Barbosa; https://doi.org/10.1371/journal.pone.0286312
title	Shape complexity in cluster analysis.
url	https://doi.org/10.1371/journal.pone.0286312

Similar Items