An analysis framework for clustering algorithm selection with applications to spectroscopy.

Cluster analysis is a valuable unsupervised machine learning technique that is applied in a multitude of domains to identify similarities or clusters in unlabelled data. However, its performance is dependent on the characteristics of the data it is being applied to. There is no universally best clus...


Bibliographic Details
Main Authors: Simon Crase, Suresh N Thennadil
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0266369
author Simon Crase
Suresh N Thennadil
collection DOAJ
description Cluster analysis is a valuable unsupervised machine learning technique that is applied in a multitude of domains to identify similarities or clusters in unlabelled data. However, its performance is dependent on the characteristics of the data it is being applied to. There is no universally best clustering algorithm, and hence there are numerous clustering algorithms available with different performance characteristics. This raises the problem of how to select an appropriate clustering algorithm for the given analytical purposes. We present and validate an analysis framework to address this problem. Unlike most current literature, which focuses on characterizing the clustering algorithm itself, we present a wider holistic approach, with a focus on the user's needs, the data's characteristics and the characteristics of the clusters it may contain. In our analysis framework, we utilize a softer qualitative approach to identify appropriate characteristics for consideration when matching clustering algorithms to the intended application. These are used to generate a small subset of suitable clustering algorithms whose performance is then evaluated utilizing quantitative cluster validity indices. To validate our analysis framework for selecting clustering algorithms, we applied it to four different types of datasets: three datasets of homemade explosives spectroscopy, eight datasets of publicly available spectroscopy data covering food and biomedical applications, a gene expression cancer dataset, and three classic machine learning datasets. Each data type has discernible differences in the composition of the data and the context within which they are used. Our analysis framework, when applied to each of these challenges, recommended differing subsets of clustering algorithms for final quantitative performance evaluation. For each application, the recommended clustering algorithms were confirmed to contain the top performing algorithms through quantitative performance indices.
first_indexed 2024-04-13T03:41:15Z
format Article
id doaj.art-5de6729153364990bc23848fbbf6aadc
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-13T03:41:15Z
publishDate 2022-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-5de6729153364990bc23848fbbf6aadc; 2022-12-22T03:04:08Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2022-01-01; 17; 3; e0266369; 10.1371/journal.pone.0266369; An analysis framework for clustering algorithm selection with applications to spectroscopy.; Simon Crase; Suresh N Thennadil; https://doi.org/10.1371/journal.pone.0266369
title An analysis framework for clustering algorithm selection with applications to spectroscopy.
url https://doi.org/10.1371/journal.pone.0266369
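
The abstract describes a final step in which a short list of candidate clustering algorithms is compared using quantitative cluster validity indices. The sketch below illustrates that kind of comparison only; it is not the authors' implementation. It assumes the silhouette coefficient as the validity index (the abstract does not name the indices used), and the names `silhouette`, `rank_candidates`, `algorithm_a` and `algorithm_b`, along with the toy points, are hypothetical.

```python
import math
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient of a labelling (range -1 to 1, higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        # (the sum includes dist(p, p) == 0, so divide by len(own) - 1)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster
        b = min(
            mean(math.dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


def rank_candidates(points, candidate_labelings):
    """Score each shortlisted algorithm's labelling and sort best first."""
    scored = {name: silhouette(points, labels)
              for name, labels in candidate_labelings.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Labellings as they might be returned by two shortlisted algorithms.
candidates = {
    "algorithm_a": [0, 0, 0, 1, 1, 1],  # matches the visible grouping
    "algorithm_b": [0, 1, 0, 1, 0, 1],  # mixes the two groups
}

ranking = rank_candidates(points, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the labellings would come from running each shortlisted algorithm on the real dataset, and several validity indices would typically be compared rather than relying on one.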