A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features

Clustering is a challenging problem in machine learning in which one attempts to group $N$ objects into $K_{0}$ ...

Bibliographic Details
Main Authors:	Shahina Rahman, Valen E. Johnson, Suhasini Subba Rao
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Clustering, gram matrix, high-dimensional features, hyperparameter-free
Online Access:	https://ieeexplore.ieee.org/document/9934902/

_version_	1798018497398177792
author	Shahina Rahman, Valen E. Johnson, Suhasini Subba Rao
author_sort	Shahina Rahman
collection	DOAJ
description	Clustering is a challenging problem in machine learning in which one attempts to group $N$ objects into $K_{0}$ groups based on $P$ features measured on each object. In this article, we examine the case where $N \ll P$ and $K_{0}$ is not known. Clustering in such high dimensional, small sample size settings has numerous applications in biology, medicine, the social sciences, clinical trials, and other scientific and experimental fields. Whereas most existing clustering algorithms either require the number of clusters to be known a priori or are sensitive to the choice of tuning parameters, our method does not require the prior specification of $K_{0}$ or any tuning parameters. This represents an important advantage for our method because training data are not available in the applications we consider (i.e., in unsupervised learning problems). Without training data, estimating $K_{0}$ and other hyperparameters (and thus applying alternative clustering algorithms) can be difficult and lead to inaccurate results. Our method is based on a simple transformation of the Gram matrix and application of the strong law of large numbers to the transformed matrix. If the correlation between features decays as the number of features grows, we show that the transformed feature vectors concentrate tightly around their respective cluster expectations in a low-dimensional space. This result simplifies the detection and visualization of the unknown cluster configuration. We illustrate the algorithm by applying it to 32 benchmarked microarray datasets, each containing thousands of genomic features measured on a relatively small number of tissue samples. Compared to 21 other commonly used clustering methods, we find that the proposed algorithm is faster and twice as accurate in determining the “best” cluster configuration.
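The abstract does not spell out the transformation applied to the Gram matrix, so the following is only a minimal sketch of the underlying law-of-large-numbers idea, not the authors' algorithm. It assumes fully independent Gaussian features (the simplest case of decaying correlation), three clusters, and randomly drawn cluster means; the function names `simulate` and `normalized_gram` and all parameter values are hypothetical. With $N \ll P$, the entries of the Gram matrix averaged over the $P$ features settle near values that depend only on the cluster memberships of the two samples involved.

```python
import numpy as np


def simulate(n_per_cluster=10, n_clusters=3, P=5000, seed=0):
    """Draw N = n_per_cluster * n_clusters samples with P independent features."""
    rng = np.random.default_rng(seed)
    # Hypothetical cluster means: one independent N(0, 1) draw per cluster and feature.
    means = rng.normal(0.0, 1.0, size=(n_clusters, P))
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    # Each sample is its cluster mean plus independent unit-variance noise.
    X = means[labels] + rng.normal(0.0, 1.0, size=(labels.size, P))
    return X, labels


def normalized_gram(X):
    """N x N matrix of inner products between samples, averaged over the P features."""
    return X @ X.T / X.shape[1]


X, labels = simulate()
G = normalized_gram(X)

# Boolean masks: pairs of samples in the same cluster, and off-diagonal positions.
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(labels.size, dtype=bool)

print("mean within-cluster entry: ", G[same & off_diag].mean())
print("mean between-cluster entry:", G[~same].mean())
```

Under these assumptions the within-cluster entries should sit close to 1 and the between-cluster entries close to 0, so the block structure of the matrix reveals the clusters without specifying $K_{0}$ in advance. Real data with correlated features would need the conditions and the transformation described in the article itself.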
first_indexed	2024-04-11T16:25:01Z
format	Article
id	doaj.art-ec1c1b68053b46d19f306200afaad963
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-11T16:25:01Z
publishDate	2022-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-ec1c1b68053b46d19f306200afaad963; 2022-12-22T04:14:12Z; eng; IEEE; IEEE Access; 2169-3536; 2022-01-01; 10; 116844; 116857; 10.1109/ACCESS.2022.3218800; 9934902; A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features; Shahina Rahman (https://orcid.org/0000-0002-8161-9993), Valen E. Johnson (https://orcid.org/0000-0002-8659-4772), Suhasini Subba Rao (https://orcid.org/0000-0002-6563-2389); Department of Statistics, Texas A & M University, College Station, TX, USA; https://ieeexplore.ieee.org/document/9934902/; Clustering, gram matrix, high-dimensional features, hyperparameter-free
title	A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features
topic	Clustering, gram matrix, high-dimensional features, hyperparameter-free
url	https://ieeexplore.ieee.org/document/9934902/

Similar Items