CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data

(Aim) Gene expression data are typically high-dimensional, contain a limited number of samples, and include many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure whil...


Bibliographic Details
Main Authors: Zhaozhao Xu, Fangyuan Yang, Hong Wang, Junding Sun, Hengde Zhu, Shuihua Wang, Yudong Zhang
Format: Article
Language: English
Published: Elsevier 2023-10-01
Series: Journal of King Saud University: Computer and Information Sciences
Subjects: Gene expression data; Clustering-guided; Unsupervised feature selection; k-means; Spectral clustering
Online Access: http://www.sciencedirect.com/science/article/pii/S1319157823002859
author Zhaozhao Xu
Fangyuan Yang
Hong Wang
Junding Sun
Hengde Zhu
Shuihua Wang
Yudong Zhang
author_sort Zhaozhao Xu
collection DOAJ
description (Aim) Gene expression data are typically high-dimensional, contain a limited number of samples, and include many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms focus primarily on how well features preserve the data structure and do not take into account the redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to retain by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly better than those of the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly better than those of the existing algorithms. In addition, the results of statistical experiments show significant differences between the CGUFS algorithm and the existing algorithms.
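The two evaluation measures named in the abstract, ACC and MCC, can be computed directly from true and predicted class labels. The sketch below is illustrative only and is not taken from the article: the function names `accuracy` and `mcc` are hypothetical, and the multiclass form of the MCC is an assumption, since the abstract does not say whether the binary or the multiclass definition was used (the two agree on two-class problems).

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, multiclass generalization.

    Assumption: the multiclass form is used here; it reduces to the
    familiar binary MCC when there are exactly two classes.
    """
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_counts = Counter(y_true)                        # occurrences of each true class
    p_counts = Counter(y_pred)                        # occurrences of each predicted class
    labels = set(t_counts) | set(p_counts)

    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    spread_true = s * s - sum(v * v for v in t_counts.values())
    spread_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(spread_true * spread_pred)

    # By convention, return 0.0 when the coefficient is undefined
    # (for example, when every prediction falls in a single class).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, 1]
    print(f"ACC = {accuracy(truth, preds):.4f}")
    print(f"MCC = {mcc(truth, preds):.4f}")
```

Unlike ACC, the MCC accounts for how labels are distributed across all classes, which is one reason the two are often reported together on datasets with few samples and uneven class sizes.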
first_indexed 2024-03-11T10:58:05Z
format Article
id doaj.art-46a3e850623341f5a21d822188256445
institution Directory Open Access Journal
issn 1319-1578
language English
last_indexed 2024-03-11T10:58:05Z
publishDate 2023-10-01
publisher Elsevier
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj.art-46a3e850623341f5a21d822188256445
2023-11-13T04:08:54Z
eng
Elsevier
Journal of King Saud University: Computer and Information Sciences
1319-1578
2023-10-01
35 9 101731
CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data
Zhaozhao Xu (0)
Fangyuan Yang (1)
Hong Wang (2)
Junding Sun (3)
Hengde Zhu (4)
Shuihua Wang (5)
Yudong Zhang (6)
(0) School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China
(1) Department of Gynecologic Oncology, The First Affiliated Hospital of Henan Polytechnic University, Jiaozuo, Henan 454000, China
(2) Department of Gynecologic Oncology, The First Affiliated Hospital of Henan Polytechnic University, Jiaozuo, Henan 454000, China
(3) School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China; Corresponding authors.
(4) School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
(5) School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
(6) School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK; Corresponding authors.
http://www.sciencedirect.com/science/article/pii/S1319157823002859
Gene expression data
Clustering-guided
Unsupervised feature selection
k-means
Spectral clustering
title CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data
title_sort cgufs a clustering guided unsupervised feature selection algorithm for gene expression data
topic Gene expression data
Clustering-guided
Unsupervised feature selection
k-means
Spectral clustering
url http://www.sciencedirect.com/science/article/pii/S1319157823002859