CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data
(Aim) Gene expression data is typically high dimensional with a limited number of samples and contains many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure whil...
Main Authors: | Zhaozhao Xu, Fangyuan Yang, Hong Wang, Junding Sun, Hengde Zhu, Shuihua Wang, Yudong Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2023-10-01 |
Series: | Journal of King Saud University: Computer and Information Sciences |
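Subjects: | Gene expression data, Clustering-guided, Unsupervised feature selection, k-means, Spectral clustering |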
Online Access: | http://www.sciencedirect.com/science/article/pii/S1319157823002859 |
_version_ | 1827763960309350400 |
---|---|
author | Zhaozhao Xu, Fangyuan Yang, Hong Wang, Junding Sun, Hengde Zhu, Shuihua Wang, Yudong Zhang |
author_facet | Zhaozhao Xu, Fangyuan Yang, Hong Wang, Junding Sun, Hengde Zhu, Shuihua Wang, Yudong Zhang |
author_sort | Zhaozhao Xu |
collection | DOAJ |
description | (Aim) Gene expression data is typically high dimensional with a limited number of samples and contains many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms focus primarily on how well features preserve the data structure and do not account for redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to retain by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly better than those of the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly better than those of the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms. |
first_indexed | 2024-03-11T10:58:05Z |
format | Article |
id | doaj.art-46a3e850623341f5a21d822188256445 |
institution | Directory Open Access Journal |
issn | 1319-1578 |
language | English |
last_indexed | 2024-03-11T10:58:05Z |
publishDate | 2023-10-01 |
publisher | Elsevier |
record_format | Article |
series | Journal of King Saud University: Computer and Information Sciences |
spelling | doaj.art-46a3e850623341f5a21d822188256445 2023-11-13T04:08:54Z eng Elsevier Journal of King Saud University: Computer and Information Sciences 1319-1578 2023-10-01 35 9 101731 CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data. Zhaozhao Xu [0], Fangyuan Yang [1], Hong Wang [2], Junding Sun [3], Hengde Zhu [4], Shuihua Wang [5], Yudong Zhang [6]. [0] School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China; [1] Department of Gynecologic Oncology, The First Affiliated Hospital of Henan Polytechnic University, Jiaozuo, Henan 454000, China; [2] Department of Gynecologic Oncology, The First Affiliated Hospital of Henan Polytechnic University, Jiaozuo, Henan 454000, China; [3] School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China (corresponding author); [4] School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK; [5] School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK; [6] School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK (corresponding author). (Aim) Gene expression data is typically high dimensional with a limited number of samples and contains many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms focus primarily on how well features preserve the data structure and do not account for redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to retain by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly better than those of the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly better than those of the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms. http://www.sciencedirect.com/science/article/pii/S1319157823002859 Gene expression data, Clustering-guided, Unsupervised feature selection, k-means, Spectral clustering |
spellingShingle | Zhaozhao Xu Fangyuan Yang Hong Wang Junding Sun Hengde Zhu Shuihua Wang Yudong Zhang CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data Journal of King Saud University: Computer and Information Sciences Gene expression data Clustering-guided Unsupervised feature selection k-means Spectral clustering |
title | CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data |
title_full | CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data |
title_fullStr | CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data |
title_full_unstemmed | CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data |
title_short | CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data |
title_sort | cgufs a clustering guided unsupervised feature selection algorithm for gene expression data |
topic | Gene expression data, Clustering-guided, Unsupervised feature selection, k-means, Spectral clustering |
url | http://www.sciencedirect.com/science/article/pii/S1319157823002859 |
work_keys_str_mv | AT zhaozhaoxu cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT fangyuanyang cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT hongwang cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT jundingsun cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT hengdezhu cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT shuihuawang cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata AT yudongzhang cgufsaclusteringguidedunsupervisedfeatureselectionalgorithmforgeneexpressiondata |
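
The description above mentions a feature grouping strategy that groups highly redundant features, but this record does not say how redundancy is measured. The following is a minimal sketch of one possible way to group redundant features, not the CGUFS procedure itself. It assumes absolute Pearson correlation as the redundancy measure and a threshold of 0.9; both are assumptions, and the function names `pearson` and `group_redundant_features` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_redundant_features(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose first member it
    correlates with at |r| >= threshold; otherwise it starts a new group.
    The measure and the threshold are assumptions, not taken from the paper."""
    groups = []  # each group is a list of feature (column) indices
    for j, col in enumerate(columns):
        for group in groups:
            if abs(pearson(columns[group[0]], col)) >= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    return groups


# Toy expression matrix: 4 features (rows here) measured on 5 samples.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],       # roughly 2x feature 0
    [5.0, 1.0, 4.0, 2.0, 3.0],       # weakly correlated with feature 0
    [-1.0, -2.0, -3.0, -4.0, -5.0],  # exact negation of feature 0
]
groups = group_redundant_features(features)
print(groups)
```

On this toy data, features 0, 1, and 3 land in one group and feature 2 forms its own. A selection step would then keep only some features from each group; how CGUFS chooses them (its adaptive filtering strategy) is not detailed in this record.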