Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm

Microarray technology has developed rapidly in recent years, producing large volumes of ultra-high-dimensional gene expression data. However, because the number of dimensions is very large relative to the number of samples, screening important genes from such data is very challenging...

Bibliographic Details
Main Authors: Xiwen Qin, Shuang Zhang, Dongmei Yin, Dongxue Chen, Xiaogang Dong
Format: Article
Language: English
Published: AIMS Press 2022-09-01
Series: Mathematical Biosciences and Engineering
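Subjects: high-dimensional data; feature selection; swarm intelligence optimization algorithm; gene expression data; cancer classification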
Online Access: https://www.aimspress.com/article/doi/10.3934/mbe.2022641?viewType=HTML
author Xiwen Qin
Shuang Zhang
Dongmei Yin
Dongxue Chen
Xiaogang Dong
collection DOAJ
description Microarray technology has developed rapidly in recent years, producing large volumes of ultra-high-dimensional gene expression data. However, because the number of dimensions is very large relative to the number of samples, screening important genes from such data is very challenging. For small-sample, high-dimensional biomedical data, this paper proposes a two-stage feature selection framework that combines wrapper, embedded and filter methods to avoid the curse of dimensionality. The first stage uses weighted gene co-expression network analysis (WGCNA), random forest and minimal redundancy maximal relevance (mRMR) for feature selection. In the second stage, a new gene selection method based on an improved binary Salp Swarm Algorithm is proposed, which combines machine learning methods to adaptively select feature subsets suited to the classification algorithms. Finally, classification accuracy is evaluated using six methods: LightGBM, RF, SVM, XGBoost, MLP and KNN. To verify the performance of the framework and the effectiveness of the proposed algorithm, the number of genes selected and the classification accuracy were compared with those of five other intelligent optimization algorithms. The results show that the proposed framework achieves accuracy equal to or higher than that of other advanced intelligent algorithms on 10 datasets, and exceeds 97.6% accuracy on all 10. This shows that the proposed method can address the feature selection problem for high-dimensional data; the framework is not restricted to particular data sets and can be applied to other fields involving feature selection.
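The two-stage idea in the description can be illustrated with a minimal sketch. It is not the authors' method, and it rests on several assumptions: stage 1 uses absolute Pearson correlation as a simple stand-in for the WGCNA, random forest and mRMR filters; stage 2 uses the plain Salp Swarm update on [0, 1] with a 0.5 threshold rather than the paper's improved binary variant; and the wrapper fitness is leave-one-out 1-nearest-neighbour accuracy. All function names, parameters and the toy data are hypothetical.

```python
import math
import random

def pearson_abs(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def stage1_filter(X, y, keep):
    """Stage 1 stand-in (assumption): keep the features most correlated with y."""
    scores = [pearson_abs([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]

def loo_1nn_accuracy(X, y, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `cols`."""
    if not cols:
        return 0.0
    hits = 0
    for i, xi in enumerate(X):
        best = min((k for k in range(len(X)) if k != i),
                   key=lambda k: sum((xi[c] - X[k][c]) ** 2 for c in cols))
        hits += y[best] == y[i]
    return hits / len(X)

def to_mask(position):
    """Binarise a continuous position by thresholding at 0.5 (assumption)."""
    return [v > 0.5 for v in position]

def fitness(position, X, y, cand, alpha=0.99):
    """Higher is better: accuracy, with a small reward for fewer features."""
    cols = [c for c, m in zip(cand, to_mask(position)) if m]
    acc = loo_1nn_accuracy(X, y, cols)
    return alpha * acc + (1 - alpha) * (1 - len(cols) / len(cand))

def binary_ssa(X, y, cand, n_salps=10, n_iter=20, seed=0):
    """Stage 2: plain Salp Swarm search over feature masks of `cand`."""
    rng = random.Random(seed)
    d = len(cand)
    pos = [[rng.random() for _ in range(d)] for _ in range(n_salps)]
    best_pos = max(pos, key=lambda p: fitness(p, X, y, cand))[:]
    best_fit = fitness(best_pos, X, y, cand)
    for t in range(1, n_iter + 1):
        c1 = 2 * math.exp(-((4 * t / n_iter) ** 2))  # shrinks over time
        for i in range(n_salps):
            for j in range(d):
                if i == 0:  # the leader explores around the best solution
                    step = c1 * rng.random()  # bounds are lb=0, ub=1
                    sign = 1 if rng.random() >= 0.5 else -1
                    v = best_pos[j] + sign * step
                else:  # followers move toward the salp ahead of them
                    v = (pos[i][j] + pos[i - 1][j]) / 2
                pos[i][j] = min(1.0, max(0.0, v))
            f = fitness(pos[i], X, y, cand)
            if f > best_fit:
                best_fit, best_pos = f, pos[i][:]
    return [c for c, m in zip(cand, to_mask(best_pos)) if m], best_fit

def make_toy_data(n=40, p=30, seed=1):
    """Synthetic two-class data where only features 0 and 1 are informative."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    for row, label in zip(X, y):
        row[0] += 4 * label
        row[1] -= 4 * label
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    candidates = stage1_filter(X, y, keep=8)
    selected, score = binary_ssa(X, y, candidates)
    print("stage 1 candidates:", candidates)
    print("stage 2 selection:", selected, "fitness:", round(score, 3))
```

The filter stage shrinks the search space so that the wrapper stage, which is far more expensive per evaluation, only searches over a short candidate list. The `alpha` weight in `fitness` is an illustrative choice for trading accuracy against subset size.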
first_indexed 2024-04-12T20:21:01Z
format Article
id doaj.art-fbd22ff5f55d40cab86ef4d59fa1dd5a
institution Directory of Open Access Journals
issn 1551-0018
language English
last_indexed 2024-04-12T20:21:01Z
publishDate 2022-09-01
publisher AIMS Press
record_format Article
series Mathematical Biosciences and Engineering
spelling doaj.art-fbd22ff5f55d40cab86ef4d59fa1dd5a
2022-12-22T03:17:59Z
eng
AIMS Press
Mathematical Biosciences and Engineering
1551-0018
2022-09-01
19(12): 13747-13781
10.3934/mbe.2022641
Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm
Xiwen Qin, Shuang Zhang, Dongmei Yin, Dongxue Chen, Xiaogang Dong (all: School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China)
title Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm
topic high-dimensional data
feature selection
swarm intelligence optimization algorithm
gene expression data
cancer classification
url https://www.aimspress.com/article/doi/10.3934/mbe.2022641?viewType=HTML