A robust ensemble feature selection approach to prioritize genes associated with survival outcome in high-dimensional gene expression data

Exploring features associated with the clinical outcome of interest is a rapidly advancing area of research. However, with contemporary sequencing technologies capable of identifying thousands of genes per sample, it is challenging to construct efficient prediction models that balance acc...

Bibliographic Details
Main Authors: Phi Le, Xingyue Gong, Leah Ung, Hai Yang, Bridget P. Keenan, Li Zhang, Tao He
Format: Article
Language: English
Published: Frontiers Media S.A. 2024-03-01
Series: Frontiers in Systems Biology
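Subjects: colorectal cancer; ensemble feature selection; high-dimensional data; time-to-event outcome; pseudo variables; group lasso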
Online Access: https://www.frontiersin.org/articles/10.3389/fsysb.2024.1355595/full
author Phi Le
Xingyue Gong
Leah Ung
Hai Yang
Bridget P. Keenan
Li Zhang
Tao He
collection DOAJ
description Exploring features associated with the clinical outcome of interest is a rapidly advancing area of research. However, with contemporary sequencing technologies capable of identifying thousands of genes per sample, it is challenging to construct efficient prediction models that balance accuracy and resource utilization. To address this challenge, researchers have developed feature selection methods to enhance performance, reduce overfitting, and ensure resource efficiency. However, applying feature selection models to survival analysis, particularly in clinical datasets characterized by substantial censoring and limited sample sizes, introduces unique challenges. We propose a robust ensemble feature selection approach integrated with group Lasso to identify compelling features and evaluate its performance in predicting survival outcomes. In extensive simulations, our approach consistently outperforms established models across various criteria, demonstrating low false discovery rates, high sensitivity, and high stability. Furthermore, we applied the approach to a colorectal cancer dataset from The Cancer Genome Atlas, showcasing its effectiveness by generating a composite score based on the selected genes to correctly distinguish different patient subtypes. In summary, our proposed approach excels in selecting impactful features from high-dimensional data, yielding better outcomes than contemporary state-of-the-art models.
first_indexed 2024-04-24T21:42:50Z
format Article
id doaj.art-a845c2977dcf43798780ba07759f8b5f
institution Directory Open Access Journal
issn 2674-0702
language English
last_indexed 2024-04-24T21:42:50Z
publishDate 2024-03-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Systems Biology
spelling doaj.art-a845c2977dcf43798780ba07759f8b5f
2024-03-21T05:14:00Z
eng
Frontiers Media S.A.
Frontiers in Systems Biology
2674-0702
2024-03-01
4
10.3389/fsysb.2024.1355595
1355595
A robust ensemble feature selection approach to prioritize genes associated with survival outcome in high-dimensional gene expression data
Phi Le (0): Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
Xingyue Gong (1): Department of Physiological Nursing, School of Nursing, University of California, San Francisco, San Francisco, CA, United States
Leah Ung (2): Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
Hai Yang (3): Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
Bridget P. Keenan (4): Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
Bridget P. Keenan (5): Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, United States
Li Zhang (6): Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
Li Zhang (7): Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, United States
Li Zhang (8): Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States
Tao He (9): Department of Mathematics, San Francisco State University, San Francisco, CA, United States
https://www.frontiersin.org/articles/10.3389/fsysb.2024.1355595/full
colorectal cancer
ensemble feature selection
high-dimensional data
time-to-event outcome
pseudo variables
group lasso
title A robust ensemble feature selection approach to prioritize genes associated with survival outcome in high-dimensional gene expression data
topic colorectal cancer
ensemble feature selection
high-dimensional data
time-to-event outcome
pseudo variables
group lasso
url https://www.frontiersin.org/articles/10.3389/fsysb.2024.1355595/full