Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival
Lung cancer is a global health challenge, hindered by delayed diagnosis and the disease’s complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks...
Main Authors: | Roman Jaksik, Kamila Szumała, Khanh Ngoc Dinh, Jarosław Śmieja |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-03-01 |
Series: | International Journal of Molecular Sciences |
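Subjects: | multiomics data; feature selection; feature extraction; machine learning; next-generation sequencing; lung cancer |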
Online Access: | https://www.mdpi.com/1422-0067/25/7/3661 |
_version_ | 1797212532297957376 |
---|---|
author | Roman Jaksik; Kamila Szumała; Khanh Ngoc Dinh; Jarosław Śmieja |
author_facet | Roman Jaksik; Kamila Szumała; Khanh Ngoc Dinh; Jarosław Śmieja |
author_sort | Roman Jaksik |
collection | DOAJ |
description | Lung cancer is a global health challenge, hindered by delayed diagnosis and the disease’s complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks to enhance the accuracy of survival prediction by proposing new feature extraction techniques combined with unbiased feature selection. Two lung adenocarcinoma multi-omics datasets, originating from the TCGA and CPTAC-3 projects, were employed for this purpose, emphasizing gene expression, methylation, and mutations as the most relevant data sources that provide features for the survival prediction models. Additionally, gene set aggregation was shown to be the most effective feature extraction method for mutation and copy number variation data. Using the TCGA dataset, we identified 32 molecular features that allowed the construction of a 2-year survival prediction model with an AUC of 0.839. The selected features were additionally tested on an independent CPTAC-3 dataset, achieving an AUC of 0.815 in nested cross-validation, which confirmed the robustness of the identified features. |
first_indexed | 2024-04-24T10:43:53Z |
format | Article |
id | doaj.art-e8bac40c774440739c5a5c3e998f6aa0 |
institution | Directory Open Access Journal |
issn | 1661-6596 1422-0067 |
language | English |
last_indexed | 2024-04-24T10:43:53Z |
publishDate | 2024-03-01 |
publisher | MDPI AG |
record_format | Article |
series | International Journal of Molecular Sciences |
spelling | doaj.art-e8bac40c774440739c5a5c3e998f6aa0; 2024-04-12T13:19:23Z; eng; MDPI AG; International Journal of Molecular Sciences; 1661-6596; 1422-0067; 2024-03-01; vol. 25, no. 7, 3661; 10.3390/ijms25073661; Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival; Roman Jaksik (Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland); Kamila Szumała (Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland); Khanh Ngoc Dinh (Irving Institute for Cancer Dynamics and Department of Statistics, Columbia University, New York, NY 10027, USA); Jarosław Śmieja (Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland); Lung cancer is a global health challenge, hindered by delayed diagnosis and the disease’s complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks to enhance the accuracy of survival prediction by proposing new feature extraction techniques combined with unbiased feature selection. Two lung adenocarcinoma multi-omics datasets, originating from the TCGA and CPTAC-3 projects, were employed for this purpose, emphasizing gene expression, methylation, and mutations as the most relevant data sources that provide features for the survival prediction models. Additionally, gene set aggregation was shown to be the most effective feature extraction method for mutation and copy number variation data. Using the TCGA dataset, we identified 32 molecular features that allowed the construction of a 2-year survival prediction model with an AUC of 0.839. The selected features were additionally tested on an independent CPTAC-3 dataset, achieving an AUC of 0.815 in nested cross-validation, which confirmed the robustness of the identified features.; https://www.mdpi.com/1422-0067/25/7/3661; multiomics data; feature selection; feature extraction; machine learning; next-generation sequencing; lung cancer |
spellingShingle | Roman Jaksik; Kamila Szumała; Khanh Ngoc Dinh; Jarosław Śmieja; Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival; International Journal of Molecular Sciences; multiomics data; feature selection; feature extraction; machine learning; next-generation sequencing; lung cancer |
title | Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival |
title_sort | multiomics based feature extraction and selection for the prediction of lung cancer survival |
topic | multiomics data; feature selection; feature extraction; machine learning; next-generation sequencing; lung cancer |
url | https://www.mdpi.com/1422-0067/25/7/3661 |
work_keys_str_mv | AT romanjaksik multiomicsbasedfeatureextractionandselectionforthepredictionoflungcancersurvival AT kamilaszumała multiomicsbasedfeatureextractionandselectionforthepredictionoflungcancersurvival AT khanhngocdinh multiomicsbasedfeatureextractionandselectionforthepredictionoflungcancersurvival AT jarosławsmieja multiomicsbasedfeatureextractionandselectionforthepredictionoflungcancersurvival |
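
Note on the description field: the abstract names gene set aggregation as the most effective feature extraction method for mutation and copy number variation data, but this record does not say how the aggregation is computed. The following is a minimal sketch of one plausible reading, assuming that each gene set becomes one feature whose value is the number of its member genes mutated in a given sample. The function name `aggregate_by_gene_set`, the set names, and the toy patient data are all hypothetical and are not taken from the article.

```python
from typing import Dict, Set


def aggregate_by_gene_set(
    mutated_genes: Dict[str, Set[str]],
    gene_sets: Dict[str, Set[str]],
) -> Dict[str, Dict[str, int]]:
    """Collapse per-gene mutation calls into one count per gene set.

    Assumption (not from the article): the aggregated feature is the
    number of genes in the set that carry a mutation in the sample.
    """
    features: Dict[str, Dict[str, int]] = {}
    for sample, genes in mutated_genes.items():
        features[sample] = {
            set_name: len(genes & members)
            for set_name, members in gene_sets.items()
        }
    return features


# Hypothetical gene sets and samples, for illustration only.
gene_sets = {
    "SET_A": {"TP53", "KRAS", "EGFR"},
    "SET_B": {"STK11", "KEAP1"},
}
mutated_genes = {
    "patient_1": {"TP53", "KRAS", "STK11"},
    "patient_2": {"KEAP1"},
    "patient_3": set(),
}

features = aggregate_by_gene_set(mutated_genes, gene_sets)
for sample, row in features.items():
    print(sample, row)
```

Aggregating this way replaces thousands of sparse per-gene indicators with a small number of denser features, which is one reason such a step might help on mutation data. Other aggregation rules (a binary "any gene mutated" flag, or a fraction of the set) would fit the same interface.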