Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival

Lung cancer is a global health challenge, hindered by delayed diagnosis and the disease’s complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks...

Bibliographic Details
Main Authors: Roman Jaksik, Kamila Szumała, Khanh Ngoc Dinh, Jarosław Śmieja
Format: Article
Language: English
Published: MDPI AG 2024-03-01
Series: International Journal of Molecular Sciences
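Subjects: multiomics data; feature selection; feature extraction; machine learning; next-generation sequencing; lung cancer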
Online Access: https://www.mdpi.com/1422-0067/25/7/3661
author Roman Jaksik
Kamila Szumała
Khanh Ngoc Dinh
Jarosław Śmieja
collection DOAJ
description Lung cancer is a global health challenge, hindered by delayed diagnosis and the disease’s complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks to enhance the accuracy of survival prediction by proposing new feature extraction techniques combined with unbiased feature selection. Two lung adenocarcinoma multi-omics datasets, originating from the TCGA and CPTAC-3 projects, were employed for this purpose, emphasizing gene expression, methylation, and mutations as the most relevant data sources that provide features for the survival prediction models. Additionally, gene set aggregation was shown to be the most effective feature extraction method for mutation and copy number variation data. Using the TCGA dataset, we identified 32 molecular features that allowed the construction of a 2-year survival prediction model with an AUC of 0.839. The selected features were additionally tested on an independent CPTAC-3 dataset, achieving an AUC of 0.815 in nested cross-validation, which confirmed the robustness of the identified features.
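As a rough illustration of the gene set aggregation mentioned in the description, the sketch below collapses per-gene mutation calls into one feature per gene set by counting how many genes of each set are mutated in a sample. The gene sets, sample identifiers, and the function name `aggregate_by_gene_set` are hypothetical, and a simple count is only one possible aggregation statistic; this record does not state which one the authors used.

```python
from typing import Dict, Set


def aggregate_by_gene_set(
    mutations: Dict[str, Set[str]],
    gene_sets: Dict[str, Set[str]],
) -> Dict[str, Dict[str, int]]:
    """For each sample, count the mutated genes that fall in each gene set."""
    features: Dict[str, Dict[str, int]] = {}
    for sample, mutated_genes in mutations.items():
        features[sample] = {
            set_name: len(mutated_genes & members)
            for set_name, members in gene_sets.items()
        }
    return features


# Hypothetical toy input: two gene sets and three samples (illustrative only).
gene_sets = {
    "RTK_RAS": {"KRAS", "EGFR", "BRAF"},
    "P53_PATHWAY": {"TP53", "MDM2"},
}
mutations = {
    "sample_A": {"KRAS", "TP53", "STK11"},
    "sample_B": {"EGFR", "BRAF"},
    "sample_C": set(),
}

features = aggregate_by_gene_set(mutations, gene_sets)
for sample, row in features.items():
    print(sample, row)
```

Aggregating in this way replaces many sparse per-gene columns with a few denser per-set columns, which is one plausible reason such features might suit mutation and copy number variation data.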
first_indexed 2024-04-24T10:43:53Z
format Article
id doaj.art-e8bac40c774440739c5a5c3e998f6aa0
institution Directory Open Access Journal
issn 1661-6596
1422-0067
language English
last_indexed 2024-04-24T10:43:53Z
publishDate 2024-03-01
publisher MDPI AG
record_format Article
series International Journal of Molecular Sciences
spelling doaj.art-e8bac40c774440739c5a5c3e998f6aa0
2024-04-12T13:19:23Z
eng
MDPI AG
International Journal of Molecular Sciences
1661-6596
1422-0067
2024-03-01
25
7
3661
10.3390/ijms25073661
Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival
Roman Jaksik (0)
Kamila Szumała (1)
Khanh Ngoc Dinh (2)
Jarosław Śmieja (3)
(0) Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland
(1) Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland
(2) Irving Institute for Cancer Dynamics and Department of Statistics, Columbia University, New York, NY 10027, USA
(3) Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland
https://www.mdpi.com/1422-0067/25/7/3661
multiomics data
feature selection
feature extraction
machine learning
next-generation sequencing
lung cancer
title Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival
topic multiomics data
feature selection
feature extraction
machine learning
next-generation sequencing
lung cancer
url https://www.mdpi.com/1422-0067/25/7/3661