Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia
High-risk pediatric B-ALL patients experience 5-year negative event rates up to 25%. Although some biomarkers of relapse are utilized in the clinic, their ability to predict outcomes in high-risk patients is limited. Here, we propose a random survival forest (RSF) machine learning model utilizing in...
Main Authors: | Zachary S. Bohannan, Frederick Coffman, Antonina Mitrofanova |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2022-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | Machine learning, Random survival forest, Genomics, Clinical oncology, Bioinformatics |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037022000058 |
_version_ | 1797978310696763392 |
---|---|
author | Zachary S. Bohannan Frederick Coffman Antonina Mitrofanova |
author_facet | Zachary S. Bohannan Frederick Coffman Antonina Mitrofanova |
author_sort | Zachary S. Bohannan |
collection | DOAJ |
description | High-risk pediatric B-ALL patients experience 5-year negative event rates up to 25%. Although some biomarkers of relapse are utilized in the clinic, their ability to predict outcomes in high-risk patients is limited. Here, we propose a random survival forest (RSF) machine learning model utilizing interpretable genomic inputs to predict relapse/death in high-risk pediatric B-ALL patients. We utilized whole exome sequencing profiles from 156 patients in the TARGET-ALL study (with samples collected at presentation) further stratified into training and test cohorts (109 and 47 patients, respectively). To avoid overfitting and facilitate the interpretation of machine learning results, input genomic variables were engineered using a stepwise approach involving univariable Cox models to select variables directly associated with outcomes, genomic coordinate-based analysis to select mutational hotspots, and correlation analysis to eliminate feature co-linearity. Model training identified 7 genomic regions most predictive of relapse/death-free survival. The test cohort error rate was 12.47%, and a polygenic score based on the sum of the top 7 variables effectively stratified patients into two groups, with significant differences in time to relapse/death (log-rank P = 0.001, hazard ratio = 5.41). Our model outperformed other EFS modeling approaches including an RSF using gold-standard prognostic variables (error rate = 24.35%). Validation in 174 standard-risk patients and 3 patients who failed to respond to induction therapy confirmed that our RSF model and polygenic score were specific to high-risk disease. We propose that our feature selection/engineering approach can increase the clinical interpretability of RSF, and our polygenic score could be utilized to enhance clinical decision-making in high-risk B-ALL. |
first_indexed | 2024-04-11T05:20:35Z |
format | Article |
id | doaj.art-ba8b3b6cf6584f5f8f43d495544b58f2 |
institution | Directory of Open Access Journals |
issn | 2001-0370 |
language | English |
last_indexed | 2024-04-11T05:20:35Z |
publishDate | 2022-01-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-ba8b3b6cf6584f5f8f43d495544b58f2; 2022-12-24T04:51:09Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2022-01-01; 20; 583; 597; Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia; Zachary S. Bohannan (0); Frederick Coffman (1); Antonina Mitrofanova (2); Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Suite 120, Newark, NJ 07107-1709, United States; Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Suite 120, Newark, NJ 07107-1709, United States; Corresponding author at: Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Room 923B, Newark, NJ, 07107, United States.; Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Suite 120, Newark, NJ 07107-1709, United States; High-risk pediatric B-ALL patients experience 5-year negative event rates up to 25%. Although some biomarkers of relapse are utilized in the clinic, their ability to predict outcomes in high-risk patients is limited. Here, we propose a random survival forest (RSF) machine learning model utilizing interpretable genomic inputs to predict relapse/death in high-risk pediatric B-ALL patients. We utilized whole exome sequencing profiles from 156 patients in the TARGET-ALL study (with samples collected at presentation) further stratified into training and test cohorts (109 and 47 patients, respectively). To avoid overfitting and facilitate the interpretation of machine learning results, input genomic variables were engineered using a stepwise approach involving univariable Cox models to select variables directly associated with outcomes, genomic coordinate-based analysis to select mutational hotspots, and correlation analysis to eliminate feature co-linearity. Model training identified 7 genomic regions most predictive of relapse/death-free survival. The test cohort error rate was 12.47%, and a polygenic score based on the sum of the top 7 variables effectively stratified patients into two groups, with significant differences in time to relapse/death (log-rank P = 0.001, hazard ratio = 5.41). Our model outperformed other EFS modeling approaches including an RSF using gold-standard prognostic variables (error rate = 24.35%). Validation in 174 standard-risk patients and 3 patients who failed to respond to induction therapy confirmed that our RSF model and polygenic score were specific to high-risk disease. We propose that our feature selection/engineering approach can increase the clinical interpretability of RSF, and our polygenic score could be utilized to enhance clinical decision-making in high-risk B-ALL.; http://www.sciencedirect.com/science/article/pii/S2001037022000058; Machine learning; Random survival forest; Genomics; Clinical oncology; Bioinformatics |
spellingShingle | Zachary S. Bohannan Frederick Coffman Antonina Mitrofanova Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia Computational and Structural Biotechnology Journal Machine learning Random survival forest Genomics Clinical oncology Bioinformatics |
title | Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia |
title_sort | random survival forest model identifies novel biomarkers of event free survival in high risk pediatric acute lymphoblastic leukemia |
topic | Machine learning Random survival forest Genomics Clinical oncology Bioinformatics |
url | http://www.sciencedirect.com/science/article/pii/S2001037022000058 |
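
The description above mentions two steps that are easy to illustrate in code: correlation analysis to eliminate collinear features, and a polygenic score formed as the sum of the selected variables and used to split patients into two groups. The following is a minimal sketch of those two ideas only, not the authors' implementation. The function names, the correlation threshold `max_abs_r`, the score `cutoff`, and the toy data are all hypothetical; the record does not state which thresholds or encodings were used, and the Cox screening, hotspot analysis, and RSF training steps are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(vx * vy)


def drop_collinear(features, max_abs_r=0.8):
    """Keep features in insertion order, dropping any whose absolute
    correlation with an already-kept feature exceeds max_abs_r.
    The 0.8 default is an assumed value, not taken from the article."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= max_abs_r
               for k in kept):
            kept.append(name)
    return kept


def polygenic_score(features, selected):
    """Per-patient sum of the selected variables."""
    n_patients = len(next(iter(features.values())))
    return [sum(features[name][i] for name in selected)
            for i in range(n_patients)]


def stratify(scores, cutoff):
    """Split patients into two groups at a hypothetical score cutoff."""
    return ["high" if s >= cutoff else "low" for s in scores]


# Toy data (made up for illustration): 1 = region mutated, 0 = not,
# one entry per patient. region_b duplicates region_a on purpose.
toy = {
    "region_a": [1, 0, 1, 0, 1, 0],
    "region_b": [1, 0, 1, 0, 1, 0],
    "region_c": [0, 0, 1, 1, 0, 1],
}

selected = drop_collinear(toy)
scores = polygenic_score(toy, selected)
groups = stratify(scores, cutoff=2)
print(selected, scores, groups)
```

In a real analysis the two resulting groups would then be compared with a log-rank test on time to relapse/death, as the description reports; that comparison needs follow-up times and event indicators, which this toy example does not include.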