Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia

High-risk pediatric B-ALL patients experience 5-year negative event rates up to 25%. Although some biomarkers of relapse are utilized in the clinic, their ability to predict outcomes in high-risk patients is limited. Here, we propose a random survival forest (RSF) machine learning model utilizing in...

Bibliographic Details
Main Authors: Zachary S. Bohannan, Frederick Coffman, Antonina Mitrofanova
Format: Article
Language: English
Published: Elsevier 2022-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Machine learning; Random survival forest; Genomics; Clinical oncology; Bioinformatics
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037022000058
author Zachary S. Bohannan
Frederick Coffman
Antonina Mitrofanova
collection DOAJ
description High-risk pediatric B-ALL patients experience 5-year negative event rates up to 25%. Although some biomarkers of relapse are utilized in the clinic, their ability to predict outcomes in high-risk patients is limited. Here, we propose a random survival forest (RSF) machine learning model utilizing interpretable genomic inputs to predict relapse/death in high-risk pediatric B-ALL patients. We utilized whole exome sequencing profiles from 156 patients in the TARGET-ALL study (with samples collected at presentation) further stratified into training and test cohorts (109 and 47 patients, respectively). To avoid overfitting and facilitate the interpretation of machine learning results, input genomic variables were engineered using a stepwise approach involving univariable Cox models to select variables directly associated with outcomes, genomic coordinate-based analysis to select mutational hotspots, and correlation analysis to eliminate feature collinearity. Model training identified 7 genomic regions most predictive of relapse/death-free survival. The test cohort error rate was 12.47%, and a polygenic score based on the sum of the top 7 variables effectively stratified patients into two groups, with significant differences in time to relapse/death (log-rank P = 0.001, hazard ratio = 5.41). Our model outperformed other EFS modeling approaches, including an RSF using gold-standard prognostic variables (error rate = 24.35%). Validation in 174 standard-risk patients and 3 patients who failed to respond to induction therapy confirmed that our RSF model and polygenic score were specific to high-risk disease. We propose that our feature selection/engineering approach can increase the clinical interpretability of RSF, and that our polygenic score could be utilized to enhance clinical decision-making in high-risk B-ALL.
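The description mentions a polygenic score formed as the sum of the top 7 variables and a test-cohort error rate. The sketch below illustrates both ideas on invented toy data. It assumes that each variable is a binary indicator (1 if the patient carries a variant in that region), that patients are split at a hypothetical cutoff of 1, and that the error rate is one minus Harrell's concordance index, which is how random survival forest software commonly reports it; the record does not confirm any of these details, and all names and values here are illustrative only.

```python
from itertools import combinations


def polygenic_score(flags):
    """Sum of binary region indicators (assumed encoding: 1 = variant present)."""
    return sum(flags)


def stratify(score, cutoff=1):
    """Label a patient 'high' when the score reaches the hypothetical cutoff."""
    return "high" if score >= cutoff else "low"


def concordance_error(times, events, risks):
    """Return 1 - Harrell's C-index.

    A pair is comparable when the patient with the shorter follow-up had an
    event. Pairs with tied times are skipped in this simplified version.
    Tied risk values count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # a is the patient with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return 1.0 - concordant / comparable


# Invented toy cohort: (7 region flags, months of follow-up, event observed?)
patients = [
    ([1, 1, 0, 1, 0, 0, 0], 5, True),
    ([0, 1, 0, 0, 0, 0, 1], 8, True),
    ([0, 0, 0, 0, 0, 0, 0], 30, False),
    ([1, 0, 0, 0, 0, 0, 0], 12, True),
    ([0, 0, 0, 0, 0, 0, 0], 40, False),
    ([0, 0, 0, 0, 0, 0, 0], 10, True),
]

scores = [polygenic_score(flags) for flags, _, _ in patients]
groups = [stratify(s) for s in scores]
times = [t for _, t, _ in patients]
events = [e for _, _, e in patients]

# The polygenic score itself is used as the risk value for ranking patients.
error = concordance_error(times, events, scores)
print(scores, groups, round(error, 4))
```

In this toy setup the score doubles as the risk ranking; a fitted forest would instead supply its own predicted risk, and group differences would be tested with a log-rank test, which is outside this sketch.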
first_indexed 2024-04-11T05:20:35Z
format Article
id doaj.art-ba8b3b6cf6584f5f8f43d495544b58f2
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-04-11T05:20:35Z
publishDate 2022-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-ba8b3b6cf6584f5f8f43d495544b58f2; 2022-12-24T04:51:09Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2022-01-01; volume 20, pages 583-597
Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia
Zachary S. Bohannan, Frederick Coffman, Antonina Mitrofanova (corresponding author)
Affiliation (all authors): Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Suite 120, Newark, NJ 07107-1709, United States
Corresponding author at: Rutgers, The State University of New Jersey, School of Health Professions, Department of Health Informatics, 65 Bergen Street, Room 923B, Newark, NJ, 07107, United States
http://www.sciencedirect.com/science/article/pii/S2001037022000058
Keywords: Machine learning; Random survival forest; Genomics; Clinical oncology; Bioinformatics
title Random survival forest model identifies novel biomarkers of event-free survival in high-risk pediatric acute lymphoblastic leukemia
topic Machine learning
Random survival forest
Genomics
Clinical oncology
Bioinformatics
url http://www.sciencedirect.com/science/article/pii/S2001037022000058