Fast, accurate, and racially unbiased pan-cancer tumor-only variant calling with tabular machine learning

Bibliographic Details
Main Authors: R. Tyler McLaughlin, Maansi Asthana, Marc Di Meo, Michele Ceccarelli, Howard J. Jacob, David L. Masica
Format: Article
Language: English
Published: Nature Portfolio 2023-01-01
Series: npj Precision Oncology
Online Access: https://doi.org/10.1038/s41698-022-00340-1
collection DOAJ
description Abstract Accurately identifying somatic mutations is essential for precision oncology and crucial for calculating tumor-mutational burden (TMB), an important predictor of response to immunotherapy. For tumor-only variant calling (i.e., when the cancer biopsy but not the patient’s normal tissue sample is sequenced), accurately distinguishing somatic mutations from germline variants is a challenging problem that, when unaddressed, results in unreliable, biased, and inflated TMB estimates. Here, we apply machine learning to the task of somatic vs germline classification in tumor-only solid tumor samples using TabNet, XGBoost, and LightGBM, three machine-learning models for tabular data. We constructed a training set for supervised classification using features derived exclusively from tumor-only variant calling and drawing somatic and germline truth labels from an independent pipeline using the patient-matched normal samples. All three trained models achieved state-of-the-art performance on two holdout test datasets: a TCGA dataset including sarcoma, breast adenocarcinoma, and endometrial carcinoma samples (AUC > 94%), and a metastatic melanoma dataset (AUC > 85%). Concordance between matched-normal and tumor-only TMB improves from R² = 0.006 to 0.71–0.76 with the addition of a machine-learning classifier, with LightGBM performing best. Notably, these machine-learning models generalize across cancer subtypes and capture kits with a call rate of 100%. We reproduce the recent finding that tumor-only TMB estimates for Black patients are extremely inflated relative to those for white patients due to the racial biases of germline databases. We show that our approach with XGBoost and LightGBM eliminates this significant racial bias in tumor-only variant calling.
first_indexed 2024-03-11T13:46:40Z
format Article
id doaj.art-c5b2cf3cf2c544e188a700aec68c1107
institution Directory Open Access Journal
issn 2397-768X
language English
last_indexed 2024-03-11T13:46:40Z
publishDate 2023-01-01
publisher Nature Portfolio
record_format Article
series npj Precision Oncology
spelling doaj.art-c5b2cf3cf2c544e188a700aec68c1107 2023-11-02T10:17:19Z eng Nature Portfolio npj Precision Oncology 2397-768X 2023-01-01 7 1 1 12 10.1038/s41698-022-00340-1
Author affiliations:
R. Tyler McLaughlin: Genomics Research Center, AbbVie
Maansi Asthana: Agricultural and Biological Engineering at Purdue University
Marc Di Meo: Johns Hopkins University
Michele Ceccarelli: Department of Electrical Engineering and Information Technology, University of Naples “Federico II”
Howard J. Jacob: Genomics Research Center, AbbVie
David L. Masica: Genomics Research Center, AbbVie
url https://doi.org/10.1038/s41698-022-00340-1
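
The abstract reports concordance between matched-normal and tumor-only TMB as an R² value that improves once a somatic vs germline classifier filters the tumor-only calls. The following is a minimal sketch of that bookkeeping, not the authors' pipeline. It assumes TMB is counted as somatic variants per megabase, that a variant is kept as somatic when its classifier probability is at least 0.5, and that R² is the squared Pearson correlation. The function names, the threshold, and the region size are all hypothetical and are not taken from the article.

```python
def tmb(somatic_flags, megabases):
    """Mutations per megabase: count of variants flagged somatic / region size."""
    return sum(1 for flag in somatic_flags if flag) / megabases


def filtered_tmb(somatic_probabilities, megabases, threshold=0.5):
    """Tumor-only TMB after keeping variants the classifier scores as somatic.

    The 0.5 threshold is an illustrative assumption, not a value from the article.
    """
    flags = [p >= threshold for p in somatic_probabilities]
    return tmb(flags, megabases)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences.

    Assumes both sequences have nonzero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


if __name__ == "__main__":
    # Made-up per-sample classifier probabilities, for illustration only.
    samples = [
        [0.9, 0.2, 0.7, 0.4],
        [0.8, 0.9, 0.6, 0.95, 0.1, 0.7],
        [0.3, 0.1, 0.55],
    ]
    region_mb = 2.0  # hypothetical size of the sequenced region in megabases
    tumor_only = [filtered_tmb(probs, region_mb) for probs in samples]
    matched_normal = [1.0, 2.0, 0.5]  # made-up reference TMB values
    print(tumor_only, r_squared(matched_normal, tumor_only))
```

Without the classifier step, every germline variant that survives database filtering is counted toward tumor-only TMB, which is the inflation the abstract describes; the sketch only shows where a per-variant probability would enter the calculation.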