Beyond multidrug resistance: Leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction

Background: The diagnosis of multidrug resistant and extensively drug resistant tuberculosis is a global health priority. Whole genome sequencing of clinical Mycobacterium tuberculosis isolates promises to circumvent the long wait times and limited scope of conventional phenotypic antimicrobial susc...

Bibliographic Details
Main Authors: Michael L. Chen, Akshith Doddi, Jimmy Royer, Luca Freschi, Marco Schito, Matthew Ezewudo, Isaac S. Kohane, Andrew Beam, Maha Farhat
Format: Article
Language: English
Published: Elsevier 2019-05-01
Series: EBioMedicine
Online Access: http://www.sciencedirect.com/science/article/pii/S2352396419302506
collection DOAJ
description Background: The diagnosis of multidrug-resistant and extensively drug-resistant tuberculosis is a global health priority. Whole genome sequencing of clinical Mycobacterium tuberculosis isolates promises to circumvent the long wait times and limited scope of conventional phenotypic antimicrobial susceptibility testing, but gaps remain in predicting phenotype accurately from genotypic data, especially for certain drugs. Our primary aim was to explore statistical learning algorithms and genetic predictor sets, using a rich dataset, to build a high-performing and fast-predicting model for detecting anti-tuberculosis drug resistance.
Methods: We collected targeted or whole genome sequencing data and conventional drug resistance phenotyping data from 3601 Mycobacterium tuberculosis strains enriched for resistance to first- and second-line drugs, including 1228 multidrug-resistant strains. We investigated the utility of (1) rare variants and variants known to be determinants of resistance for at least one drug and (2) machine and statistical learning architectures in predicting phenotypic resistance to 10 anti-tuberculosis drugs. Specifically, we investigated multitask and single-task wide and deep neural networks, a multilayer perceptron, regularized logistic regression, and random forest classifiers.
Findings: The highest-performing machine and statistical learning methods included both rare variants and variants known to cause resistance to at least one drug. Both the simpler L2-penalized regression and the more complex machine learning models had high predictive performance. During repeated cross-validation, the average AUCs for our highest-performing model were 0.979 for first-line drugs and 0.936 for second-line drugs. On an independent validation set, the highest-performing model showed average AUCs, sensitivities, and specificities, respectively, of 0.937, 87.9%, and 92.7% for first-line drugs and 0.891, 82.0%, and 90.1% for second-line drugs. Our method outperforms existing approaches based on direct association, increasing the sum of sensitivity and specificity by 11.7% for first-line drugs and 3.2% for second-line drugs. It also has higher predictive performance than previously reported machine learning models during cross-validation, with higher AUCs for 8 of 10 drugs.
Interpretation: Statistical models, especially those trained using both frequent and less frequent variants, significantly improve the accuracy of resistance prediction and hold promise for bringing sequencing technologies closer to the bedside.
Keywords: Mycobacterium tuberculosis, Multidrug-resistance, Extensively drug-resistant tuberculosis, Machine learning, Genome sequencing
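The abstract reports AUC, sensitivity, specificity, and the sum of sensitivity and specificity. As a rough illustration of how these quantities relate, the following Python sketch computes them for a single drug. It is not the authors' code: the function names, the 0.5 threshold, and the eight toy isolates are all assumptions made up for this example.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called resistant.

    y_true uses 1 for phenotypically resistant and 0 for susceptible.
    The 0.5 default threshold is an illustrative assumption.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (resistant, susceptible) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (invented): phenotype labels and predicted resistance
# probabilities for eight hypothetical isolates.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

sens, spec = sensitivity_specificity(labels, probs)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} sum={sens + spec:.3f}")
print(f"AUC={auc(labels, probs):.4f}")
```

Unlike AUC, sensitivity and specificity depend on the chosen threshold, so a comparison based on their sum only makes sense once each method's operating point is fixed.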
first_indexed 2024-12-11T21:12:52Z
format Article
id doaj.art-f77b3956a01646a48185a63482e0b23e
institution Directory Open Access Journal
issn 2352-3964
language English
last_indexed 2024-12-11T21:12:52Z
publishDate 2019-05-01
publisher Elsevier
record_format Article
series EBioMedicine
spelling doaj.art-f77b3956a01646a48185a63482e0b23e; 2022-12-22T00:50:41Z; eng; Elsevier; EBioMedicine; ISSN 2352-3964; 2019-05-01; volume 43, pages 356-369
Author affiliations:
Michael L. Chen (0): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America
Akshith Doddi (1): University of Virginia School of Medicine, Charlottesville, VA, United States of America
Jimmy Royer (2): Analysis Group Inc., United States of America
Luca Freschi (3): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America
Marco Schito (4): Critical Path Institute, 1730 E River Rd., Tucson, AZ, United States of America
Matthew Ezewudo (5): Critical Path Institute, 1730 E River Rd., Tucson, AZ, United States of America
Isaac S. Kohane (6): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America
Andrew Beam (7): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, United States of America
Maha Farhat (8): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America; Division of Pulmonary & Critical Care, Massachusetts General Hospital, Boston, MA, United States of America; Corresponding author at: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America.