A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking

Bibliographic Details
Main Authors: Yen-Yi Liu, Chih-Chieh Chen
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8604304/?tool=EBI
collection DOAJ
description Background: As whole-genome sequencing of pathogen genomes becomes increasingly popular, gene-by-gene typing methods such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST) are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, differences in read depth, read length, and assembler influence the genome assemblies, introducing erroneous or missing alleles into the generated allelic profiles. These erroneous and missing alleles might create “specious discrepancy” among closely related isolates, making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause storage and maintenance problems as well as long query search times.
Methods: We attempted to resolve these issues by decreasing the scheme size to reduce the occurrence of erroneous and missing alleles, alleviate the storage burden, and shorten the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular artificial intelligence technique, XGBoost, coupled with Shapley additive explanations for feature selection. Finally, 370 of the original 1701 cgMLST loci of Listeria monocytogenes were selected.
Results: Although the final scheme (LmScheme_370) was approximately 80% smaller than the original cgMLST scheme, its discriminatory power, tested on 35 outbreaks, was concordant with that of the original cgMLST scheme. Although we used L. monocytogenes as a demonstration in this study, the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene–based epidemiology.
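The description above argues that erroneous or missing alleles can inflate the apparent distance between closely related isolates, and that a smaller scheme offers fewer places for such artifacts to occur. The following is a minimal sketch of that idea only, not the authors' pipeline: the locus names, allele numbers, the `allelic_distance` helper, and the two-locus subset are all hypothetical, and the subset is picked by hand rather than by XGBoost or Shapley additive explanations. The last lines check the "approximately 80%" size reduction using the 370 and 1701 loci counts stated in the record.

```python
from typing import Dict, Iterable, Optional

# Locus name -> allele number; None marks a locus that could not be called.
Profile = Dict[str, Optional[int]]


def allelic_distance(a: Profile, b: Profile, loci: Iterable[str],
                     ignore_missing: bool = True) -> int:
    """Count the loci at which two allelic profiles differ."""
    distance = 0
    for locus in loci:
        x, y = a.get(locus), b.get(locus)
        if x is None or y is None:
            # A missing call is either skipped or counted as a difference.
            if not ignore_missing:
                distance += 1
            continue
        if x != y:
            distance += 1
    return distance


# Hypothetical toy data: two isolates from one outbreak, five made-up loci.
full_scheme = ["locus1", "locus2", "locus3", "locus4", "locus5"]
reduced_scheme = ["locus1", "locus3"]  # illustrative subset, not LmScheme_370

isolate_a: Profile = {"locus1": 4, "locus2": 7, "locus3": 1, "locus4": 2, "locus5": 9}
# In isolate_b, locus2 failed to assemble and locus4 carries a miscalled allele.
isolate_b: Profile = {"locus1": 4, "locus2": None, "locus3": 1, "locus4": 3, "locus5": 9}

naive_full = allelic_distance(isolate_a, isolate_b, full_scheme, ignore_missing=False)
tolerant_full = allelic_distance(isolate_a, isolate_b, full_scheme)
reduced = allelic_distance(isolate_a, isolate_b, reduced_scheme)
print(naive_full, tolerant_full, reduced)  # prints: 2 1 0

# Size reduction implied by the loci counts given in the description.
size_reduction = 1 - 370 / 1701
print(f"{size_reduction:.1%}")  # prints: 78.2%
```

In this toy case the two isolates look two alleles apart under the full scheme when missing calls count as differences, and identical under the hand-picked subset. Whether a real reduced scheme keeps the resolution needed to separate unrelated isolates is exactly what the study's feature selection and 35-outbreak evaluation address; the sketch does not test that.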
id doaj.art-0909bd3656ed4b978a472b992883c198
institution Directory of Open Access Journals
issn 1932-6203