Hash-based core genome multilocus sequence typing for Clostridium difficile
Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (c...
Main Authors: | Eyre, DW; Peto, TEA; Crook, DW; Walker, AS; Wilcox, MH |
---|---|
Format: | Journal article |
Language: | English |
Published: | American Society for Microbiology, 2019 |
_version_ | 1826262914723479552 |
---|---|
author | Eyre, DW; Peto, TEA; Crook, DW; Walker, AS; Wilcox, MH |
author_sort | Eyre, DW |
collection | OXFORD |
description | Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (cgMLST) in which alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralized database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to those of mapping-based approaches in Clostridium difficile, using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals. Hash-cgMLST provided the same results as standard cgMLST, with minimal performance penalty. Comparing 272 replicate sequence pairs using reference-based mapping, there were 0, 1, or 2 single-nucleotide polymorphisms (SNPs) between 262 (96%), 5 (2%), and 1 (<1%) of the pairs, respectively. Using hash-cgMLST, 218 (80%) of replicate pairs assembled with SPAdes had zero gene differences, and 31 (11%), 5 (2%), and 18 (7%) pairs had 1, 2, and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Considering 412 pairs of infections with ≤2 SNPs, i.e., consistent with recent transmission, 376 (91%) had ≤2 gene differences and 16 (4%) had ≥4. Comparing a genome to 100,000 others took <1 min using hash-cgMLST. Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generate more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships. |
first_indexed | 2024-03-06T19:43:22Z |
format | Journal article |
id | oxford-uuid:216bb92c-b63d-4d2f-971e-fe6845a5fdbb |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-06T19:43:22Z |
publishDate | 2019 |
publisher | American Society for Microbiology |
record_format | dspace |
title | Hash-based core genome multilocus sequence typing for Clostridium difficile |
work_keys_str_mv | AT eyredw hashbasedcoregenomemultilocussequencingtypingforclostridiumdifficile AT petotea hashbasedcoregenomemultilocussequencingtypingforclostridiumdifficile AT crookdw hashbasedcoregenomemultilocussequencingtypingforclostridiumdifficile AT walkeras hashbasedcoregenomemultilocussequencingtypingforclostridiumdifficile AT wilcoxmh hashbasedcoregenomemultilocussequencingtypingforclostridiumdifficile |
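
The description above says that hash-cgMLST converts the allele at each gene to a reproducible hash and then compares genomes by counting gene differences. The following is a minimal sketch of that idea, not the authors' implementation. The choice of MD5, the uppercase normalization, the rule that genes missing from either genome are skipped, and all function and gene names (`allele_hash`, `build_profile`, `gene_differences`, `geneA`, and so on) are assumptions made for illustration only.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical type: gene name -> allele hash, or None if the gene was not called.
Profile = Dict[str, Optional[str]]


def allele_hash(sequence: str) -> str:
    """Return a reproducible hash for one allele sequence.

    Assumption: MD5 over the uppercased sequence. Any deterministic hash
    would serve the same purpose in this sketch.
    """
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_profile(alleles: Dict[str, Optional[str]]) -> Profile:
    """Convert gene -> allele sequence into gene -> allele hash."""
    return {
        gene: allele_hash(seq) if seq else None
        for gene, seq in alleles.items()
    }


def gene_differences(profile_a: Profile, profile_b: Profile) -> int:
    """Count genes whose hashes differ between two profiles.

    Assumption: genes absent or uncalled in either profile are ignored
    rather than counted as differences.
    """
    differences = 0
    for gene in profile_a.keys() & profile_b.keys():
        hash_a, hash_b = profile_a[gene], profile_b[gene]
        if hash_a is None or hash_b is None:
            continue
        if hash_a != hash_b:
            differences += 1
    return differences


# Toy example with made-up sequences (not real C. difficile alleles).
genome_a = build_profile({"geneA": "ATGAAA", "geneB": "ATGCCC", "geneC": None})
genome_b = build_profile({"geneA": "atgaaa", "geneB": "ATGCCT", "geneC": "ATGGGG"})
print(gene_differences(genome_a, genome_b))  # prints 1 (only geneB differs)
```

Because each hash depends only on the allele sequence itself, two laboratories can compute comparable profiles independently, which is the property the description contrasts with a centralized database of sequentially numbered alleles.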