K-mer based prediction of Clostridioides difficile relatedness and ribotypes

Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly becoming part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data invo...

Bibliographic Details
Main authors:	Moore, M, Wilcox, MH, Walker, AS, Eyre, D
Format:	Journal article
Language:	English
Published:	Microbiology Society 2022

author	Moore, M Wilcox, MH Walker, AS Eyre, D
collection	OXFORD
description	Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly becoming part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and applying the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide polymorphism (SNP) differences between genomes and C. difficile ribotypes (RTs). In a set of 1,905 diverse C. difficile genomes (differing by 0-168,519 SNPs), screening for closely related genomes with sourmash at a sensitivity of 100% for pairs ≤10 SNPs apart reduced the number of pairs from 1,813,560 overall to 161,934, i.e. by 91%, with a positive predictive value (PPV) of 32% for correctly identifying pairs ≤10 SNPs apart (maximum SNP distance 4,144). At a sensitivity of 95%, pairs were reduced by 94% to 108,266 and the PPV increased to 45% (maximum SNP distance 1,009). Increasing the MinHash sketch size above 2,000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3,937) were randomly split into a training set (2,937) and a test set (1,000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the 5 closest genomes in the index all had the same ribotype, this was taken as the predicted ribotype of the searched genome. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1,000 (78%) genomes, incorrect in 20 (2%), and indeterminate in 200 (20%). Relaxing the classifier to 4 of the 5 closest matches sharing the same RT increased correct predictions to 87%. Using MinHash, it is possible to subsample C. difficile genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
first_indexed	2024-03-07T07:05:12Z
format	Journal article
id	oxford-uuid:bcaa8fa7-b90a-4275-9f12-b4fce61b14b0
institution	University of Oxford
language	English
last_indexed	2024-03-07T07:05:12Z
publishDate	2022
publisher	Microbiology Society
record_format	dspace
title	K-mer based prediction of Clostridioides difficile relatedness and ribotypes
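
The description above outlines two steps: estimating genome similarity from a MinHash sketch of k-mer hashes, and predicting a ribotype when the closest indexed genomes agree. The following is a minimal Python sketch of both ideas, not the authors' pipeline and not the sourmash API. The function names, the k-mer length of 21, and the use of SHA-1 as the hash are all assumptions made for illustration; the sketch size of 2,000 and the "5 closest" and "4 of 5" agreement rules are taken from the description.

```python
import hashlib
from collections import Counter


def kmers(seq, k=21):
    """Yield every k-mer of a DNA string (k=21 is an assumed value)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=21, size=2000):
    """Bottom-k MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {
        # SHA-1 truncated to 64 bits is an illustrative stand-in hash.
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=2000):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Takes the `size` smallest hashes of the merged sketches and returns
    the fraction of them that occur in both sketches.
    """
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


def predict_ribotype(query, index, n_closest=5, min_agree=5, size=2000):
    """Predict a ribotype from the closest sketches in a labelled index.

    `index` is a list of (ribotype, sketch) pairs. Returns the ribotype
    shared by at least `min_agree` of the `n_closest` most similar entries,
    or None when the result is indeterminate.
    """
    ranked = sorted(
        index,
        key=lambda item: jaccard_estimate(query, item[1], size),
        reverse=True,
    )
    votes = Counter(ribotype for ribotype, _ in ranked[:n_closest])
    if not votes:
        return None
    ribotype, count = votes.most_common(1)[0]
    return ribotype if count >= min_agree else None
```

Calling `predict_ribotype(query, index)` applies the strict rule in which all 5 closest genomes must agree, while `predict_ribotype(query, index, min_agree=4)` corresponds to the relaxed 4-of-5 rule. A real analysis would use sourmash itself, which handles reads, assemblies, and indexing at scale.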
