Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants

Background: Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences from viral...

Bibliographic Details
Main Authors: Viachaslau Tsyvina, David S. Campo, Seth Sims, Alex Zelikovsky, Yury Khudyakov, Pavel Skums
Format: Article
Language: English
Published: BMC 2018-10-01
Series: BMC Bioinformatics
Subjects: Similarity search, Similarity join, K-mer, Filtering, Edit distance, Hamming distance
Online Access: http://link.springer.com/article/10.1186/s12859-018-2333-9
author Viachaslau Tsyvina
David S. Campo
Seth Sims
Alex Zelikovsky
Yury Khudyakov
Pavel Skums
collection DOAJ
description Background: Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences from viral datasets sampled from infected individuals, and studying the evolution of viruses or immune repertoires by analyzing networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naive algorithms for extracting such sequence families are impractical in light of the massive size of modern NGS datasets. Results: In this paper, we present a fast and scalable k-mer-based framework for performing such sequence similarity queries efficiently, which specifically targets data produced by deep sequencing of heterogeneous populations such as viruses. It shows better filtering quality and time performance compared with other tools. The tool is freely available for download at https://github.com/vyacheslav-tsivina/signature-sj. Conclusion: The proposed tool allows for efficient detection of genetic relatedness between genomic samples produced by deep sequencing of heterogeneous populations. It should be especially useful for analyzing the relatedness of genomes of viruses with unevenly distributed variable genomic regions, such as HIV and HCV. In the future, we envision that, besides applications in molecular epidemiology, the tool can also be adapted to immunosequencing and metagenomics data.
first_indexed 2024-12-11T19:00:57Z
format Article
id doaj.art-7a68e817401d4966ba526696c5de9070
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-11T19:00:57Z
publishDate 2018-10-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-7a68e817401d4966ba526696c5de9070
2022-12-22T00:54:01Z
eng
BMC
BMC Bioinformatics
1471-2105
2018-10-01
Volume 19, issue S11, pages 1-10
10.1186/s12859-018-2333-9
Viachaslau Tsyvina (Computer Science Department, Georgia State University)
David S. Campo (Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention)
Seth Sims (Computer Science Department, Georgia State University)
Alex Zelikovsky (Computer Science Department, Georgia State University)
Yury Khudyakov (Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention)
Pavel Skums (Computer Science Department, Georgia State University)
title Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants
topic Similarity search
Similarity join
K-mer
Filtering
Edit distance
Hamming distance
url http://link.springer.com/article/10.1186/s12859-018-2333-9