The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.
We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e., the number of substitutions per site that occurred since they evolved from their last common ancestor) ca...
Main Authors: | Sophie Röhling, Alexander Linne, Jendrik Schellhorn, Morteza Hosseini, Thomas Dencker, Burkhard Morgenstern |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2020-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0228070 |
_version_ | 1819031831359520768 |
---|---|
author | Sophie Röhling Alexander Linne Jendrik Schellhorn Morteza Hosseini Thomas Dencker Burkhard Morgenstern |
author_sort | Sophie Röhling |
collection | DOAJ |
description | We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e., the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated from the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances of up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results even if sequences share only local homologies. |
first_indexed | 2024-12-21T06:52:18Z |
format | Article |
id | doaj.art-ff1f70215f2143a09cc0a9ed2f7c795f |
institution | Directory Open Access Journal |
issn | 1932-6203 |
language | English |
last_indexed | 2024-12-21T06:52:18Z |
publishDate | 2020-01-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS ONE |
spelling | doaj.art-ff1f70215f2143a09cc0a9ed2f7c795f; 2022-12-21T19:12:27Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2020-01-01; 15; 2; e0228070; 10.1371/journal.pone.0228070; https://doi.org/10.1371/journal.pone.0228070 |
title | The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. |
url | https://doi.org/10.1371/journal.pone.0228070 |
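
The description above says that the Jukes-Cantor distance can be estimated from the slope of a function F of the match counts Nk, evaluated at two word lengths. The following Python sketch illustrates that idea under stated assumptions and is not the Slope-SpaM implementation. The function names are hypothetical, `log_corrected_matches` is only a stand-in for F (the log of Nk after subtracting the matches expected by chance under uniform base composition), and `k_min` and `k_max` are passed in by hand because the record does not give the formulas for kmin and kmax.

```python
import math
from collections import Counter


def count_kmer_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Count pairs (i, j) whose length-k words in seq_a and seq_b are identical."""
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(n * words_b[w] for w, n in words_a.items() if w in words_b)


def log_corrected_matches(seq_a: str, seq_b: str, k: int) -> float:
    """Assumed stand-in for F(k): log of N_k minus the expected random matches.

    The background term assumes independent, uniformly distributed bases.
    Raises ValueError if N_k does not exceed the background.
    """
    n_k = count_kmer_matches(seq_a, seq_b, k)
    background = (len(seq_a) - k + 1) * (len(seq_b) - k + 1) * 0.25 ** k
    return math.log(n_k - background)


def jukes_cantor_distance(match_prob: float) -> float:
    """Standard Jukes-Cantor correction for a per-site match probability.

    Raises ValueError if match_prob is 0.25 or lower (saturation).
    """
    mismatch = 1.0 - match_prob
    return -0.75 * math.log(1.0 - 4.0 * mismatch / 3.0)


def slope_distance(seq_a: str, seq_b: str, k_min: int, k_max: int) -> float:
    """Estimate a distance from the slope of the assumed F between k_min and k_max.

    If homologous matches dominate and scale like p**k, the slope of
    log-matches over k approximates ln(p), where p is the match probability.
    """
    f_min = log_corrected_matches(seq_a, seq_b, k_min)
    f_max = log_corrected_matches(seq_a, seq_b, k_max)
    slope = (f_max - f_min) / (k_max - k_min)
    return jukes_cantor_distance(math.exp(slope))
```

Calling `slope_distance(seq_a, seq_b, k_min, k_max)` on two homologous sequences returns an estimate in substitutions per site. Small sampling effects can push the estimate slightly below zero for near-identical inputs, so a caller may want to clamp it at 0.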