The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.

We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) ca...


Bibliographic Details
Main Authors: Sophie Röhling, Alexander Linne, Jendrik Schellhorn, Morteza Hosseini, Thomas Dencker, Burkhard Morgenstern
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0228070
Collection: DOAJ
Description: We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
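The slope idea in the description can be illustrated with a minimal Python sketch for contiguous words. This is not the Slope-SpaM implementation. The description does not define F, so the form used here, F(k) = ln(Nk - expected random matches) with a uniform background of 1/4 per base, is an assumption, as are all function names and the hand-picked values of kmin and kmax. Under that assumption the slope of F approximates ln(p), where p is the per-site match probability, and the standard Jukes-Cantor formula converts p into a distance.

```python
import math
from collections import Counter


def count_kmer_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Count position pairs (i, j) whose k-mers in seq_a and seq_b are identical."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def f_value(seq_a: str, seq_b: str, k: int) -> float:
    """ASSUMED form of F: log of the match count minus expected random matches."""
    n_k = count_kmer_matches(seq_a, seq_b, k)
    # Assumption: random background matches occur with probability (1/4)**k
    # for each of the (len_a - k + 1) * (len_b - k + 1) position pairs.
    background = (len(seq_a) - k + 1) * (len(seq_b) - k + 1) / 4 ** k
    if n_k <= background:
        raise ValueError(f"no signal above background at k={k}")
    return math.log(n_k - background)


def jukes_cantor_distance(p_match: float) -> float:
    """Standard Jukes-Cantor distance from the per-site match probability."""
    if p_match <= 0.25:
        raise ValueError("match probability too low for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * (1.0 - p_match))


def slope_distance(seq_a: str, seq_b: str, k_min: int, k_max: int) -> float:
    """Estimate the distance from the slope of F between k_min and k_max."""
    slope = (f_value(seq_a, seq_b, k_max) - f_value(seq_a, seq_b, k_min)) / (k_max - k_min)
    p_match = math.exp(slope)  # the slope approximates ln(p) under the assumed F
    return jukes_cantor_distance(p_match)
```

Calling `slope_distance(a, b, k_min, k_max)` on two related sequences returns an estimate in substitutions per site. Noise in the word counts can push the estimated p slightly above 1, in which case the sketch returns a small negative value rather than clamping it.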
ISSN: 1932-6203
Citation: PLoS ONE, volume 15, issue 2, article e0228070 (2020-01-01)