Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent.

Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at a high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier for the development of a stable estimator. This is especially tru...


Bibliographic Details
Main Authors: Alisa Prusokiene, Neil Boonham, Adrian Fox, Thomas P Howard
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0298834
author Alisa Prusokiene
Neil Boonham
Adrian Fox
Thomas P Howard
collection DOAJ
description Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier to the development of a stable estimator. This is especially true for viral genomes, which have high mutation rates, small sizes, and sparse taxonomy. An accurate substitution distance measure would help to elucidate the relationships between highly divergent sequences, interrogate their evolutionary history, and facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. On simulated and biological sequences, Mottle remained stable up to 0.66-0.96 substitutions per base pair and identified viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. One limitation, compared with alternative approaches, is reduced numerical accuracy at low divergences and on genomes where insertions and deletions are uncommon. We propose that Mottle may therefore be of particular interest in the study of viruses and viral relationships, and notably for viral discovery platforms, where it can help in benchmarking homology search tools and defining the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
first_indexed 2024-04-24T18:45:20Z
format Article
id doaj.art-78b6d1aa228d4fd880a263057483b08b
institution Directory of Open Access Journals
issn 1932-6203
language English
last_indexed 2024-04-24T18:45:20Z
publishDate 2024-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-78b6d1aa228d4fd880a263057483b08b 2024-03-27T05:32:40Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2024-01-01 19 3 e0298834 10.1371/journal.pone.0298834 Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent. Alisa Prusokiene; Neil Boonham; Adrian Fox; Thomas P Howard https://doi.org/10.1371/journal.pone.0298834
title Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent.
url https://doi.org/10.1371/journal.pone.0298834