Detecting gene breakpoints in noisy genome sequences using position-annotated colored de-Bruijn graphs

Abstract Background Identifying the locations of gene breakpoints between species of different taxonomic groups can provide useful insights into the underlying evolutionary processes. Given the exact locations of their genes, the breakpoints can be computed without much effort. However, often, exist...

Full description

Bibliographic Details
Main Authors: Lisa Fiedler, Matthias Bernt, Martin Middendorf, Peter F. Stadler
Format: Article
Language:English
Published: BMC 2023-06-01
Series:BMC Bioinformatics
Subjects:
Online Access:https://doi.org/10.1186/s12859-023-05371-4
Description
Summary:Abstract Background Identifying the locations of gene breakpoints between species of different taxonomic groups can provide useful insights into the underlying evolutionary processes. Given the exact locations of their genes, the breakpoints can be computed without much effort. However, often, existing gene annotations are erroneous, or only nucleotide sequences are available. Especially in mitochondrial genomes, high variations in gene orders are usually accompanied by a high degree of sequence inconsistencies. This makes accurately locating breakpoints in mitogenomic nucleotide sequences a challenging task. Results This contribution presents a novel method for detecting gene breakpoints in the nucleotide sequences of complete mitochondrial genomes, taking into account possible high substitution rates. The method is implemented in the software package DeBBI. DeBBI allows to analyze transposition- and inversion-based breakpoints independently and uses a parallel program design, allowing to make use of modern multi-processor systems. Extensive tests on synthetic data sets, covering a broad range of sequence dissimilarities and different numbers of introduced breakpoints, demonstrate DeBBI ’s ability to produce accurate results. Case studies using species of various taxonomic groups further show DeBBI ’s applicability to real-life data. While (some) multiple sequence alignment tools can also be used for the task at hand, we demonstrate that especially gene breaks between short, poorly conserved tRNA genes can be detected more frequently with the proposed approach. Conclusion The proposed method constructs a position-annotated de-Bruijn graph of the input sequences. Using a heuristic algorithm, this graph is searched for particular structures, called bulges, which may be associated with the breakpoint locations. Despite the large size of these structures, the algorithm only requires a small number of graph traversal steps.
ISSN:1471-2105