Fast NJ-like algorithms to deal with incomplete distance matrices

Abstract Background Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the so-obtained distance matrix. These methods are fast and fairly accurate. However, they hardly dea...

Full description

Bibliographic Details
Main Authors:	Criscuolo Alexis, Gascuel Olivier
Format:	Article
Language:	English
Published:	BMC 2008-03-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/9/166

_version_	1819027077143199744
author	Criscuolo Alexis Gascuel Olivier
author_facet	Criscuolo Alexis Gascuel Olivier
author_sort	Criscuolo Alexis
collection	DOAJ
description	<p>Abstract</p> <p>Background</p> <p>Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the so-obtained distance matrix. These methods are fast and fairly accurate. However, they hardly deal with incomplete distance matrices. Such matrices are frequent with recent multi-gene studies, when two species do not share any gene in analyzed data. The few existing algorithms to infer trees with satisfying accuracy from incomplete distance matrices have time complexity in <it>O</it>(<it>n</it><sup>4</sup>) or more, where <it>n </it>is the number of taxa, which precludes large scale studies. Agglomerative distance algorithms (e.g. NJ <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>) are much faster, with time complexity in <it>O</it>(<it>n</it><sup>3</sup>) which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two so-created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.</p> <p>Results</p> <p>We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and MVR <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Steps (b) and (c) are also modified, but <it>O</it>(<it>n</it><sup>3</sup>) time complexity is kept. The performance of these new algorithms is studied with large scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms – named NJ, BIONJ and MVR* – infer phylogenetic trees that are as least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited for multi-gene studies where some distances are accurately estimated using numerous genes, whereas others are poorly estimated (or not estimated) due to the low number (absence) of sequenced genes being shared by both species.</p> <p>Conclusion</p> <p>Our distance-based agglomerative algorithms NJ, BIONJ and MVR* are fast and accurate, and should be quite useful for large scale phylogenomic studies. When combined with the SDM method <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> to estimate a distance matrix from multiple genes, they offer a relevant alternative to usual supertree techniques <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Binaries and all simulated data are downloadable from <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
first_indexed	2024-12-21T05:36:44Z
format	Article
id	doaj.art-4eb012e6503d4fd4bd7d7329daafd72c
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-12-21T05:36:44Z
publishDate	2008-03-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-4eb012e6503d4fd4bd7d7329daafd72c2022-12-21T19:14:23ZengBMCBMC Bioinformatics1471-21052008-03-019116610.1186/1471-2105-9-166Fast NJ-like algorithms to deal with incomplete distance matricesCriscuolo AlexisGascuel Olivier<p>Abstract</p> <p>Background</p> <p>Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the so-obtained distance matrix. These methods are fast and fairly accurate. However, they hardly deal with incomplete distance matrices. Such matrices are frequent with recent multi-gene studies, when two species do not share any gene in analyzed data. The few existing algorithms to infer trees with satisfying accuracy from incomplete distance matrices have time complexity in <it>O</it>(<it>n</it><sup>4</sup>) or more, where <it>n </it>is the number of taxa, which precludes large scale studies. Agglomerative distance algorithms (e.g. NJ <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>) are much faster, with time complexity in <it>O</it>(<it>n</it><sup>3</sup>) which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two so-created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.</p> <p>Results</p> <p>We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and MVR <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Steps (b) and (c) are also modified, but <it>O</it>(<it>n</it><sup>3</sup>) time complexity is kept. The performance of these new algorithms is studied with large scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms – named NJ, BIONJ and MVR* – infer phylogenetic trees that are as least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited for multi-gene studies where some distances are accurately estimated using numerous genes, whereas others are poorly estimated (or not estimated) due to the low number (absence) of sequenced genes being shared by both species.</p> <p>Conclusion</p> <p>Our distance-based agglomerative algorithms NJ, BIONJ and MVR* are fast and accurate, and should be quite useful for large scale phylogenomic studies. When combined with the SDM method <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> to estimate a distance matrix from multiple genes, they offer a relevant alternative to usual supertree techniques <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Binaries and all simulated data are downloadable from <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>http://www.biomedcentral.com/1471-2105/9/166
spellingShingle	Criscuolo Alexis Gascuel Olivier Fast NJ-like algorithms to deal with incomplete distance matrices BMC Bioinformatics
title	Fast NJ-like algorithms to deal with incomplete distance matrices
title_full	Fast NJ-like algorithms to deal with incomplete distance matrices
title_fullStr	Fast NJ-like algorithms to deal with incomplete distance matrices
title_full_unstemmed	Fast NJ-like algorithms to deal with incomplete distance matrices
title_short	Fast NJ-like algorithms to deal with incomplete distance matrices
title_sort	fast nj like algorithms to deal with incomplete distance matrices
url	http://www.biomedcentral.com/1471-2105/9/166
work_keys_str_mv	AT criscuoloalexis fastnjlikealgorithmstodealwithincompletedistancematrices AT gascuelolivier fastnjlikealgorithmstodealwithincompletedistancematrices

Fast NJ-like algorithms to deal with incomplete distance matrices

Similar Items