Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

Abstract Background Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on th...

Full description

Bibliographic Details
Main Authors:	Vaughn Matthew, Dinh Hieu, Rajasekaran Sanguthevar, Kundeti Vamsi K, Thapar Vishal
Format:	Article
Language:	English
Published:	BMC 2010-11-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/11/560

_version_	1819119419460157440
author	Vaughn Matthew Dinh Hieu Rajasekaran Sanguthevar Kundeti Vamsi K Thapar Vishal
author_facet	Vaughn Matthew Dinh Hieu Rajasekaran Sanguthevar Kundeti Vamsi K Thapar Vishal
author_sort	Vaughn Matthew
collection	DOAJ
description	<p>Abstract</p> <p>Background</p> <p>Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories <b>- </b>based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an <it>O</it>(<it>n/p</it>) time parallel algorithm has been given for this problem. Here <it>n </it>is the size of the input and <it>p </it>is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(<it>n</it>Σ) messages (Σ being the size of the alphabet).</p> <p>Results</p> <p>In this paper we present a Θ(<it>n/p</it>) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of <inline-formula><m:math name="1471-2105-11-560-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow><m:mo>Θ</m:mo><m:mo stretchy="false">(</m:mo><m:mfrac><m:mrow><m:mi>n</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>n</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow><m:mrow><m:mi>B</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>M</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac><m:mo stretchy="false">)</m:mo></m:mrow></m:math></inline-formula> (<it>M </it>being the main memory size and <it>B </it>being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster <b>- </b>both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.</p> <p>Conclusions</p> <p>The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.</p>
first_indexed	2024-12-22T06:04:28Z
format	Article
id	doaj.art-51fe4b97cde54b25abde63d1a19714ae
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-12-22T06:04:28Z
publishDate	2010-11-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-51fe4b97cde54b25abde63d1a19714ae2022-12-21T18:36:27ZengBMCBMC Bioinformatics1471-21052010-11-0111156010.1186/1471-2105-11-560Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphsVaughn MatthewDinh HieuRajasekaran SanguthevarKundeti Vamsi KThapar Vishal<p>Abstract</p> <p>Background</p> <p>Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories <b>- </b>based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an <it>O</it>(<it>n/p</it>) time parallel algorithm has been given for this problem. Here <it>n </it>is the size of the input and <it>p </it>is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(<it>n</it>Σ) messages (Σ being the size of the alphabet).</p> <p>Results</p> <p>In this paper we present a Θ(<it>n/p</it>) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of <inline-formula><m:math name="1471-2105-11-560-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow><m:mo>Θ</m:mo><m:mo stretchy="false">(</m:mo><m:mfrac><m:mrow><m:mi>n</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>n</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow><m:mrow><m:mi>B</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>M</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac><m:mo stretchy="false">)</m:mo></m:mrow></m:math></inline-formula> (<it>M </it>being the main memory size and <it>B </it>being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster <b>- </b>both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.</p> <p>Conclusions</p> <p>The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.</p>http://www.biomedcentral.com/1471-2105/11/560
spellingShingle	Vaughn Matthew Dinh Hieu Rajasekaran Sanguthevar Kundeti Vamsi K Thapar Vishal Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs BMC Bioinformatics
title	Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
title_full	Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
title_fullStr	Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
title_full_unstemmed	Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
title_short	Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
title_sort	efficient parallel and out of core algorithms for constructing large bi directed de bruijn graphs
url	http://www.biomedcentral.com/1471-2105/11/560
work_keys_str_mv	AT vaughnmatthew efficientparallelandoutofcorealgorithmsforconstructinglargebidirecteddebruijngraphs AT dinhhieu efficientparallelandoutofcorealgorithmsforconstructinglargebidirecteddebruijngraphs AT rajasekaransanguthevar efficientparallelandoutofcorealgorithmsforconstructinglargebidirecteddebruijngraphs AT kundetivamsik efficientparallelandoutofcorealgorithmsforconstructinglargebidirecteddebruijngraphs AT thaparvishal efficientparallelandoutofcorealgorithmsforconstructinglargebidirecteddebruijngraphs

Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

Similar Items