A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Abstract. Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads that are longer than 150 bp up to a total of 600 Gbp of data per run. The high-throughput sequencers generate lengthier reads wit...

Bibliographic Details
Main Authors:	Chang Yu-Jung, Chen Chien-Chih, Chen Chuen-Liang, Ho Jan-Ming
Format:	Article
Language:	English
Published:	BMC 2012-12-01
Series:	BMC Genomics

_version_	1828421559475240960
author	Chang Yu-Jung; Chen Chien-Chih; Chen Chuen-Liang; Ho Jan-Ming
author_facet	Chang Yu-Jung; Chen Chien-Chih; Chen Chuen-Liang; Ho Jan-Ming
author_sort	Chang Yu-Jung
collection	DOAJ
description	Abstract. Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads that are longer than 150 bp up to a total of 600 Gbp of data per run. The high-throughput sequencers generate lengthier reads with greater sequencing depth than those generated by previous technologies. Two major challenges exist in using the high-throughput technology for de novo assembly of genomes. First, the amount of physical memory may be insufficient to store the data structure of the assembly algorithm, even for high-end multicore processors. Moreover, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms. Results: We developed a distributed genome assembler based on string graphs and the MapReduce framework, known as CloudBrush. The assembler includes a novel edge-adjustment algorithm to detect structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph, if necessary. CloudBrush is evaluated against GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50, a low misassembly rate of misjoins, and indels of > 5 bp. In addition, we have introduced two measures, known as precision and recall, to address the issues of faithfully aligned contigs to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush is shown to produce contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
first_indexed	2024-12-10T15:33:14Z
format	Article
id	doaj.art-cf66067bf5e5455680c2351aeeec7a90
institution	Directory Open Access Journal
issn	1471-2164
language	English
last_indexed	2024-12-10T15:33:14Z
publishDate	2012-12-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj.art-cf66067bf5e5455680c2351aeeec7a90 | 2022-12-22T01:43:19Z | eng | BMC | BMC Genomics | 1471-2164 | 2012-12-01 | 13, Suppl 7, S28 | 10.1186/1471-2164-13-S7-S28 | A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework | Chang Yu-Jung; Chen Chien-Chih; Chen Chuen-Liang; Ho Jan-Ming
spellingShingle	Chang Yu-Jung Chen Chien-Chih Chen Chuen-Liang Ho Jan-Ming A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework BMC Genomics
title	A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
title_full	A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
title_fullStr	A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
title_full_unstemmed	A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
title_short	A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
title_sort	de novo next generation genomic sequence assembler based on string graph and mapreduce cloud computing framework

Similar Items