Linear time complexity de novo long read genome assembly with GoldRush

Abstract Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap – its most costly step – was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human da...

Bibliographic Details
Main Authors:	Johnathan Wong, Lauren Coombe, Vladimir Nikolić, Emily Zhang, Ka Ming Nip, Puneet Sidhu, René L. Warren, Inanç Birol
Format:	Article
Language:	English
Published:	Nature Portfolio 2023-05-01
Series:	Nature Communications
Online Access:	https://doi.org/10.1038/s41467-023-38716-x

_version_	1797817945857982464
author	Johnathan Wong, Lauren Coombe, Vladimir Nikolić, Emily Zhang, Ka Ming Nip, Puneet Sidhu, René L. Warren, Inanç Birol
author_facet	Johnathan Wong, Lauren Coombe, Vladimir Nikolić, Emily Zhang, Ka Ming Nip, Puneet Sidhu, René L. Warren, Inanç Birol
author_sort	Johnathan Wong
collection	DOAJ
description	Abstract Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap – its most costly step – was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
first_indexed	2024-03-13T09:00:51Z
format	Article
id	doaj.art-d6a30448c4c845bb9eefc3dcce11fd5d
institution	Directory Open Access Journal
issn	2041-1723
language	English
last_indexed	2024-03-13T09:00:51Z
publishDate	2023-05-01
publisher	Nature Portfolio
record_format	Article
series	Nature Communications
spelling	doaj.art-d6a30448c4c845bb9eefc3dcce11fd5d | 2023-05-28T11:22:15Z | eng | Nature Portfolio | Nature Communications | 2041-1723 | 2023-05-01 | 14119 | 10.1038/s41467-023-38716-x | Linear time complexity de novo long read genome assembly with GoldRush | Johnathan Wong [0], Lauren Coombe [1], Vladimir Nikolić [2], Emily Zhang [3], Ka Ming Nip [4], Puneet Sidhu [5], René L. Warren [6], Inanç Birol [7] | [0-7] Canada’s Michael Smith Genome Sciences Centre, BC Cancer | Abstract Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap – its most costly step – was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation. | https://doi.org/10.1038/s41467-023-38716-x
spellingShingle	Johnathan Wong Lauren Coombe Vladimir Nikolić Emily Zhang Ka Ming Nip Puneet Sidhu René L. Warren Inanç Birol Linear time complexity de novo long read genome assembly with GoldRush Nature Communications
title	Linear time complexity de novo long read genome assembly with GoldRush
title_full	Linear time complexity de novo long read genome assembly with GoldRush
title_fullStr	Linear time complexity de novo long read genome assembly with GoldRush
title_full_unstemmed	Linear time complexity de novo long read genome assembly with GoldRush
title_short	Linear time complexity de novo long read genome assembly with GoldRush
title_sort	linear time complexity de novo long read genome assembly with goldrush
url	https://doi.org/10.1038/s41467-023-38716-x
work_keys_str_mv	AT johnathanwong lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT laurencoombe lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT vladimirnikolic lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT emilyzhang lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT kamingnip lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT puneetsidhu lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT renelwarren lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush AT inancbirol lineartimecomplexitydenovolongreadgenomeassemblywithgoldrush

Similar Items