Linear time complexity de novo long read genome assembly with GoldRush



Bibliographic Details
Main Authors: Johnathan Wong, Lauren Coombe, Vladimir Nikolić, Emily Zhang, Ka Ming Nip, Puneet Sidhu, René L. Warren, Inanç Birol
Format: Article
Language: English
Published: Nature Portfolio 2023-05-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-023-38716-x
author Johnathan Wong
Lauren Coombe
Vladimir Nikolić
Emily Zhang
Ka Ming Nip
Puneet Sidhu
René L. Warren
Inanç Birol
collection DOAJ
description Abstract Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap – its most costly step – was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
first_indexed 2024-03-13T09:00:51Z
format Article
id doaj.art-d6a30448c4c845bb9eefc3dcce11fd5d
institution Directory Open Access Journal
issn 2041-1723
language English
last_indexed 2024-03-13T09:00:51Z
publishDate 2023-05-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling doaj.art-d6a30448c4c845bb9eefc3dcce11fd5d; 2023-05-28T11:22:15Z; eng; Nature Portfolio; Nature Communications; 2041-1723; 2023-05-01; 14119; 10.1038/s41467-023-38716-x; Linear time complexity de novo long read genome assembly with GoldRush; Johnathan Wong, Lauren Coombe, Vladimir Nikolić, Emily Zhang, Ka Ming Nip, Puneet Sidhu, René L. Warren, Inanç Birol (all authors: Canada’s Michael Smith Genome Sciences Centre, BC Cancer); https://doi.org/10.1038/s41467-023-38716-x
title Linear time complexity de novo long read genome assembly with GoldRush
url https://doi.org/10.1038/s41467-023-38716-x