Large scale sequence alignment via efficient inference in generative models

Abstract Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretic...

Full description

Bibliographic Details
Main Authors:	Mihir Mongia, Chengze Shen, Arash Gholami Davoodi, Guillaume Marçais, Hosein Mohimani
Format:	Article
Language:	English
Published:	Nature Portfolio 2023-05-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-023-34257-x

_version_	1797832087317774336
author	Mihir Mongia Chengze Shen Arash Gholami Davoodi Guillaume Marçais Hosein Mohimani
author_facet	Mihir Mongia Chengze Shen Arash Gholami Davoodi Guillaume Marçais Hosein Mohimani
author_sort	Mihir Mongia
collection	DOAJ
description	Abstract Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long-reads from Pacific Bioscience sequencers to genome sequences.
first_indexed	2024-04-09T14:02:07Z
format	Article
id	doaj.art-5c3408cd12bf4298a946409433ace604
institution	Directory Open Access Journal
issn	2045-2322
language	English
last_indexed	2024-04-09T14:02:07Z
publishDate	2023-05-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj.art-5c3408cd12bf4298a946409433ace6042023-05-07T11:13:49ZengNature PortfolioScientific Reports2045-23222023-05-0113111110.1038/s41598-023-34257-xLarge scale sequence alignment via efficient inference in generative modelsMihir Mongia0Chengze Shen1Arash Gholami Davoodi2Guillaume Marçais3Hosein Mohimani4School Computer Science, Carnegie Mellon UniversitySchool Computer Science, Carnegie Mellon UniversitySchool Computer Science, Carnegie Mellon UniversitySchool Computer Science, Carnegie Mellon UniversitySchool Computer Science, Carnegie Mellon UniversityAbstract Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long-reads from Pacific Bioscience sequencers to genome sequences.https://doi.org/10.1038/s41598-023-34257-x
spellingShingle	Mihir Mongia Chengze Shen Arash Gholami Davoodi Guillaume Marçais Hosein Mohimani Large scale sequence alignment via efficient inference in generative models Scientific Reports
title	Large scale sequence alignment via efficient inference in generative models
title_full	Large scale sequence alignment via efficient inference in generative models
title_fullStr	Large scale sequence alignment via efficient inference in generative models
title_full_unstemmed	Large scale sequence alignment via efficient inference in generative models
title_short	Large scale sequence alignment via efficient inference in generative models
title_sort	large scale sequence alignment via efficient inference in generative models
url	https://doi.org/10.1038/s41598-023-34257-x
work_keys_str_mv	AT mihirmongia largescalesequencealignmentviaefficientinferenceingenerativemodels AT chengzeshen largescalesequencealignmentviaefficientinferenceingenerativemodels AT arashgholamidavoodi largescalesequencealignmentviaefficientinferenceingenerativemodels AT guillaumemarcais largescalesequencealignmentviaefficientinferenceingenerativemodels AT hoseinmohimani largescalesequencealignmentviaefficientinferenceingenerativemodels

Large scale sequence alignment via efficient inference in generative models

Similar Items