Bayesian localization of CNV candidates in WGS data within minutes

Abstract. Background: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward–Backward Gibbs sampling using dynamic Haar wavelet compression has...

Bibliographic Details
Main Authors: John Wiedenhoeft, Alex Cagan, Rimma Kozhemyakina, Rimma Gulevich, Alexander Schliep
Format: Article
Language: English
Published: BMC 2019-09-01
Series: Algorithms for Molecular Biology
Subjects: HMM, Wavelet, CNV, Bayesian inference
Online Access: http://link.springer.com/article/10.1186/s13015-019-0154-7
author John Wiedenhoeft
Alex Cagan
Rimma Kozhemyakina
Rimma Gulevich
Alexander Schliep
collection DOAJ
description Abstract
Background: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward–Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice.
Results: In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler.
Conclusions: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop.
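To illustrate what "querying sufficient statistics" for a block of observations means, here is a minimal sketch. It assumes Gaussian emissions, for which the count, sum, and sum of squares of a block are sufficient. It deliberately uses plain prefix sums, which answer queries in constant time but need extra linear memory; it is not the space-efficient, in-place, logarithmic-time structure described in the abstract. The function names and the example signal are hypothetical.

```python
from itertools import accumulate


def build_prefix_sums(values):
    """Return cumulative sums of x and x*x, each with a leading 0."""
    sums = [0.0, *accumulate(values)]
    squares = [0.0, *accumulate(v * v for v in values)]
    return sums, squares


def block_statistics(sums, squares, start, end):
    """Count, sum, and sum of squares for values[start:end]."""
    return end - start, sums[end] - sums[start], squares[end] - squares[start]


# Hypothetical read-depth-like signal with one elevated segment.
signal = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 1.0]
sums, squares = build_prefix_sums(signal)

# Statistics of the elevated block signal[3:7], without rescanning the data.
count, total, total_sq = block_statistics(sums, squares, 3, 7)
mean = total / count
variance = total_sq / count - mean * mean
```

With such statistics available per block, the likelihood of a whole block under a given state can be evaluated at once instead of position by position, which is the kind of saving a compression scheme over the data aims for.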
first_indexed 2024-12-11T18:02:55Z
format Article
id doaj.art-4893cd7a92b74bd1b6c203833c7d5319
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-12-11T18:02:55Z
publishDate 2019-09-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-4893cd7a92b74bd1b6c203833c7d5319
2022-12-22T00:55:50Z
eng
BMC
Algorithms for Molecular Biology
1748-7188
2019-09-01
141116
10.1186/s13015-019-0154-7
Bayesian localization of CNV candidates in WGS data within minutes
John Wiedenhoeft (Department of Computer Science and Engineering, University of Gothenburg | Chalmers)
Alex Cagan (Max Planck Institute for Evolutionary Anthropology)
Rimma Kozhemyakina (Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences)
Rimma Gulevich (Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences)
Alexander Schliep (Department of Computer Science and Engineering, University of Gothenburg | Chalmers)
http://link.springer.com/article/10.1186/s13015-019-0154-7
HMM
Wavelet
CNV
Bayesian inference
title Bayesian localization of CNV candidates in WGS data within minutes
topic HMM
Wavelet
CNV
Bayesian inference
url http://link.springer.com/article/10.1186/s13015-019-0154-7