Accounting for long-range correlations in genome-wide simulations of large cohorts.

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individ...

Bibliographic Details
Main Authors: Dominic Nelson, Jerome Kelleher, Aaron P Ragsdale, Claudia Moreau, Gil McVean, Simon Gravel
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-05-01
Series: PLoS Genetics
Online Access: https://doi.org/10.1371/journal.pgen.1008619
collection DOAJ
description Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
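The hybrid scheme mentioned at the end of the abstract (Wright-Fisher for the recent past, coalescent for the distant past) can be illustrated with a toy sketch. This is not the msprime implementation: it assumes a single non-recombining locus, a haploid population of constant size, and hypothetical names (`hybrid_tmrca`, `pop_size`, `wf_generations`) chosen only for illustration.

```python
import random


def hybrid_tmrca(n_samples, pop_size, wf_generations, rng):
    """Time (in generations) to the most recent common ancestor of
    n_samples lineages at one non-recombining locus.

    Toy assumptions: haploid population, constant size pop_size.
    """
    k = n_samples  # number of distinct ancestral lineages
    t = 0.0

    # Phase 1: discrete-time Wright-Fisher for the recent past.
    # Each lineage picks a parent uniformly at random; lineages that
    # pick the same parent merge, so multiple mergers per generation
    # are possible.
    while k > 1 and t < wf_generations:
        parents = {rng.randrange(pop_size) for _ in range(k)}
        k = len(parents)
        t += 1

    # Phase 2: continuous-time coalescent for the distant past.
    # With k lineages, the waiting time to the next pairwise merger is
    # exponential with rate k(k-1)/(2N) per generation (haploid N).
    while k > 1:
        rate = k * (k - 1) / (2 * pop_size)
        t += rng.expovariate(rate)
        k -= 1

    return t


if __name__ == "__main__":
    rng = random.Random(42)
    times = [hybrid_tmrca(10, 100, 20, rng) for _ in range(1000)]
    print("mean TMRCA (generations):", sum(times) / len(times))
```

Setting `wf_generations` to 0 reduces the sketch to a pure coalescent, which makes it easy to compare the two regimes on the same parameters. In msprime 1.x the analogous models are, to the best of my knowledge, exposed as `DiscreteTimeWrightFisher` and `StandardCoalescent`; consult the msprime documentation for the exact usage.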
id doaj.art-35553f55243a41c2a67cfd265944473f
institution Directory Open Access Journal
issn 1553-7390
1553-7404