Accounting for long-range correlations in genome-wide simulations of large cohorts.
Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individ...
Main Authors: | Dominic Nelson, Jerome Kelleher, Aaron P Ragsdale, Claudia Moreau, Gil McVean, Simon Gravel |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2020-05-01 |
Series: | PLoS Genetics |
Online Access: | https://doi.org/10.1371/journal.pgen.1008619 |
_version_ | 1819141353486942208 |
---|---|
author | Dominic Nelson Jerome Kelleher Aaron P Ragsdale Claudia Moreau Gil McVean Simon Gravel |
author_sort | Dominic Nelson |
collection | DOAJ |
description | Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past. |
first_indexed | 2024-12-22T11:53:06Z |
format | Article |
id | doaj.art-35553f55243a41c2a67cfd265944473f |
institution | Directory Open Access Journal |
issn | 1553-7390 1553-7404 |
language | English |
last_indexed | 2024-12-22T11:53:06Z |
publishDate | 2020-05-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS Genetics |
title | Accounting for long-range correlations in genome-wide simulations of large cohorts. |
url | https://doi.org/10.1371/journal.pgen.1008619 |