Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies.

Bibliographic Details
Main Authors: Ralph, P, Thornton, K, Kelleher, J
Format: Journal article
Language: English
Published: Genetics Society of America 2020
author Ralph, P
Thornton, K
Kelleher, J
collection OXFORD
description As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates "sample weights" within the genealogical tree at each position on the genome, which are then combined using a "summary function"; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite-sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding "branch" statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.
first_indexed 2024-03-07T02:24:07Z
format Journal article
id oxford-uuid:a500482b-dffe-438b-b91b-27efc93f7ef6
institution University of Oxford
language English
last_indexed 2024-03-07T02:24:07Z
publishDate 2020
publisher Genetics Society of America
record_format dspace
spelling oxford-uuid:a500482b-dffe-438b-b91b-27efc93f7ef6; 2022-03-27T02:37:30Z; Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes; Journal article; http://purl.org/coar/resource_type/c_dcae04bc; uuid:a500482b-dffe-438b-b91b-27efc93f7ef6; English; Symplectic Elements; Genetics Society of America; 2020; Ralph, P; Thornton, K; Kelleher, J
title Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes
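
The abstract's recipe (accumulate "sample weights" within a tree, then combine them with a "summary function") can be illustrated on a single toy tree. The sketch below is not the authors' implementation and does not use the succinct tree sequence encoding: the tree, node times, function names, and the choice of nucleotide diversity as the statistic are all assumptions made for illustration. It uses unit weights per sample and a summary function giving the fraction of sample pairs that a branch separates. The "branch" result then equals the mean pairwise branch length, and the "site" result equals the mean number of pairwise differences at the given mutations.

```python
from fractions import Fraction

# Hypothetical toy tree: child -> parent, plus node times (samples at time 0).
PARENT = {0: 3, 1: 3, 3: 4, 2: 4}
TIME = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2}
SAMPLES = [0, 1, 2]


def accumulate_weights(parent, weights):
    """Add each sample's weight to every node on its path to the root."""
    total = {}
    for sample, w in weights.items():
        node = sample
        while node is not None:
            total[node] = total.get(node, 0) + w
            node = parent.get(node)  # None once we step past the root
    return total


def diversity_summary(x, n):
    """Fraction of the n*(n-1)/2 sample pairs split by a branch above x samples."""
    return Fraction(2 * x * (n - x), n * (n - 1))


def branch_stat(parent, time, samples, summary):
    """Branch mode: sum of (branch length) * summary(weight below the branch)."""
    n = len(samples)
    below = accumulate_weights(parent, {s: 1 for s in samples})
    return sum(
        (time[p] - time[c]) * summary(below.get(c, 0), n)
        for c, p in parent.items()
    )


def site_stat(parent, samples, mutation_nodes, summary):
    """Site mode: sum of summary(weight below) over the node above each mutation."""
    n = len(samples)
    below = accumulate_weights(parent, {s: 1 for s in samples})
    return sum(summary(below.get(node, 0), n) for node in mutation_nodes)


if __name__ == "__main__":
    print(branch_stat(PARENT, TIME, SAMPLES, diversity_summary))
    # Assumed mutations: one on the branch above node 3, one above node 0.
    print(site_stat(PARENT, SAMPLES, [3, 0], diversity_summary))
```

Swapping in a different weight assignment or summary function yields a different statistic, which is the sense in which the framework is general. The efficiency reported in the abstract comes from working across many trees along the genome, which this single-tree sketch does not attempt.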