Genotype imputation methods for next generation datasets

Bibliographic Details
Main Author: Rubinacci, S
Other Authors: Marchini, J
Format: Thesis
Language: English
Published: 2020
Subjects:
Description
Summary: The advent of genome-wide association studies (GWAS) revolutionized the field of complex disease genetics. The primary goal of these studies is to better understand the biology of a complex disease and to use this knowledge to prevent or better treat disease. Genotype imputation is one of the key steps in such studies. It increases the power of a study by boosting the coverage of genetic variation, and it provides a natural framework for combining results across association studies that rely on different genotyping platforms.

Genotype imputation is the statistical inference of unobserved genotypes in a sample of individuals. In the typical scenario, a reference panel of haplotypes is used to infer ungenotyped variants in a set of study samples. Increasing the sample size of the reference panel improves imputation accuracy, especially for variants with low minor allele frequencies, but large reference panels also pose a computational challenge for imputation methods. The goal of this dissertation is to develop methods and protocols that scale genotype imputation to very large next-generation reference panels.

The main contribution of this dissertation is a genotype imputation method named IMPUTE5. It achieves fast, accurate, and memory-efficient imputation by selecting a small number of reference panel haplotypes using the Positional Burrows-Wheeler Transform. Imputation is performed using only the selected haplotypes, leading to a dramatic speed-up with no loss in accuracy compared to other methods. To help other researchers perform GWAS, a protocol for imputing from a reference panel of phased haplotypes into a genome-wide association dataset is also described.

We also present an efficient C++ implementation of the Positional Burrows-Wheeler Transform, which allows fast string matching in a set of haplotypes. An evaluation of previous state selection algorithms is provided, together with a qualitative measure of the accuracy of chromosome painting performed by IMPUTE5.

A major application is the imputation of the UK Biobank dataset using the 100,000 Genomes Project reference panel. We show how the reference panel has been created and how imputation will be performed. The imputed UK Biobank dataset represents an extremely valuable resource for researchers, potentially providing new insights for genome-wide association studies.
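
For readers unfamiliar with the Positional Burrows-Wheeler Transform mentioned in the summary, the following is a minimal Python sketch of its core step, the positional prefix array. It is an illustration only, not the thesis's C++ implementation and not IMPUTE5's state selection. The function name `pbwt_prefix_arrays` and the input layout (one list of 0/1 alleles per haplotype) are assumptions made for this example.

```python
from typing import List, Sequence


def pbwt_prefix_arrays(haps: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return the positional prefix arrays a_0..a_N for biallelic haplotypes.

    haps[i][k] is the allele (0 or 1) of haplotype i at site k.
    arrays[k] lists haplotype indices sorted by their reversed prefix
    over sites 0..k-1, so haplotypes that match over a long stretch
    ending just before site k sit next to each other.
    """
    if not haps:
        return [[]]
    n_sites = len(haps[0])
    if any(len(h) != n_sites for h in haps):
        raise ValueError("all haplotypes must cover the same sites")

    order = list(range(len(haps)))  # a_0: the input order
    arrays = [order]
    for k in range(n_sites):
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] != 0]
        order = zeros + ones
        arrays.append(order)
    return arrays
```

Each site is processed in a single pass over the haplotypes, so building all arrays takes time proportional to the number of haplotypes times the number of sites. In the final array, haplotypes that are identical across all sites appear next to each other, and this adjacency property is what makes fast matching within a large panel possible.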