Summary: | <p>The advent of genome-wide association studies (GWAS) revolutionized the field of complex disease genetics. The primary goal of these studies is to achieve a
better understanding of the biology of a complex disease and to use this knowledge to prevent or better treat it. Genotype imputation is one of
the key steps in such studies. It increases the power of a study by boosting the coverage of genetic variation, and it
provides a natural framework for combining results across association studies that rely on different genotyping platforms.</p>
<p>Genotype imputation is the process of statistical inference of unobserved genotypes in a sample of individuals. In the typical scenario, a reference panel of
haplotypes is used to infer ungenotyped variants in a set of study samples. Increasing the sample size of the reference panel improves imputation accuracy, especially for variants with low minor allele frequencies, but large reference panels also pose a computational challenge for imputation methods. The goal of this dissertation is to develop methods and protocols that scale genotype imputation to very large next-generation reference panels.</p>
<p>The main contribution of this dissertation is a genotype imputation method named IMPUTE5. It achieves fast, accurate, and memory-efficient imputation
by selecting a small number of reference panel haplotypes using the Positional Burrows-Wheeler Transform. Imputation is performed using only the selected
haplotypes, leading to a dramatic speed-up with no loss in accuracy compared to other methods. To help other researchers perform GWAS, a
protocol for imputing from a reference panel of phased haplotypes into a genome-wide association dataset is also described.</p>
<p>We also present an efficient C++ implementation of the Positional Burrows-Wheeler Transform which allows fast string matching in a set of haplotypes. An
evaluation of previous state selection algorithms is provided together with a qualitative measure of the accuracy of chromosome painting performed by IMPUTE5.</p>
<p>A major application is the imputation of the UK Biobank dataset using the 100,000 Genomes Project reference panel. We show how the reference panel has
been created and how imputation will be performed. The imputed UK Biobank dataset represents an extremely valuable resource for researchers, potentially providing new insights for genome-wide association studies.</p>
|