Methods for phasing and imputation of very low coverage sequencing data


Bibliographic Details
Main Author: Kretzschmar, WW
Other Authors: Marchini, J
Format: Thesis
Language: English
Published: 2016
Subjects:
Description
Summary: <p>The introduction of massively parallel short-read sequencing has driven a rapid drop in the cost of DNA sequencing. This has led to substantial growth in the size of human sequencing projects, with consortia of low coverage sequencing data containing tens of thousands of samples. However, current statistical methods for genotype calling from these data scale poorly with sample size and are infeasible to use on the largest current projects. This thesis explores the problem of genotype calling and phasing for large samples of low coverage sequencing data.</p>
<p>Current methods are applied to call and phase genotypes of the CONVERGE consortium, a data set consisting of very low coverage next-generation sequencing data collected from around 12,000 Chinese women. A genotyping accuracy of 92%, as measured by squared Pearson correlation (R<sup>2</sup>) against a SNP genotyping chip, is achieved for minor allele frequencies &gt;5%, demonstrating that very low coverage sequencing can be used instead of SNP genotyping chips to genotype a study of this size.</p>
<p>A new statistical model is described that allows genotype calling and phasing of low coverage sequencing data in O(N log N) time, where N is the sample size, which greatly improves run time compared to current methods. Other adaptations of the model, including a GPU implementation, are also presented.</p>
<p>The new statistical model is used to call and phase genotypes from the largest collection of low coverage sequencing data in the world (about 32,000 Europeans), the Haplotype Reference Consortium (HRC). At a non-reference allele frequency of 0.1%, the HRC haplotypes provide a downstream imputation accuracy of up to 64% R<sup>2</sup>, compared to an R<sup>2</sup> of 36% when using 1000 Genomes Phase 3 haplotypes, the largest publicly available collection of haplotypes derived from low coverage sequencing.</p>
<p>Finally, a web server has been written to allow small numbers of high coverage whole genome sequenced samples to be phased using the HRC panel. The HRC panel is only available to HRC consortium members, but this web server allows the academic community to gain access to the HRC panel for phasing their own samples.</p>
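The accuracy figures in the summary are squared Pearson correlations between called or imputed genotypes and a reference such as a SNP chip. The following is a minimal sketch of that metric only, not the thesis's evaluation pipeline. The function name `squared_pearson_r2` and the toy dosage and chip values are hypothetical and chosen purely for illustration.

```python
import math


def squared_pearson_r2(called, truth):
    """Squared Pearson correlation between two equal-length genotype vectors.

    `called` could hold imputed dosages in [0, 2] and `truth` chip genotypes
    coded 0/1/2. Both names are illustrative assumptions.
    """
    n = len(called)
    if n != len(truth):
        raise ValueError("vectors must have the same length")
    if n < 2:
        raise ValueError("need at least two samples")

    mean_c = sum(called) / n
    mean_t = sum(truth) / n

    # Sums of squared deviations and of cross products.
    ss_c = sum((c - mean_c) ** 2 for c in called)
    ss_t = sum((t - mean_t) ** 2 for t in truth)
    sp = sum((c - mean_c) * (t - mean_t) for c, t in zip(called, truth))

    # The correlation is undefined when either vector has no variance,
    # for example at a site where every sample has the same genotype.
    if ss_c == 0 or ss_t == 0:
        return math.nan

    return (sp * sp) / (ss_c * ss_t)


# Hypothetical toy data: six samples at one site.
chip_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]
r2 = squared_pearson_r2(imputed_dosages, chip_genotypes)
```

In practice such a value would typically be aggregated over many sites within an allele frequency bin, which is how a statement like "R2 for minor allele frequencies above 5%" is usually read; the binning itself is not shown here.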