Discovering genetic variation in populations using next generation sequencing and de novo assembly



Bibliographic details
Main author: Turner, I
Other authors: McVean, G
Format: Thesis
Language: English
Published: 2018
Description
Summary: <p>The de Bruijn Graph is a simple and efficient data structure that is used in many areas of sequence analysis, including genome assembly, sequencing error correction, sequencing read compression and variant discovery. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn Graph does not retain the long-range information that is inherent in the read data. For this reason, applications that rely on de Bruijn Graphs can produce sub-optimal results given their input.</p>
<p>We present a novel assembly graph data structure: the Linked de Bruijn Graph. Constructed by adding annotations on top of a de Bruijn Graph, it stores long-range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn Graph. Through simulation we show that links improve performance and reduce the sensitivity to the parameter k. Many algorithms can be built on top of Linked de Bruijn Graphs, which we illustrate by implementing read error correction, variant calling, genotyping and assembly algorithms in a software package called ‘McCortex’.</p>
<p>With assembly simulations we demonstrate that the Linked de Bruijn Graph data structure outperforms both the de Bruijn Graph and the String Graph Assembler (SGA). Using human whole genome sequence, we show that Linked de Bruijn Graphs scale up to mammalian genomes. Finally, we apply McCortex to Klebsiella pneumoniae short read data to call SNPs, indels and large events, which we validate using PacBio sequencing data. Although k-mer-based methods have reduced sensitivity for calling SNPs, McCortex finds 3x more polymorphic bases due to small events in K. pneumoniae versus mapping-based approaches and 48% more than existing de Bruijn Graph de novo assembly methods. We also find multiple 10 kbp events that cannot be resolved with a plain de Bruijn Graph.</p>
<p>In conclusion, the Linked de Bruijn Graph is a versatile data structure with utility in multi-sample variant calling of indels and large events, in both de novo and reference-aware scenarios.</p>
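
The contrast the summary draws between a plain and a linked de Bruijn graph can be illustrated with a toy sketch. It is not McCortex's implementation or file format: the function names (`build_de_bruijn`, `build_links`, `walk`) and the way links are stored are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and coverage. Two reads share the 3-mer `GCT`, so a plain traversal stalls at that fork, while link annotations remembered from the reads carry the traversal through it.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def build_links(reads, graph, k):
    """Hypothetical link annotation: for every k-mer of a read, remember the
    bases the read chose at each fork (out-degree > 1) strictly after it."""
    links = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for i in range(len(kmers)):
            choices = tuple(
                kmers[j + 1][-1]
                for j in range(i + 1, len(kmers) - 1)
                if len(graph.get(kmers[j], ())) > 1
            )
            if choices:
                links[kmers[i]].add(choices)
    return links


def walk(graph, links, start):
    """Extend a contig from `start`; at a fork, follow the links picked up so
    far if they all agree, otherwise stop."""
    contig, node = start, start
    pending = []          # link tuples collected from k-mers already visited
    seen = {start}        # guards against looping forever on a cycle
    while True:
        successors = graph.get(node, set())
        if not successors:
            break
        if len(successors) == 1:
            nxt = next(iter(successors))
        else:
            votes = {p[0] for p in pending if p}
            if len(votes) != 1:
                break     # no link, or conflicting links: cannot resolve
            nxt = node[1:] + votes.pop()
            if nxt not in successors:
                break
            pending = [p[1:] for p in pending if p]  # consume the used choice
        # Links stored at `node` describe forks after it, so add them only now.
        pending.extend(links.get(node, ()))
        if nxt in seen:
            break
        seen.add(nxt)
        contig += nxt[-1]
        node = nxt
    return contig


reads = ["AAGCTCA", "TTGCTGG"]
graph = build_de_bruijn(reads, 3)
links = build_links(reads, graph, 3)
print(walk(graph, {}, "AAG"))     # plain graph stops at the fork: AAGCT
print(walk(graph, links, "AAG"))  # links resolve the fork: AAGCTCA
```

In this sketch the traversal stops whenever the collected links disagree; a real tool would need a policy for conflicting or erroneous links, which is deliberately left out here.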