Sumari: | <p>Mutations that affect RNA splicing can have severe phenotypic consequences, and contribute to rare and sporadic human disease. Whole-genome sequencing promises to improve diagnosis, but it is often difficult to identify mutations that disrupt splicing, except when they affect canonical donor and acceptor sites. In particular, cryptic mutations that cause novel splice junctions to appear deep within introns of genes are very hard to identify. Accurate computational models are therefore crucial for effective diagnosis.</p>
<p>The availability of large amounts of expression data across a range of genomes and cell types, and the development of neural network technologies that can learn features directly from data, together provide an opportunity for developing computational models that predict splicing activity directly from genome sequence.</p>
<p>In this work, I first develop and validate a methodology to improve the training of neural networks operating on sequence data using the formalism of equivariant maps. I demonstrate increased representational stability for networks constructed using these techniques and subsequently use this approach to discover and quantify novel binding sites for PRDM9.</p>
<p>Next, I develop models that simultaneously predict splice site location and exon inclusion ratios using modifications to the dilated convolutional network, and show state-of-the-art performance on this task. I apply these models to mutation databases and find that they can make interpretable predictions about the consequences of deep intronic mutations, explaining 27% of pathogenic cryptic splice variants.</p>
|