Development and application of computational methods to study DNA modifications


Bibliographic Details
Main Author: Velikova, GV
Other Authors: Schuster-Böckler, B
Format: Thesis
Language: English
Published: 2020
Subjects:
Description
Summary:<p>Epigenetic modifications of DNA shape cell fate in development, differentiation, and disease. The existing gold-standard sequencing technologies for epigenetic DNA modifications are based on sodium bisulfite, a harsh chemical treatment that degrades DNA. A novel bisulfite-free, base-resolution sequencing method, TET Assisted Pyridine-borane Sequencing (TAPS), was developed to detect the most abundant DNA modifications. In contrast to bisulfite sequencing (BS), TAPS relies on mild reactions to detect modified bases. From a bioinformatics perspective, sodium bisulfite treatment substantially reduces the information content of the sequence, complicating data processing and the detection of genetic variation. In fact, most existing modification-calling tools for bisulfite-treated data do not distinguish between modifications and genetic variants, which results in false positives. A computational tool, asTair, was created to process DNA modification sequencing data. It was designed primarily for handling TAPS sequencing output, but also contains functions that are useful for bisulfite sequencing data analysis. TAPS was shown to have more even coverage than BS with a comparable conversion rate over CpGs, and to be applicable to low-input samples. A Deep Neural Network (DNN) model that detects single-nucleotide variants in TAPS- and BS-converted sequencing data was created to enable sensitive modification and variant calling. The algorithm showed precision and recall above 0.9 for classifying variants, modifications and reference positions. The model outperformed available variant callers for whole-genome sequencing and BS data. Applying such a model to real datasets could improve the accuracy of identifying true DNA modifications masked by genetic variation and errors, as around a sixth of all single-nucleotide polymorphisms (SNPs) could be misclassified as modifications.</p>
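
The confusion between modifications and genetic variants described in the summary can be illustrated with a toy simulation. The sketch below is not taken from the thesis or from asTair; it assumes idealised, fully efficient chemistry on the forward strand only, where bisulfite reads out unmodified cytosines as T and TAPS reads out modified cytosines as T. The function name `convert`, the example read, and the modified positions are all hypothetical.

```python
def convert(read, modified, chemistry):
    """Simulate idealised conversion of a forward-strand read.

    read      -- DNA sequence (string of A, C, G, T)
    modified  -- set of 0-based positions holding a modified cytosine
    chemistry -- "BS" (unmodified C read as T) or "TAPS" (modified C read as T)
    """
    if chemistry not in ("BS", "TAPS"):
        raise ValueError("chemistry must be 'BS' or 'TAPS'")
    out = []
    for i, base in enumerate(read):
        if base != "C":
            out.append(base)  # only cytosines are affected in this toy model
            continue
        is_modified = i in modified
        if chemistry == "BS":
            out.append("C" if is_modified else "T")
        else:  # TAPS
            out.append("T" if is_modified else "C")
    return "".join(out)


reference = "ACGTCCGATCGA"   # hypothetical reference sequence
modified = {1, 5}            # hypothetical modified cytosines

bs_read = convert(reference, modified, "BS")      # most C's become T
taps_read = convert(reference, modified, "TAPS")  # only modified C's become T

# A sample carrying a C>T SNP at position 4 and no modifications at all...
snp_sample = reference[:4] + "T" + reference[5:]
snp_read = convert(snp_sample, set(), "TAPS")
# ...yields the same read as a reference-matching sample modified at position 4.
lookalike_read = convert(reference, {4}, "TAPS")

print(bs_read, taps_read, snp_read == lookalike_read)
```

In this toy model the bisulfite read loses most of its cytosines, while the TAPS read changes only at modified positions. The last comparison shows why a caller that looks only at C-to-T differences from the reference cannot tell a C>T variant from a modification.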