Summary: | Genomic sequences can be prone to breakage, and particularly fragile stretches of DNA can cause genomic instability and contribute to diseases such as cancer. In contrast to what is known for point mutations, the relationship between DNA sequence context and the propensity for strand breaks remains elusive. By analysing the differences and commonalities across various DNA breakage datasets, this thesis identifies strong sequence-driven patterns influencing DNA fragility. We deconvolved the sequence influences into short-, medium-, and long-range effects. The short-range k-meric fragility scores of all processed DNA breakage datasets were quantified and summarised as a feature library (DNAfrAIlib), designed for seamless integration into feature generation for any sequence-based machine learning task in which accounting for DNA fragility could be useful. We employed these features to develop a generalised machine learning model for DNA fragility, trained on cancer-associated breaks. Applying our model to the entire human genome, we found that structural variants, especially the pathogenic ones, tend to stabilise the regions once they emerge, while chromothripsis events favour less fragile genomic regions. We found that viral integration into the human host, especially that of cancer-associated viruses, could increase genomic fragility. We showed that absent sequences were more fragile than the human genome average. As a proof of concept, we found that incorporating our understanding of the sequence basis of DNA fragility can improve de novo genome assembly algorithms by aiding the selection of higher-quality sequences out of all assembled variants. Overall, this work offers novel insights into the sequence basis of DNA fragility and presents a powerful machine learning resource to further enhance our understanding of genome instability and evolution.
|