Learning graph representations for disease gene prediction

Bibliographic Details
Main Author: Ata, Kircali Sezin
Other Authors: -
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/137913
Description
Summary: The analysis of disease-causing conditions based on genes and their protein products plays a crucial role in the diagnosis and treatment of several serious diseases such as cancer and diabetes. Since experimental techniques are time-consuming and expensive, computational methods remain important for revealing the functional roles of genes and proteins in the context of many diseases. Molecular networks such as protein-protein interaction and co-expression networks are useful computational tools for uncovering the underlying mechanisms of diseases by studying the complex interplay between genes. However, molecular networks are often noisy and incomplete. Therefore, computational approaches are designed so that they can complement and enhance network data. The vast amount of accumulated data from various scientific domains is exploited to extract useful information for disease gene prediction efficiently. The goal of this research is to develop techniques that extract robust network-based feature representations for predicting disease-causing genes from a wider perspective, by combining the topological characteristics of molecular networks with biological knowledge such as gene ontology and protein domains. However, exploiting the topological arrangement of proteins in terms of not only the interactions between them but also their relevance based on other biological properties is a further challenge. Furthermore, hand-engineering complex network features requires tedious effort and domain expertise. Automating the extraction of these topological features and combining them with the biological aspects of proteins is a challenge to be addressed. For this purpose, node embedding models are applied to automate the extraction of useful feature representations for disease gene prediction. In addition, various molecular networks provide deeper insight into genes and diseases from multiple perspectives.
Thus, a unified computational framework can leverage these various perspectives to generate more robust representations. Designing models that can automate the extraction of feature representations across multiple networks is essential to achieving higher prediction performance. In this thesis, we propose three computational frameworks to predict candidate disease genes:

1. The first proposed method, called Metagraph, aims to predict candidate disease genes. To complement and enrich protein-protein interaction (PPI) networks, Metagraph leverages the biological properties of individual proteins by integrating their ontological properties, referred to as keywords, into the PPI network. It constructs a novel PPI-Keywords (PPIK) network composed of proteins and keywords as two different types of nodes. As disease proteins tend to exhibit similar topological properties on the PPIK network, we further propose to represent proteins with metagraphs. Unlike a traditional network motif or subgraph, a metagraph can capture topological arrangements involving both the interactions between proteins and the associations between proteins and keywords. Thus, proteins that are not neighbors in a noisy PPI network have a better chance of being topologically similar through keywords. Extended metagraph representations that consider disease occurrences, called Metagraph+, are fed into various classifiers for disease protein prediction in a supervised manner. Experiments show that Metagraph+ consistently improves disease protein prediction on three different PPI databases and outperforms state-of-the-art baselines, including both diffusion-based and module-based methods. In addition, the predictions of Metagraph+ correlate better with literature findings from the PubMed database.

2. Numerous methods for discovering disease-associated genes and their role in disease development have been proposed over the last decades. Yet features automatically extracted from a molecular network have not been exploited for disease gene prediction. The second proposed technique in this thesis is an integrative framework called N2VKO, which adopts a well-known representation learning method, node2vec. We combine the network-based feature representations of genes obtained by node2vec with biological aspects of proteins. Then, we apply various feature selection methods and analyze their performance on the disease gene prediction task. As the data for disease gene prediction is imbalanced, we address this issue by applying oversampling techniques to our novel representations to improve prediction performance. Extensive experiments show that N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases. Moreover, the categories of the biological aspects within the N2VKO representations are listed to analyze their role in disease formation. We also provide literature evidence for the N2VKO biological features for lung cancer. Finally, literature evidence from the PubMed database demonstrates the effectiveness of the proposed N2VKO framework for disease gene prediction.

3. The third proposed framework in this thesis addresses the candidate disease gene prediction problem. Since various biological networks provide different insights into genes, combining them so that they complement each other in a unified framework is necessary to improve disease gene prediction performance. We propose a novel unsupervised algorithm for "Multi-view network embedding with Intra-Cross" consistencies (MICROS). Through a multi-view network embedding framework, this approach learns low-dimensional representations that can be fed into various downstream tasks. MICROS is based on two well-known principles: diversity and collaboration. The former enables views to maintain their topological characteristics, while the latter enables views to work together and reinforce each other. Unlike existing methods, we also examine a novel form of higher-order collaboration that has not previously been explored on multi-view networks, and we integrate it into a unifying framework of consistencies to provide more robust, superior node representations. Finally, we conduct extensive experiments on three real-world multi-view networks. Our results demonstrate that the learned representations consistently outperform state-of-the-art approaches on various downstream tasks, namely node-level tasks (i.e., classification and clustering), relationship mining, and link prediction.
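
The PPIK idea from the first framework can be illustrated with a minimal Python sketch. It is not the thesis implementation: the protein names, keywords, and the three small patterns counted here are hypothetical stand-ins chosen for illustration, and the actual metagraph set used by Metagraph is not given in this summary. The sketch only shows how adding keyword nodes lets two proteins be related even when no interaction edge connects them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (not from the thesis): PPI edges and keyword annotations.
ppi_edges = [("P1", "P2"), ("P2", "P3")]
keywords = {
    "P1": {"kinase", "membrane"},
    "P2": {"kinase"},
    "P3": {"transport"},
    "P4": {"membrane", "transport"},
}

def build_ppik(ppi_edges, keywords):
    """Return (protein-protein adjacency, keyword -> annotated proteins)."""
    neighbors = defaultdict(set)
    for a, b in ppi_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    members = defaultdict(set)
    for protein, kws in keywords.items():
        for kw in kws:
            members[kw].add(protein)
    return neighbors, members

def pattern_counts(neighbors, members, proteins):
    """Count, per protein, three illustrative typed patterns it takes part in."""
    names = ("interaction", "shared_keyword", "interaction_with_shared_keyword")
    counts = {p: dict.fromkeys(names, 0) for p in proteins}
    for p in proteins:
        # Pattern 1: protein - protein edge.
        counts[p]["interaction"] = len(neighbors[p])
    for group in members.values():
        for a, b in combinations(sorted(group), 2):
            # Pattern 2: protein - keyword - protein path.
            counts[a]["shared_keyword"] += 1
            counts[b]["shared_keyword"] += 1
            if b in neighbors[a]:
                # Pattern 3: interacting pair that also shares a keyword.
                counts[a]["interaction_with_shared_keyword"] += 1
                counts[b]["interaction_with_shared_keyword"] += 1
    return counts

proteins = sorted(set(keywords) | {p for edge in ppi_edges for p in edge})
neighbors, members = build_ppik(ppi_edges, keywords)
counts = pattern_counts(neighbors, members, proteins)

for protein in proteins:
    print(protein, counts[protein])
```

In this toy data, P4 has no interaction partners, yet it shares one keyword with P1 and another with P3, so its feature vector is not empty. Each protein's pattern counts could then serve as a feature vector for a supervised classifier.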
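
The feature combination and oversampling steps of the second framework can be sketched in the same spirit. This is a sketch under stated assumptions, not the N2VKO pipeline: the embedding values below are placeholder numbers rather than node2vec output, the gene names and keyword flags are hypothetical, and plain random oversampling with replacement is used because the summary does not say which oversampling technique the thesis applies.

```python
import random

# Hypothetical inputs: placeholder embeddings (not real node2vec output),
# binary keyword indicators, and disease labels (1 = disease gene).
embeddings = {
    "G1": [0.12, -0.40],
    "G2": [0.05, 0.33],
    "G3": [-0.21, 0.10],
    "G4": [0.30, 0.02],
}
keyword_flags = {
    "G1": [1, 0, 1],
    "G2": [0, 0, 1],
    "G3": [0, 1, 0],
    "G4": [1, 1, 0],
}
labels = {"G1": 1, "G2": 0, "G3": 0, "G4": 0}

def combine_features(embedding, flags):
    """Concatenate a network embedding with binary keyword indicators."""
    return list(embedding) + [float(flag) for flag in flags]

def random_oversample(features, targets, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, target in zip(features, targets):
        by_class.setdefault(target, []).append(row)
    largest = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for target, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(largest - len(rows))]
        for row in rows + extra:
            out_x.append(row)
            out_y.append(target)
    return out_x, out_y

genes = sorted(embeddings)
X = [combine_features(embeddings[g], keyword_flags[g]) for g in genes]
y = [labels[g] for g in genes]
X_bal, y_bal = random_oversample(X, y)

print(len(X), "rows before oversampling;", len(X_bal), "rows after")
```

Oversampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.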