An improved biclustering algorithm with overlapping control for identification of informative genes and pathways

Bibliographic Details
Main Author: Mohammad Kusairi, Rohani
Format: Thesis
Language: English
Published: 2021
Subjects:
Online Access: http://eprints.utm.my/102857/1/RohaniMohammadMSC2021.pdf.pdf
Description
Summary: With the rise of microarray technology, many tools and methods, such as clustering analysis, have been developed to analyse the large volume of gene expression data. Clustering analysis is used for purposes such as functional annotation, tissue classification and motif identification. Clustering methods have advanced the analysis of genetic data by grouping genes with similar expression patterns into one cluster, so that those genes can be analysed further to extract potential biological information. Traditional clustering methods group genes that behave similarly under all conditions but cannot group in two dimensions simultaneously. As a result, the clusters obtained contain either all rows or all columns of the data matrix, and thus ignore local co-expression effects that are present in only a subset of the biological samples. In addition, traditional clustering methods cannot assign a gene to multiple clusters, which does not reflect the natural behaviour of genes: a gene can have more than one function and can participate in multiple pathways. Because of these limitations, biclustering was introduced to identify local patterns in the data by clustering the gene dimension and the condition dimension simultaneously. The local correlation information between subsets of genes and conditions is then used to improve the accuracy of the clustering results. However, overlapping is another issue in biclustering. Because some genes belong to multiple functional categories, overlap can be regarded as a natural property of biclusters, but the overlap among biclusters needs to be controlled to prevent redundancy among the biclusters formed.
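The idea of controlling overlap among biclusters can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the overlap measure (shared cells divided by the size of the smaller bicluster), the greedy largest-first filtering, the `max_overlap` threshold and all function names are assumptions chosen for illustration only.

```python
from typing import FrozenSet, List, Tuple

# A bicluster is modelled as (set of genes, set of conditions).
# This representation is an illustrative assumption.
Bicluster = Tuple[FrozenSet[str], FrozenSet[str]]


def area(bc: Bicluster) -> int:
    """Number of matrix cells covered by a bicluster."""
    return len(bc[0]) * len(bc[1])


def overlap_fraction(a: Bicluster, b: Bicluster) -> float:
    """Fraction of the smaller bicluster's cells that also lie in the other."""
    shared = len(a[0] & b[0]) * len(a[1] & b[1])
    smaller = min(area(a), area(b))
    return shared / smaller if smaller else 0.0


def filter_overlapping(biclusters: List[Bicluster],
                       max_overlap: float = 0.5) -> List[Bicluster]:
    """Greedily keep biclusters, largest first, dropping any candidate whose
    overlap with an already kept bicluster exceeds max_overlap (hypothetical
    threshold)."""
    kept: List[Bicluster] = []
    for bc in sorted(biclusters, key=area, reverse=True):
        if all(overlap_fraction(bc, k) <= max_overlap for k in kept):
            kept.append(bc)
    return kept


# Toy input with made-up gene and sample labels.
b1 = (frozenset({"g1", "g2", "g3"}), frozenset({"s1", "s2", "s3"}))
b2 = (frozenset({"g1", "g2"}), frozenset({"s1", "s2", "s3"}))  # inside b1
b3 = (frozenset({"g3", "g4"}), frozenset({"s3", "s4"}))        # shares one cell with b1
result = filter_overlapping([b1, b2, b3], max_overlap=0.5)
```

With this toy input, `b2` is fully contained in `b1` and is dropped, while `b3` shares only one cell with `b1` and is kept. Some overlap is therefore tolerated, consistent with genes belonging to several functional categories, while redundant biclusters are removed.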
This research proposed an improved overlapping control for biclustering algorithms to identify informative genes from gene expression data. Overlapping control is crucial for limiting redundancy among the biclusters produced, which indirectly reduces the number of biclusters obtained. Experiments were conducted on two microarray data sets (an ovarian cancer dataset and a glioblastoma cancer dataset). The results were evaluated using 10-fold cross-validation and compared with the Qualitative Biclustering Algorithm (QUBIC). The results were further analysed in terms of accuracy, standard deviation, variance and t-test, and the proposed method achieved higher accuracy on the ovarian dataset (96.54%) and the glioblastoma dataset (75.68%). When tested with an SVM classifier, the method showed a consistent improvement in bicluster accuracy over QUBIC. Biological context verification was then conducted to elucidate the relation of the selected genes (such as ERBB2, VCAM1 and CD3D) and pathways (Endocytosis pathway, Bladder Cancer pathway and Pancreatic Cancer pathway) to the phenotype under study.
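The statistical comparison described above (accuracy, standard deviation, variance and t-test across cross-validation folds) could be computed along the following lines. This is a sketch under stated assumptions: the summary does not say which t-test was used, so a paired test over matched folds is assumed here, the function names are hypothetical, and the per-fold accuracies are placeholder values, not results from the thesis.

```python
import math
import statistics
from typing import Dict, List


def summarize(accuracies: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation and sample variance of fold accuracies."""
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "variance": statistics.variance(accuracies),
    }


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """Paired t statistic for two methods evaluated on the same folds
    (the paired design is an assumption, not stated in the summary)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-fold accuracies for illustration only.
proposed = [0.90, 0.80, 0.70]
baseline = [0.80, 0.60, 0.70]
stats_proposed = summarize(proposed)
t_value = paired_t_statistic(proposed, baseline)
```

The resulting t statistic would be compared against a t distribution with one fewer degree of freedom than the number of folds (9 for 10-fold cross-validation) to judge whether the accuracy difference is significant.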