Gene expression data classification using topology and machine learning models

Abstract. Background: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applicat...


Bibliographic Details
Main Authors: Tamal K. Dey, Sayan Mandal, Soham Mukherjee
Format: Article
Language: English
Published: BMC 2022-05-01
Series: BMC Bioinformatics
Subjects: Topological data analysis; Gene expression; Persistent cycles; Neural network
Online Access: https://doi.org/10.1186/s12859-022-04704-z
author Tamal K. Dey
Sayan Mandal
Soham Mukherjee
collection DOAJ
description Abstract. Background: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high-dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two respects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors, which are then used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes. Results: The topology-relevant curated data that we obtain provides an improvement in both shallow-learning and deep-learning-based supervised classification. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to capture gene expression levels and classify cohorts accordingly. Conclusions: In this work, we compute representative persistent cycles to interpret the gene expression data. These cycles allow us to directly identify genes involved in similar processes.
first_indexed 2024-04-13T18:55:00Z
format Article
id doaj.art-e98e2c2383c64e7691aed6411d222e3e
institution Directory of Open Access Journals
issn 1471-2105
language English
last_indexed 2024-04-13T18:55:00Z
publishDate 2022-05-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-e98e2c2383c64e7691aed6411d222e3e; 2022-12-22T02:34:18Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-05-01; 22; S10; 1; 22; 10.1186/s12859-022-04704-z; Gene expression data classification using topology and machine learning models; Tamal K. Dey (Department of Computer Science, Purdue University); Sayan Mandal (Department of Computer Science and Engineering, The Ohio State University); Soham Mukherjee (Department of Computer Science, Purdue University); https://doi.org/10.1186/s12859-022-04704-z; Topological data analysis; Gene expression; Persistent cycles; Neural network
title Gene expression data classification using topology and machine learning models
topic Topological data analysis
Gene expression
Persistent cycles
Neural network
url https://doi.org/10.1186/s12859-022-04704-z