Gene expression data classification using topology and machine learning models
Abstract. Background: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applicat...
Main Authors: | Tamal K. Dey, Sayan Mandal, Soham Mukherjee |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2022-05-01 |
Series: | BMC Bioinformatics |
Subjects: | Topological data analysis; Gene expression; Persistent cycles; Neural network |
Online Access: | https://doi.org/10.1186/s12859-022-04704-z |
_version_ | 1811341311727370240 |
---|---|
author | Tamal K. Dey Sayan Mandal Soham Mukherjee |
collection | DOAJ |
description | Abstract. Background: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high-dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors, which are then used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes. Results: The topology-relevant curated data that we obtain provides an improvement in both shallow-learning and deep-learning-based supervised classification. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly. Conclusions: In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes. |
first_indexed | 2024-04-13T18:55:00Z |
format | Article |
id | doaj.art-e98e2c2383c64e7691aed6411d222e3e |
institution | Directory of Open Access Journals |
issn | 1471-2105 |
language | English |
last_indexed | 2024-04-13T18:55:00Z |
publishDate | 2022-05-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-e98e2c2383c64e7691aed6411d222e3e; 2022-12-22T02:34:18Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-05-01; 22; S10; 1; 22; 10.1186/s12859-022-04704-z; Gene expression data classification using topology and machine learning models; Tamal K. Dey (Department of Computer Science, Purdue University); Sayan Mandal (Department of Computer Science and Engineering, The Ohio State University); Soham Mukherjee (Department of Computer Science, Purdue University); https://doi.org/10.1186/s12859-022-04704-z; Topological data analysis; Gene expression; Persistent cycles; Neural network |
title | Gene expression data classification using topology and machine learning models |
topic | Topological data analysis Gene expression Persistent cycles Neural network |
url | https://doi.org/10.1186/s12859-022-04704-z |
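
The abstract in this record refers to barcodes, the topological signatures used in traditional TDA pipelines. As a rough illustration of what a barcode is, the sketch below computes the 0-dimensional barcode (connected components) of a Vietoris-Rips filtration with a union-find structure. It is not the authors' method, which computes representative persistent cycles; the function name `h0_barcode` and the toy `samples` data are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return the 0-dimensional barcode of a Vietoris-Rips filtration.

    Each point is born as its own component at scale 0. A component dies at
    the length of the edge that merges it into another component, with edges
    processed in increasing order of length (as in Kruskal's algorithm).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, sorted so edges enter in filtration order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j       # merge the two components
            bars.append((0.0, length))    # one component dies at this scale
    if n:
        bars.append((0.0, math.inf))      # the last component never dies
    return bars


# Hypothetical toy data: four samples with two expression values each.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
barcode = h0_barcode(samples)
```

On this toy input the barcode contains two short bars (points merging within each pair), one longer bar that ends when the two pairs join, and one infinite bar. The longer bar is the kind of feature a barcode-based pipeline would read as evidence of two well-separated groups.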