Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

Current key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data include identifying known and novel cell types, representing cells, predicting cell fates, classifying tumor types, and studying cellular heterogeneity. D...

Bibliographic Details
Main Authors:	Soumita Seth, Saurav Mallik, Atikul Islam, Tapas Bhadra, Arup Roy, Pawan Kumar Singh, Aimin Li, Zhongming Zhao
Format:	Article
Language:	English
Published:	MDPI AG 2023-10-01
Series:	Mathematics
Subjects:	single-cell sequencing; gene signature; data imputation; feature selection; maximum relevance and minimum redundancy; shrinkage clustering
Online Access:	https://www.mdpi.com/2227-7390/11/20/4315

_version_	1827720559381708800
author	Soumita Seth, Saurav Mallik, Atikul Islam, Tapas Bhadra, Arup Roy, Pawan Kumar Singh, Aimin Li, Zhongming Zhao
author_sort	Soumita Seth
collection	DOAJ
description	Current key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data include identifying known and novel cell types, representing cells, predicting cell fates, classifying tumor types, and studying cellular heterogeneity. Owing to the nature of the data, cluster identification in high-dimensional single-cell sequencing data presents several difficulties. In this paper, we introduce a new framework that combines matrix imputation, minimum redundancy maximum relevance (MRMR) feature selection, and shrinkage clustering to discover gene signatures from scRNA-seq data. First, we pre-filtered the data to identify “drop-out” values and imputed only those values. Next, we applied MRMR feature selection to the imputed data and retained the top 100 features, ranked by MRMR optimization score, for downstream analysis. We then applied shrinkage clustering to the selected feature matrix to identify cell clusters using a global optimization approach. Finally, we used the Limma-Voom R tool, with voom normalization and an empirical Bayes test, to detect differentially expressed features at a false discovery rate (FDR) < 0.001. In addition, we performed KEGG pathway and gene ontology enrichment analysis of the identified biomarkers using DAVID 6.8. We also conducted miRNA target detection for the top gene markers and performed miRNA-target gene interaction network analysis using the Cytoscape online tool. We then compared our 100 detected markers with the top 100 cluster-specific markers, ranked by FDR, that we detected in our latest published article and found three common markers (Cyp2b10, Mt1, and Alpi) along with 97 novel markers. Gene Set Enrichment Analysis (GSEA) of both marker sets yielded similar outcomes. In a further comparison with another published method, our model detected more significant markers than that method. To assess the efficiency of our framework, we applied it to another dataset and identified 20 strongly significant up-regulated markers. Additionally, we compared different imputation methods and included an ablation study to show that every key phase of our framework is essential. In summary, our proposed integrated framework efficiently discovers strongly differentially expressed gene signatures as well as up-regulated markers in single-cell RNA sequencing data.
first_indexed	2024-03-10T21:05:11Z
format	Article
id	doaj.art-dad797dc43164655a041b73a925f7781
institution	Directory Open Access Journal
issn	2227-7390
language	English
last_indexed	2024-03-10T21:05:11Z
publishDate	2023-10-01
publisher	MDPI AG
record_format	Article
series	Mathematics
spelling	doaj.art-dad797dc43164655a041b73a925f7781; 2023-11-19T17:14:17Z; eng; MDPI AG; Mathematics; 2227-7390; 2023-10-01; 11; 20; 4315; 10.3390/math11204315; Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering; Soumita Seth (Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata 700150, West Bengal, India); Saurav Mallik (Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA 02115, USA); Atikul Islam (Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, West Bengal, India); Tapas Bhadra (Department of Computer Science and Engineering, Aliah University, Kolkata 700160, West Bengal, India); Arup Roy (Department of Computer Science and Engineering, Budge Budge Institute of Technology, Kolkata 700137, West Bengal, India); Pawan Kumar Singh (Department of Information Technology, Jadavpur University, Jadavpur University Second Campus, Plot No. 8, Salt Lake Bypass, LB Block, Sector III, Kolkata 700106, West Bengal, India); Aimin Li (Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China); Zhongming Zhao (Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA); https://www.mdpi.com/2227-7390/11/20/4315; single-cell sequencing; gene signature; data imputation; feature selection; maximum relevance and minimum redundancy; shrinkage clustering
title	Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering
topic	single-cell sequencing; gene signature; data imputation; feature selection; maximum relevance and minimum redundancy; shrinkage clustering
url	https://www.mdpi.com/2227-7390/11/20/4315

Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering
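
As a rough illustration of the MRMR feature selection step named in the record's description, the following Python sketch runs a greedy minimum redundancy maximum relevance selection. It is not the authors' implementation: the function name mrmr_select is hypothetical, class labels y are assumed to be available, and absolute Pearson correlation is swapped in for mutual information to keep the example short.

```python
import numpy as np


def mrmr_select(X, y, k):
    """Greedy mRMR (difference form): return indices of k columns of X (k >= 1).

    Assumption: relevance and redundancy are both measured with absolute
    Pearson correlation, standing in for mutual information.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]

    # Center and scale to unit norm so correlations become dot products.
    Xc = X - X.mean(axis=0)
    Xc /= np.linalg.norm(Xc, axis=0) + 1e-12
    yc = y - y.mean()
    yc /= np.linalg.norm(yc) + 1e-12

    relevance = np.abs(Xc.T @ yc)            # |corr(feature, label)|
    selected = [int(np.argmax(relevance))]   # start from the most relevant feature
    redundancy_sum = np.zeros(n_features)

    while len(selected) < min(k, n_features):
        last = selected[-1]
        # Accumulate |corr| of every feature with the most recently chosen one.
        redundancy_sum += np.abs(Xc.T @ Xc[:, last])
        # Score = relevance minus mean redundancy with the selected set.
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf            # never pick a feature twice
        selected.append(int(np.argmax(score)))
    return selected
```

Under these assumptions, calling mrmr_select(X, y, 100) on an imputed cells-by-genes matrix would return 100 column indices, analogous to the top 100 features retained in the description above.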

Similar Items