A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs

RNA-binding proteins (RBPs) play an important role in the synthesis and degradation of ribonucleic acid (RNA) molecules. The rapid and accurate identification of RBPs is essential for understanding the mechanisms of cell activity. Since identifying RBPs experimentally is expensive and time-consuming...

Bibliographic Details
Main Authors: Zhi-Sen Wei, Jun Rao, Yao-Jin Lin
Format: Article
Language: English
Published: MDPI AG 2023-07-01
Series: Applied Sciences
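Subjects: RNA-binding protein; convolution neural network; short peptide motifs; feature selection with LightGBM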
Online Access: https://www.mdpi.com/2076-3417/13/14/8231
author Zhi-Sen Wei
Jun Rao
Yao-Jin Lin
collection DOAJ
description RNA-binding proteins (RBPs) play an important role in the synthesis and degradation of ribonucleic acid (RNA) molecules. The rapid and accurate identification of RBPs is essential for understanding the mechanisms of cell activity. Since identifying RBPs experimentally is expensive and time-consuming, computational methods have been explored to predict RBPs directly from protein sequences. In this paper, we developed an RBP prediction method named CnnRBP based on a convolutional neural network. CnnRBP derives a sparse, high-dimensional di- and tripeptide frequency feature vector from a protein sequence and then reduces this vector to a low-dimensional one using the Light Gradient Boosting Machine (LightGBM) algorithm. The low-dimensional vectors derived from both RNA-binding proteins and non-RNA-binding proteins are then fed to a multi-layer one-dimensional convolutional network. The SMOTE algorithm is used to alleviate the class imbalance in the training data. Extensive experiments showed that the proposed method can extract discriminative features to identify RBPs effectively. With 10-fold cross-validation on the training datasets, CnnRBP achieved AUC values of 99.98%, 99.69% and 96.72% for humans, E. coli and Salmonella, respectively. On the three independent datasets, CnnRBP achieved AUC values of 0.91, 0.96 and 0.91, outperforming the recent tripeptide-based method (i.e., TriPepSVM) by 8%, 4% and 5%, respectively. Compared with the state-of-the-art CNN-based predictor (i.e., iDRBP_MMC), CnnRBP achieved MCC values of 0.67, 0.68 and 0.73, improvements of 6%, 6% and 15%, respectively. In addition, cross-species testing shows that CnnRBP generalizes robustly for RBP prediction between closely related species.
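The first step in the description above, turning a protein sequence into a sparse di- and tripeptide frequency vector, can be illustrated with a minimal sketch. This is not the authors' code: the function name `peptide_frequencies`, the fixed feature order, the use of relative frequencies rather than raw counts, and the skipping of windows that contain non-standard residues are all assumptions made for illustration. The later LightGBM feature reduction, SMOTE resampling and convolutional network are not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
RESIDUE_SET = set(AMINO_ACIDS)

# Assumed feature order: all 400 dipeptides, then all 8000 tripeptides.
VOCAB_BY_K = {
    k: ["".join(p) for p in product(AMINO_ACIDS, repeat=k)] for k in (2, 3)
}
VOCAB = VOCAB_BY_K[2] + VOCAB_BY_K[3]


def peptide_frequencies(sequence):
    """Return an 8400-dimensional list of relative di-/tripeptide frequencies.

    Windows containing a non-standard residue (e.g. 'X') are skipped; this
    handling is an assumption, not something the record specifies.
    """
    seq = sequence.upper()
    features = []
    for k in (2, 3):
        windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
        counts = Counter(w for w in windows if set(w) <= RESIDUE_SET)
        total = sum(counts.values())
        for kmer in VOCAB_BY_K[k]:
            # Normalise within each k so dipeptide and tripeptide parts each sum to 1.
            features.append(counts[kmer] / total if total else 0.0)
    return features


# Toy sequence (hypothetical, not taken from any dataset in the paper).
vec = peptide_frequencies("MKVLAAGMKV")
nonzero = sum(1 for v in vec if v > 0)
print(len(vec), nonzero)
```

Most entries are zero for any single protein, which is the sparsity the description refers to and the reason a feature-selection step is applied before the network.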
first_indexed 2024-03-11T01:21:10Z
format Article
id doaj.art-f8531a0688c94944a3def50656bd8b2f
institution Directory of Open Access Journals
issn 2076-3417
language English
last_indexed 2024-03-11T01:21:10Z
publishDate 2023-07-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-f8531a0688c94944a3def50656bd8b2f; 2023-11-18T18:10:08Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-07-01; vol. 13, no. 14, article 8231; DOI 10.3390/app13148231; Zhi-Sen Wei, Jun Rao, Yao-Jin Lin (all: School of Computer Science, Minnan Normal University, Zhangzhou 363000, China)
title A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs
topic RNA-binding protein
convolution neural network
short peptide motifs
feature selection with LightGBM
url https://www.mdpi.com/2076-3417/13/14/8231