A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs
RNA-binding proteins (RBPs) play an important role in the synthesis and degradation of ribonucleic acid (RNA) molecules. The rapid and accurate identification of RBPs is essential for understanding the mechanisms of cell activity. Since identifying RBPs experimentally is expensive and time-consuming...
Main Authors: | Zhi-Sen Wei, Jun Rao, Yao-Jin Lin |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2023-07-01 |
Series: | Applied Sciences |
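Subjects: | RNA-binding protein; convolution neural network; short peptide motifs; feature selection with LightGBM |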
Online Access: | https://www.mdpi.com/2076-3417/13/14/8231 |
_version_ | 1797590485303296000 |
---|---|
author | Zhi-Sen Wei Jun Rao Yao-Jin Lin |
author_facet | Zhi-Sen Wei Jun Rao Yao-Jin Lin |
author_sort | Zhi-Sen Wei |
collection | DOAJ |
description | RNA-binding proteins (RBPs) play an important role in the synthesis and degradation of ribonucleic acid (RNA) molecules. The rapid and accurate identification of RBPs is essential for understanding the mechanisms of cell activity. Since identifying RBPs experimentally is expensive and time-consuming, computational methods have been explored to predict RBPs directly from protein sequences. In this paper, we developed an RBP prediction method named CnnRBP based on a convolution neural network. CnnRBP derived a sparse high-dimensional di- and tripeptide frequency feature vector from a protein sequence and then reduced this vector to a low-dimensional one using the Light Gradient Boosting Machine (LightGBM) algorithm. Then, the low-dimensional vectors derived from both RNA-binding proteins and non-RNA-binding proteins were fed to a multi-layer one-dimensional convolution network. Meanwhile, the SMOTE algorithm was used to alleviate the class imbalance in the training data. Extensive experiments showed that the proposed method can extract discriminative features to identify RBPs effectively. With 10-fold cross-validation on the training datasets, CnnRBP achieved AUC values of 99.98%, 99.69% and 96.72% for humans, E. coli and Salmonella, respectively. On the three independent datasets, CnnRBP achieved AUC values of 0.91, 0.96 and 0.91, outperforming the recent tripeptide-based method (i.e., TriPepSVM) by 8%, 4% and 5%, respectively. Compared with the state-of-the-art CNN-based predictor (i.e., iDRBP_MMC), CnnRBP achieved MCC values of 0.67, 0.68 and 0.73 with significant improvements by 6%, 6% and 15%, respectively. In addition, the cross-species testing shows that CnnRBP has a robust generalization performance for cross-species RBP prediction between close species. |
first_indexed | 2024-03-11T01:21:10Z |
format | Article |
id | doaj.art-f8531a0688c94944a3def50656bd8b2f |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T01:21:10Z |
publishDate | 2023-07-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-f8531a0688c94944a3def50656bd8b2f; 2023-11-18T18:10:08Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-07-01; 13; 14; 8231; 10.3390/app13148231; A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs; Zhi-Sen Wei, Jun Rao, Yao-Jin Lin (all: School of Computer Science, Minnan Normal University, Zhangzhou 363000, China); https://www.mdpi.com/2076-3417/13/14/8231; RNA-binding protein; convolution neural network; short peptide motifs; feature selection with LightGBM |
spellingShingle | Zhi-Sen Wei Jun Rao Yao-Jin Lin A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs Applied Sciences RNA-binding protein convolution neural network short peptide motifs feature selection with LightGBM |
title | A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs |
title_full | A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs |
title_fullStr | A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs |
title_full_unstemmed | A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs |
title_short | A Deep Model for Species-Specific Prediction of Ribonucleic-Acid-Binding Protein with Short Motifs |
title_sort | deep model for species specific prediction of ribonucleic acid binding protein with short motifs |
topic | RNA-binding protein convolution neural network short peptide motifs feature selection with LightGBM |
url | https://www.mdpi.com/2076-3417/13/14/8231 |
work_keys_str_mv | AT zhisenwei adeepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs AT junrao adeepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs AT yaojinlin adeepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs AT zhisenwei deepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs AT junrao deepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs AT yaojinlin deepmodelforspeciesspecificpredictionofribonucleicacidbindingproteinwithshortmotifs |
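
The description field above mentions a sparse, high-dimensional di- and tripeptide frequency vector as the first step of CnnRBP. The following is a minimal sketch of how such a vector could be built, not the authors' implementation. The function names (`peptide_vocabulary`, `peptide_frequencies`), the 20-letter residue alphabet, and the per-length normalisation are all assumptions made for illustration; the record does not specify them. The later stages (LightGBM feature selection, SMOTE, the one-dimensional convolution network) are not shown.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_vocabulary(ks=(2, 3)):
    """List every di- and tripeptide in a fixed order (400 + 8000 = 8400)."""
    return ["".join(p) for k in ks for p in product(AMINO_ACIDS, repeat=k)]


def peptide_frequencies(sequence, ks=(2, 3)):
    """Return relative k-mer frequencies, normalised separately for each k.

    Hypothetical helper: windows containing non-standard residues are
    ignored, and an empty or too-short sequence yields all zeros.
    """
    sequence = sequence.upper()
    alphabet = set(AMINO_ACIDS)
    features = []
    for k in ks:
        # Slide a window of width k over the sequence.
        windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
        counts = Counter(w for w in windows if set(w) <= alphabet)
        total = sum(counts.values())
        for p in product(AMINO_ACIDS, repeat=k):
            features.append(counts["".join(p)] / total if total else 0.0)
    return features


if __name__ == "__main__":
    vocab = peptide_vocabulary()
    vec = peptide_frequencies("MKVLAAGIV")  # toy sequence, not real data
    nonzero = [(name, value) for name, value in zip(vocab, vec) if value > 0]
    print(len(vec), "features,", len(nonzero), "non-zero")
```

Even for a long protein, most of the 8400 entries stay at zero, which is why the abstract describes the vector as sparse and why a feature-selection step before the network is plausible.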