RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition
5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites...
Main Authors: | Ting Fang, Zizheng Zhang, Rui Sun, Lin Zhu, Jingjing He, Bei Huang, Yi Xiong, Xiaolei Zhu |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2019-12-01 |
Series: | Molecular Therapy: Nucleic Acids |
Online Access: | http://www.sciencedirect.com/science/article/pii/S216225311930304X |
_version_ | 1819211588876369920 |
---|---|
author | Ting Fang Zizheng Zhang Rui Sun Lin Zhu Jingjing He Bei Huang Yi Xiong Xiaolei Zhu |
author_sort | Ting Fang |
collection | DOAJ |
description | 5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites across the transcriptome are time-consuming and expensive, it is urgent to develop accurate computational methods to identify m5C sites effectively. A benchmark dataset is important for developing and evaluating computational methods. In this work, we constructed four different datasets according to the data redundancy and imbalance. Based on these datasets, we generated three different kinds of features, i.e., KNFs (K-nucleotide frequencies), KSNPFs (K-spaced nucleotide pair frequencies), and pseDNC (pseudo-dinucleotide composition), and then used a support vector machine (SVM) to build our models. Based on the imbalanced and nonredundant dataset, Met935, we extensively studied the three kinds of features and determined an optimal combination of the features. Based on the feature combination, we built models on the three different datasets and compared them with state-of-the-art models. According to the predictive results of the stringent jackknife test, the models based on the three features, 4NF, 1SNPF, and pseDNC, are superior or comparable to other methods. To determine the best model between the models based on the imbalanced dataset Met935 and the balanced dataset Met240, we further evaluated the two models on an independent test set Test1157. Our results demonstrate that the model based on the balanced dataset Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC) (0.154). In addition, the model is also superior to other state-of-the-art methods according to the integrated parameter MCC on the independent test set. Thus, we selected the model based on Met240 as our final model, which was named RNAm5CPred. In addition, a web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) has been provided to facilitate experimental research. Keywords: 5-methylcytosine site, post-transcriptional modification, support vector machine, nucleotide composition, prediction |
first_indexed | 2024-12-23T06:29:28Z |
format | Article |
id | doaj.art-b31a80349ed448a0a969dbbc36de7ce7 |
institution | Directory Open Access Journal |
issn | 2162-2531 |
language | English |
last_indexed | 2024-12-23T06:29:28Z |
publishDate | 2019-12-01 |
publisher | Elsevier |
record_format | Article |
series | Molecular Therapy: Nucleic Acids |
spelling | doaj.art-b31a80349ed448a0a969dbbc36de7ce7; 2022-12-21T17:56:58Z; eng; Elsevier; Molecular Therapy: Nucleic Acids; ISSN 2162-2531; 2019-12-01; volume 18, pages 739-747. RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition. Authors and affiliations: Ting Fang (School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China). Zizheng Zhang (School of Life Sciences, Anhui University, Hefei, Anhui 230601, China). Rui Sun (Beijing Baidu Netcom Sciences and Technology Co., Ltd., Beijing, China). Lin Zhu (School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China). Jingjing He (School of Life Sciences, Anhui University, Hefei, Anhui 230601, China). Bei Huang (School of Life Sciences, Anhui University, Hefei, Anhui 230601, China; corresponding author). Yi Xiong (State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China; corresponding author). Xiaolei Zhu (School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China; corresponding author). |
title | RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition |
url | http://www.sciencedirect.com/science/article/pii/S216225311930304X |
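
The description above names two count-based feature types, K-nucleotide frequencies (KNFs) and K-spaced nucleotide pair frequencies (KSNPFs). The following is a minimal sketch of how such features are commonly computed, not the authors' implementation. The function names, the `ALPHABET` constant, the example sequence, and the normalization by the number of windows are all assumptions made for illustration; the record does not specify them. pseDNC is left out because it depends on physicochemical property values that the record does not give.

```python
from itertools import product

# Assumption: sequences use the RNA alphabet; other characters are ignored.
ALPHABET = "ACGU"


def k_nucleotide_frequencies(seq, k):
    """Frequency of every possible k-mer, counted over overlapping windows.

    Returns 4**k values in lexicographic k-mer order. Dividing by the
    number of windows is an assumed normalization.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing non-ACGU characters
            counts[kmer] += 1
    total = max(n_windows, 1)  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def k_spaced_pair_frequencies(seq, k):
    """Frequency of nucleotide pairs separated by exactly k positions.

    Returns 16 values, one per ordered pair, in lexicographic order.
    With k = 1 the pair is (seq[i], seq[i + 2]).
    """
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_pairs = len(seq) - k - 1
    for i in range(n_pairs):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    total = max(n_pairs, 1)
    return [counts[p] / total for p in pairs]


if __name__ == "__main__":
    example = "ACGUACGUA"  # hypothetical sequence, not from any dataset
    features = (k_nucleotide_frequencies(example, 4)
                + k_spaced_pair_frequencies(example, 1))
    print(len(features))
```

Under these assumptions, concatenating the 4NF vector (256 values) with the 1SNPF vector (16 values) gives a 272-dimensional input, to which pseDNC values would be appended before training a classifier such as an SVM.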