RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition

Bibliographic Details
Main Authors: Ting Fang, Zizheng Zhang, Rui Sun, Lin Zhu, Jingjing He, Bei Huang, Yi Xiong, Xiaolei Zhu
Format: Article
Language: English
Published: Elsevier 2019-12-01
Series: Molecular Therapy: Nucleic Acids
Online Access: http://www.sciencedirect.com/science/article/pii/S216225311930304X
Collection: DOAJ
Description: 5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies have shown that m5C plays important roles in many biological functions, such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites across the transcriptome are time-consuming and expensive, accurate computational methods for identifying m5C sites are urgently needed. A benchmark dataset is important for developing and evaluating computational methods. In this work, we constructed four different datasets that differ in data redundancy and imbalance. Based on these datasets, we generated three different kinds of features, i.e., KNFs (K-nucleotide frequencies), KSNPFs (K-spaced nucleotide pair frequencies), and pseDNC (pseudo-dinucleotide composition), and then used a support vector machine (SVM) to build our models. Using the imbalanced and nonredundant dataset, Met935, we extensively studied the three kinds of features and determined an optimal combination of them. With this feature combination, we built models on the three different datasets and compared them with state-of-the-art models. According to the predictive results of the stringent jackknife test, the models based on the three features, 4NF, 1SNPF, and pseDNC, are superior or comparable to other methods. To choose between the model based on the imbalanced dataset Met935 and the model based on the balanced dataset Met240, we further evaluated the two models on an independent test set, Test1157. Our results demonstrate that the model based on the balanced dataset Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC) (0.154). In addition, the model is also superior to other state-of-the-art methods according to the integrated parameter MCC on the independent test set. Thus, we selected the model based on Met240 as our final model, which was named RNAm5CPred. A web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) has also been provided to facilitate experimental research.
Keywords: 5-methylcytosine site, post-transcriptional modification, support vector machine, nucleotide composition, prediction
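The frequency-based features and the MCC named in the description can be illustrated with a minimal sketch. This record does not give the paper's exact definitions, so the normalizations below (counts divided by the number of overlapping k-mers or spaced pairs in the window) are assumptions, as are the function names `knf`, `ksnpf`, and `mcc`. pseDNC is left out because it depends on physicochemical parameters not stated here.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # RNA alphabet; other symbols in a window are ignored


def knf(seq, k):
    """K-nucleotide frequencies (assumed definition):
    share of each k-mer among all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]


def ksnpf(seq, k):
    """K-spaced nucleotide pair frequencies (assumed definition):
    share of each pair (seq[i], seq[i + k + 1]), i.e. two
    nucleotides separated by k intervening positions."""
    total = len(seq) - k - 1
    if total <= 0:
        raise ValueError("sequence is too short for this spacing")
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(total):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts;
    returns 0.0 when the denominator is zero (a common convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical sequence window, used only to show the vector sizes.
window = "ACGUACGUCCGAUACGUAGCU"
features = knf(window, 4) + ksnpf(window, 1)  # 4NF (256) + 1SNPF (16)
```

Concatenating 4NF and 1SNPF in this way gives a 272-dimensional vector per window, which could then be passed to any SVM implementation. The description reports the MCC as the integrated measure used to compare models on the independent test set.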
ISSN: 2162-2531
Volume: 18
Pages: 739-747
Author Affiliations:
Ting Fang: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
Zizheng Zhang: School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
Rui Sun: Beijing Baidu Netcom Sciences and Technology Co., Ltd., Beijing, China
Lin Zhu: School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
Jingjing He: School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
Bei Huang (corresponding author): School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
Yi Xiong (corresponding author): State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
Xiaolei Zhu (corresponding author): School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China