Predicting RNA-Protein Interactions Using Only Sequence Information

<p>Abstract</p> <p>Background</p> <p>RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High throughput expe...

Full description

Bibliographic Details
Main Authors: Muppirala Usha K, Honavar Vasant G, Dobbs Drena
Format: Article
Language:English
Published: BMC 2011-12-01
Series:BMC Bioinformatics
Online Access:http://www.biomedcentral.com/1471-2105/12/489
Description
Summary:<p>Abstract</p> <p>Background</p> <p>RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High throughput experiments to identify RNA-protein interactions are beginning to provide valuable information about the complexity of RNA-protein interaction networks, but are expensive and time consuming. Hence, there is a need for reliable computational methods for predicting RNA-protein interactions.</p> <p>Results</p> <p>We propose <b><it>RPISeq</it></b>, a family of classifiers for predicting <b><it>R</it></b>NA-<b><it>p</it></b>rotein <b><it>i</it></b>nteractions using only <b><it>seq</it></b>uence information. Given the sequences of an RNA and a protein as input, <it>RPIseq </it>predicts whether or not the RNA-protein pair interact. The RNA sequence is encoded as a normalized vector of its ribonucleotide 4-mer composition, and the protein sequence is encoded as a normalized vector of its 3-mer composition, based on a 7-letter reduced alphabet representation. Two variants of <it>RPISeq </it>are presented: <it>RPISeq-SVM</it>, which uses a Support Vector Machine (SVM) classifier and <it>RPISeq-RF</it>, which uses a Random Forest classifier. On two non-redundant benchmark datasets extracted from the Protein-RNA Interface Database (PRIDB), <it>RPISeq </it>achieved an AUC (Area Under the Receiver Operating Characteristic (ROC) curve) of 0.96 and 0.92. On a third dataset containing only mRNA-protein interactions, the performance of <it>RPISeq </it>was competitive with that of a published method that requires information regarding many different features (e.g., mRNA half-life, GO annotations) of the putative RNA and protein partners. In addition, <it>RPISeq </it>classifiers trained using the PRIDB data correctly predicted the majority (57-99%) of non-coding RNA-protein interactions in NPInter-derived networks from <it>E. coli, S. cerevisiae, D. melanogaster, M. musculus</it>, and <it>H. sapiens</it>.</p> <p>Conclusions</p> <p>Our experiments with <it>RPISeq </it>demonstrate that RNA-protein interactions can be reliably predicted using only sequence-derived information. <it>RPISeq </it>offers an inexpensive method for computational construction of RNA-protein interaction networks, and should provide useful insights into the function of non-coding RNAs. <it>RPISeq </it>is freely available as a web-based server at <url>http://pridb.gdcb.iastate.edu/RPISeq/.</url></p>
ISSN:1471-2105