MS-kNN: protein function prediction by integrating multiple data sources

Bibliographic Details
Main Authors: Lan Liang, Djuric Nemanja, Guo Yuhong, Vucetic Slobodan
Format: Article
Language: English
Published: BMC 2013-02-01
Series: BMC Bioinformatics
collection DOAJ
description Abstract
Background: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds the k nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expression.
Results: We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN achieved a term-based Area Under the Curve (AUC) of 0.728 for Gene Ontology (GO) molecular function predictions when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. A similar result was observed for the prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-kNN accuracy was higher than that of the baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small.
Conclusions: Based on our results, we offer several insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
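The abstract describes the method only at a high level: find the k nearest neighbors of a query protein under each similarity measure, score functions by a weighted average over those neighbors, and aggregate across sources. The Python sketch below is one possible reading of that description, not the authors' implementation. It assumes that neighbor weights are the raw similarity scores normalized to sum to one, and that sources are combined by an unweighted mean over the sources that contain the query; the abstract specifies neither. The function names, protein IDs, and GO term labels are hypothetical.

```python
from typing import Dict, List, Set


def knn_scores(similarities: Dict[str, float],
               annotations: Dict[str, Set[str]],
               k: int) -> Dict[str, float]:
    """Score each term by a similarity-weighted average over the k most
    similar annotated training proteins (weights are an assumption)."""
    neighbors = sorted(
        (p for p in similarities if p in annotations and similarities[p] > 0),
        key=lambda p: similarities[p],
        reverse=True,
    )[:k]
    total = sum(similarities[p] for p in neighbors)
    scores: Dict[str, float] = {}
    if total == 0:
        return scores  # no usable neighbors in this source
    for p in neighbors:
        for term in annotations[p]:
            scores[term] = scores.get(term, 0.0) + similarities[p] / total
    return scores


def multi_source_scores(per_source_similarities: List[Dict[str, float]],
                        annotations: Dict[str, Set[str]],
                        k: int) -> Dict[str, float]:
    """Average per-source kNN scores (unweighted mean; an assumption)."""
    combined: Dict[str, float] = {}
    used = 0
    for sims in per_source_similarities:
        scores = knn_scores(sims, annotations, k)
        if not scores:
            continue  # skip sources that yield no scores for the query
        used += 1
        for term, score in scores.items():
            combined[term] = combined.get(term, 0.0) + score
    return {t: s / used for t, s in combined.items()} if used else {}


# Toy data: similarities of one query protein to three training proteins.
annotations = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:A"}, "P3": {"GO:C"}}
sequence_sims = {"P1": 0.6, "P2": 0.4, "P3": 0.1}
interaction_sims = {"P3": 1.0}

result = multi_source_scores([sequence_sims, interaction_sims], annotations, k=2)
for term in sorted(result):
    print(term, round(result[term], 3))
```

With k = 2, the sequence source alone never proposes GO:C, because P3 falls outside the two nearest neighbors; the term only enters the combined scores through the interaction source. This illustrates, on toy data, how a second source can add candidate functions that a single source misses.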
first_indexed 2024-12-20T00:07:46Z
format Article
id doaj.art-ed443511b3694526b1a0cd91dd1d7a11
institution Directory of Open Access Journals
issn 1471-2105
language English
last_indexed 2024-12-20T00:07:46Z
publishDate 2013-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-ed443511b3694526b1a0cd91dd1d7a11 2022-12-21T20:00:36Z eng BMC BMC Bioinformatics 1471-2105 2013-02-01 14 Suppl 3 S8 10.1186/1471-2105-14-S3-S8