The effects of shared information on semantic calculations in the gene ontology

The structured vocabulary that describes gene function, the gene ontology (GO), serves as a powerful tool in biological research. One application of GO in computational biology calculates semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of...

Bibliographic Details
Main Authors: Paul W. Bible, Hong-Wei Sun, Maria I. Morasso, Rasiah Loganantharaj, Lai Wei
Format: Article
Language: English
Published: Elsevier 2017-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037016300861
collection DOAJ
description The structured vocabulary that describes gene function, the gene ontology (GO), serves as a powerful tool in biological research. One application of GO in computational biology calculates semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of term similarity algorithms explicitly calculates the shared information (SI) between concepts and then substitutes this calculation into traditional term similarity measures such as Resnik, Lin, and Jiang-Conrath. Alternative SI approaches, when combined with ontology choice and term similarity type, lead to many gene-to-gene similarity measures. No thorough investigation has been made into the behavior, complexity, and performance of semantic methods derived from distinct SI approaches. We apply bootstrapping to compare the generalized performance of 57 gene-to-gene semantic measures across six benchmarks. Considering the number of measures, we additionally evaluate whether these methods can be leveraged through ensemble machine learning to improve prediction performance. Results showed that the choice of ontology type most strongly influenced performance across all evaluations. Combining measures into an ensemble classifier reduces cross-validation error beyond any individual measure for protein interaction prediction. This improvement resulted from information gained through the combination of ontology types, as ensemble methods within each GO type offered no improvement. These results demonstrate that multiple SI measures can be leveraged for machine learning tasks such as automated gene function prediction by incorporating methods from across the ontologies. To facilitate future research in this area, we developed the GO Graph Tool Kit (GGTK), an open-source C++ library with a Python interface (github.com/paulbible/ggtk).
Keywords: Semantic similarity, Gene ontology, Function prediction, Machine learning, Protein–protein interaction, Gene expression
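The substitution step mentioned in the description, where a shared information (SI) value is plugged into the Resnik, Lin, and Jiang-Conrath measures, can be illustrated with a minimal Python sketch. It does not use GGTK or reproduce the article's methods. The toy ontology, the term probabilities, and every function name below are hypothetical, and SI is assumed here to be the information content of the most informative common ancestor, which is only one of the possible SI approaches.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real GO terms.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}

# Hypothetical annotation probabilities p(term); the root has probability 1.
PROB = {"root": 1.0, "A": 0.5, "B": 0.5, "C": 0.25, "D": 0.125}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(PROB[term])


def shared_information(a, b):
    """Assumed SI: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(ic(t) for t in common)


def resnik(a, b):
    """Resnik similarity is the shared information itself."""
    return shared_information(a, b)


def lin(a, b):
    """Lin similarity: 2 * SI / (IC(a) + IC(b))."""
    denom = ic(a) + ic(b)
    if denom == 0:  # both terms are the root
        return 1.0
    return 2 * shared_information(a, b) / denom


def jiang_conrath_distance(a, b):
    """Jiang-Conrath distance: IC(a) + IC(b) - 2 * SI."""
    return ic(a) + ic(b) - 2 * shared_information(a, b)
```

Replacing `shared_information` with a different SI definition changes all three measures at once, which is how one SI approach can give rise to a family of term similarity measures.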
id doaj.art-cf978c9158a84c4b84660e06ff8dc1ef
institution Directory Open Access Journal
issn 2001-0370
spelling doaj.art-cf978c9158a84c4b84660e06ff8dc1ef; 2022-12-22T01:03:22Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2017-01-01; volume 15, pages 195-211
The effects of shared information on semantic calculations in the gene ontology
Paul W. Bible: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China (corresponding author)
Hong-Wei Sun: Biodata Mining and Discovery Section, Office of Science and Technology, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, Maryland
Maria I. Morasso: Laboratory of Skin Biology, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, Maryland
Rasiah Loganantharaj: Laboratory of Bioinformatics, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, Louisiana
Lai Wei: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China (corresponding author)