Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

Abstract. Background: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The difficulty is greatest for proteins that share very low primary or tert...

Bibliographic Details
Main Authors: Yasser B. Ruiz-Blanco, Guillermin Agüero-Chapin, Enrique García-Hernández, Orlando Álvarez, Agostinho Antunes, James Green
Format: Article
Language: English
Published: BMC 2017-07-01
Series: BMC Bioinformatics
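Subjects: Enzyme; Alignment-free protein analysis; Protein descriptors; Support vector machines; ProtDCal; TI2BioP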
Online Access: http://link.springer.com/article/10.1186/s12859-017-1758-x
author Yasser B. Ruiz-Blanco
Guillermin Agüero-Chapin
Enrique García-Hernández
Orlando Álvarez
Agostinho Antunes
James Green
collection DOAJ
description Abstract. Background: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The difficulty is greatest for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, to the identification of enzyme-like proteins. We also introduce, for the first time, our novel family of 3D-structure-based descriptors. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors because its proven structural diversity permits one to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed using a subset of formerly uncharacterized proteins that currently represent a benchmark annotation dataset.
Results: Four protein descriptor families, namely sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure-based (3D), were assessed using the D&D benchmark dataset. We show that only the families of ProtDCal's descriptors (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D-structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome.
Conclusions: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performances observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis place our methodology in the top range of methods to model and predict protein function using alignment-free approaches.
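The two ideas the abstract relies on, a sequence-composition (0D-style) encoding and the twilight-zone identity cutoff, can be illustrated with a minimal Python sketch. This is an assumption-laden illustration only: it is not ProtDCal's or TI2BioP's implementation, the function names are hypothetical, the features are plain amino-acid fractions rather than the descriptors used in the article, and percent identity is computed over the gap-free columns of two sequences that are assumed to be already aligned (other denominators are in common use).

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence: str) -> list[float]:
    """Fraction of each standard residue in the sequence (alignment-free, order-independent)."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither pre-aligned sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def in_twilight_zone(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """True when pairwise identity falls below the threshold (<30% in the abstract)."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

Feature vectors of this kind could be passed to any standard classifier, such as a support vector machine; the article's own descriptors and models are described in the full text.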
first_indexed 2024-04-13T12:26:40Z
format Article
id doaj.art-ff0a29516e424b4d854ec40d0e209808
institution Directory of Open Access Journals
issn 1471-2105
language English
last_indexed 2024-04-13T12:26:40Z
publishDate 2017-07-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-ff0a29516e424b4d854ec40d0e209808; 2022-12-22T02:46:59Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2017-07-01; vol. 18, issue 1, pp. 1-14; doi:10.1186/s12859-017-1758-x
Yasser B. Ruiz-Blanco (Facultad de Química y Farmacia, Universidad Central “Marta Abreu” de Las Villas)
Guillermin Agüero-Chapin (CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões)
Enrique García-Hernández (Instituto de Química, Universidad Nacional Autónoma de México (UNAM))
Orlando Álvarez (Centro de Bioactivos Químicos (CBQ), Universidad Central “Marta Abreu” de Las Villas (UCLV))
Agostinho Antunes (CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões)
James Green (Department of Systems and Computer Engineering, Carleton University)
title Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone
topic Enzyme
Alignment-free protein analysis
Protein descriptors
Support vector machines
ProtDCal
TI2BioP
url http://link.springer.com/article/10.1186/s12859-017-1758-x