PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes

Abstract Background Enzymes play an irreplaceable and important role in maintaining the lives of living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme is a hot t...


Bibliographic Details
Main Authors: Lei Chen, Chenyu Zhang, Jing Xu
Format: Article
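Language: English
Published: BMC 2024-01-01
Series: BMC Bioinformatics
Subjects: Enzymes; Family class; Multi-label classification; Functional domain; Dimension reduction; Random forest
Online Access: https://doi.org/10.1186/s12859-024-05665-1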
_version_ 1797273028946558976
author Lei Chen
Chenyu Zhang
Jing Xu
collection DOAJ
description Abstract Background Enzymes play an irreplaceable role in sustaining living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme has been a hot topic for the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, this representation leads to the curse of dimensionality, which reduces the efficiency of the methods. In addition, most previous methods can only deal with enzymes belonging to one family class, whereas several enzymes in fact belong to two or more family classes. Results In this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing the functional domain information of enzymes; it counts the distribution of each functional domain entry across the seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimensional vector by fusing its functional domain information with the above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, with random forest selected as the base classification algorithm. Two tenfold cross-validation runs on the training dataset showed that the accuracy of PredictEFC can reach 0.8493 and 0.8370. Independent tests on two datasets yielded accuracy values of 0.9118 and 0.8777. Conclusion The performance of PredictEFC was slightly lower than that of the classifier directly using functional domain composition. However, its efficiency was sharply improved: the running time was less than one-tenth of that of the classifier directly using functional domain composition. In addition, PredictEFC was superior to classifiers using traditional dimensionality reduction methods and to some previous methods, and it can be adapted for predicting enzyme family classes of other species. Finally, a web server available at http://124.221.158.221/ was set up for easy usage.
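The feature extraction scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how an enzyme's domain information is fused with the class statistics, so the averaging of per-domain class frequencies below is an assumption, not the authors' stated procedure. The function names and the domain identifiers (`D1`, `D2`, `D3`) are hypothetical. The resulting 7-dimensional vectors would then be passed to a multi-label learner such as RAKEL with random forest, which is not shown here.

```python
from collections import defaultdict

NUM_CLASSES = 7  # seven EC family classes (first EC digit 1..7)


def count_domain_distribution(train_domains, train_labels):
    """Count, per domain entry, how many training enzymes of each class carry it."""
    counts = defaultdict(lambda: [0] * NUM_CLASSES)
    for domains, labels in zip(train_domains, train_labels):
        for domain in set(domains):
            for label in labels:  # labels are 1-based family class numbers
                counts[domain][label - 1] += 1
    return dict(counts)


def encode_enzyme(domains, counts):
    """Encode one enzyme as a 7-dimensional vector.

    Assumption (not from the source): each domain's counts are normalised
    to class frequencies, and the frequencies are averaged over the
    enzyme's domains that were seen in the training data.
    """
    vector = [0.0] * NUM_CLASSES
    known = [d for d in set(domains) if d in counts]
    for domain in known:
        total = sum(counts[domain])
        for k in range(NUM_CLASSES):
            vector[k] += counts[domain][k] / total
    if known:
        vector = [v / len(known) for v in vector]
    return vector


# Toy training set: hypothetical domain IDs, multi-label class annotations.
train_domains = [["D1", "D2"], ["D2"], ["D3"]]
train_labels = [[1], [1, 3], [2]]

counts = count_domain_distribution(train_domains, train_labels)
print(encode_enzyme(["D1", "D2"], counts))
```

Whatever the number of distinct domain entries, the encoded vector always has seven components, which is the source of the efficiency gain the abstract reports over using functional domain composition directly.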
first_indexed 2024-03-07T14:37:40Z
format Article
id doaj.art-463781f25c6c49fca717d1e33dfc2821
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-03-07T14:37:40Z
publishDate 2024-01-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-463781f25c6c49fca717d1e33dfc2821; 2024-03-05T20:32:06Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2024-01-01; 25; 1; 1; 27; 10.1186/s12859-024-05665-1; PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes; Lei Chen (College of Information Engineering, Shanghai Maritime University); Chenyu Zhang (College of Information Engineering, Shanghai Maritime University); Jing Xu (College of Information Engineering, Shanghai Maritime University); https://doi.org/10.1186/s12859-024-05665-1; Enzymes; Family class; Multi-label classification; Functional domain; Dimension reduction; Random forest
title PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes
topic Enzymes
Family class
Multi-label classification
Functional domain
Dimension reduction
Random forest
url https://doi.org/10.1186/s12859-024-05665-1