Protein Interaction Prediction Method Based on Feature Engineering and XGBoost

Human protein interaction prediction occupies an important place in systems biology. Understanding human protein interaction networks and the interactome will provide important insights into the regulation of developmental, physiological and pathological processes. In this study, we propose...

Bibliographic Details
Main Authors: Zhao Xiaoman, Wang Xue
Format: Article
Language: English
Published: EDP Sciences 2023-01-01
Series: BIO Web of Conferences
Online Access: https://www.bio-conferences.org/articles/bioconf/pdf/2023/06/bioconf_fbse2023_01021.pdf
_version_ 1797776016161112064
author Zhao Xiaoman
Wang Xue
collection DOAJ
description Human protein interaction prediction occupies an important place in systems biology. Understanding human protein interaction networks and the interactome will provide important insights into the regulation of developmental, physiological and pathological processes. In this study, we propose a method based on feature engineering and ensemble learning algorithms to construct protein interaction prediction models. Human protein sequences were encoded as 174-dimensional vectors using Normalized Difference Sequence Feature (NDSF) encoding, and Principal Component Analysis (PCA) and Locally Linear Embedding (LLE) were each used to extract sequence features from these vectors. The classification performance of three ensemble learning methods (AdaBoost, ExtraTrees, XGBoost) applied to the PCA and LLE features was compared, and the best combination of parameters was found using cross-validation and grid search. The results show that classification accuracy is significantly higher with the linear dimensionality reduction method PCA than with the nonlinear method LLE. Classification with XGBoost achieves a model accuracy of 99.2%, the best performance among all models. This study suggests that NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions.
first_indexed 2024-03-12T22:43:54Z
format Article
id doaj.art-1a28512564474480ab0de8168c66e4bb
institution Directory Open Access Journal
issn 2117-4458
language English
last_indexed 2024-03-12T22:43:54Z
publishDate 2023-01-01
publisher EDP Sciences
record_format Article
series BIO Web of Conferences
spelling doaj.art-1a28512564474480ab0de8168c66e4bb; 2023-07-21T09:24:56Z; eng; EDP Sciences; BIO Web of Conferences; 2117-4458; 2023-01-01; 61; 01021; 10.1051/bioconf/20236101021; bioconf_fbse2023_01021; Protein Interaction Prediction Method Based on Feature Engineering and XGBoost; Zhao Xiaoman (0); Wang Xue (1); Institute of Intelligent Machinery, Hefei Institutes of Physical Science, Chinese Academy of Sciences; Institute of Intelligent Machinery, Hefei Institutes of Physical Science, Chinese Academy of Sciences; https://www.bio-conferences.org/articles/bioconf/pdf/2023/06/bioconf_fbse2023_01021.pdf
title Protein Interaction Prediction Method Based on Feature Engineering and XGBoost
url https://www.bio-conferences.org/articles/bioconf/pdf/2023/06/bioconf_fbse2023_01021.pdf
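
The workflow summarized in the description (reduce the encoded vectors with PCA, then choose parameters by grid search with cross-validation) can be sketched in miniature as follows. This is an illustrative sketch only, not the authors' implementation: the data are synthetic 6-dimensional vectors standing in for the 174-dimensional NDSF vectors, a nearest-centroid classifier is swapped in for XGBoost so that the example needs only the Python standard library, and every function name (make_data, pca_fit, cv_accuracy, grid_search) is hypothetical. It is not expected to reproduce the reported accuracy.

```python
import random

def make_data(n_per_class=60, dim=6, seed=0):
    """Synthetic two-class data (assumption: stands in for NDSF vectors)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 0.5) for _ in range(dim)]
            row[0] += 3.0 * label  # classes differ in the first two features
            row[1] += 3.0 * label
            X.append(row)
            y.append(label)
    return X, y

def mean_vec(X):
    return [sum(col) / len(X) for col in zip(*X)]

def pca_fit(X, k, iters=200):
    """Top-k principal axes via power iteration with deflation."""
    mu = mean_vec(X)
    Xc = [[v - m for v, m in zip(row, mu)] for row in X]
    d, n = len(mu), len(X)
    cov = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
           for i in range(d)]
    axes = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed, deterministic start
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for this axis.
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                  for i in range(d))
        axes.append(v)
        # Deflate: remove the found component before the next axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mu, axes

def pca_transform(X, mu, axes):
    return [[sum((v - m) * a for v, m, a in zip(row, mu, axis))
             for axis in axes] for row in X]

def centroid_fit(Z, y):
    return {c: mean_vec([z for z, t in zip(Z, y) if t == c]) for c in set(y)}

def centroid_predict(cents, z):
    return min(cents,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(z, cents[c])))

def cv_accuracy(X, y, n_components, n_folds=4):
    """Mean k-fold accuracy; PCA is refit on each training split."""
    idx = list(range(len(X)))
    scores = []
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        mu, axes = pca_fit(Xtr, n_components)
        cents = centroid_fit(pca_transform(Xtr, mu, axes), ytr)
        Zte = pca_transform([X[i] for i in test], mu, axes)
        hits = sum(centroid_predict(cents, z) == y[i]
                   for z, i in zip(Zte, test))
        scores.append(hits / len(test))
    return sum(scores) / n_folds

def grid_search(X, y, grid):
    """Score every candidate setting and return the best one."""
    results = {k: cv_accuracy(X, y, k) for k in grid}
    return max(results, key=results.get), results

X, y = make_data()
best_k, results = grid_search(X, y, [1, 2, 3])
print("best n_components:", best_k, "cv accuracy:", results[best_k])
```

In this sketch the dimensionality reduction is fitted inside each fold, so the held-out samples never influence the principal axes. With real data, the grid would typically also cover the classifier's own parameters.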