Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention

Abstract Background Protein–protein interactions (PPIs) are crucial in various biological functions and cellular processes. Thus, many computational approaches have been proposed to predict PPI sites. Although significant progress has been made, these methods still have limitations in encoding the c...


Bibliographic Details
Main Authors: Hanhan Cong, Hong Liu, Yi Cao, Cheng Liang, Yuehui Chen
Format: Article
Language: English
Published: BMC 2023-12-01
Series: BMC Bioinformatics
Subjects: Protein–protein interaction; Hybrid feature; Self-attention; Integration framework
Online Access: https://doi.org/10.1186/s12859-023-05592-7
author Hanhan Cong
Hong Liu
Yi Cao
Cheng Liang
Yuehui Chen
collection DOAJ
description Abstract. Background: Protein–protein interactions (PPIs) are crucial to many biological functions and cellular processes, and many computational approaches have been proposed to predict PPI sites. Despite significant progress, these methods remain limited in how they encode the characteristics of each amino acid in a sequence. Many feature extraction methods rely on the sliding window technique, which simply merges the features of all residues in the window into a single vector. The importance of key residues may therefore be diluted in the feature vector, leading to poor performance. Results: We propose a novel sequence-based method for PPI site prediction. The new network model, PPINet, contains multiple feature processing paths. For each residue, PPINet extracts the features of the target residue and of its context separately. These two types of features are processed by two paths in the network and combined to form a protein representation in which they carry roughly equal importance. Model ensembling is applied to make use of more features: the base models are trained with different features and then combined via stacking. In addition, a data balancing strategy is presented that significantly improves the model on highly unbalanced data. Conclusion: The proposed method is evaluated on a fused dataset constructed from Dset_186, Dset_72, and PDBset_164, as well as on the public Dset_448 dataset. It outperforms current state-of-the-art methods. On the most important metrics, AUPRC and recall, it surpasses the second-best method on the latter dataset by 6.9% and 4.7%, respectively. We also demonstrate that the improvement is mainly due to the ensemble model and, in particular, the hybrid feature. We share our code for reproducibility and future research at https://github.com/CandiceCong/StackingPPINet.
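The contrast the abstract draws between a conventional sliding window and separate target/context features can be illustrated with a minimal sketch. This is not PPINet's implementation: the function names `flat_window` and `split_window`, the zero padding at sequence ends, and the toy feature values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def flat_window(features: Sequence[Vector], i: int, half: int) -> Vector:
    """Conventional sliding window: merge every residue in the window
    (including the target) into one flat vector."""
    pad = [0.0] * len(features[0])  # assumed zero padding past the ends
    merged: Vector = []
    for j in range(i - half, i + half + 1):
        merged.extend(features[j] if 0 <= j < len(features) else pad)
    return merged


def split_window(features: Sequence[Vector], i: int, half: int) -> Tuple[Vector, Vector]:
    """Return the target residue's features and its context features
    separately, so each can be fed to its own processing path."""
    pad = [0.0] * len(features[0])  # assumed zero padding past the ends
    target = list(features[i])
    context: Vector = []
    for j in range(i - half, i + half + 1):
        if j == i:
            continue  # the target is kept out of the context vector
        context.extend(features[j] if 0 <= j < len(features) else pad)
    return target, context


# Toy per-residue features (two values each) for a four-residue sequence.
toy = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
target, context = split_window(toy, 0, 1)
```

In the flat vector the target residue occupies only one slot among `2 * half + 1`, whereas the split form keeps it as a separate input that a model can weight independently of the context.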
first_indexed 2024-03-09T01:14:53Z
format Article
id doaj.art-a4e4a11da08441f9860c26d5f75b848a
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-03-09T01:14:53Z
publishDate 2023-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-a4e4a11da08441f9860c26d5f75b848a 2023-12-10T12:33:41Z eng BMC BMC Bioinformatics 1471-2105 2023-12-01 24 1 1 21 10.1186/s12859-023-05592-7
Hanhan Cong (School of Information Science and Engineering, Shandong Normal University)
Hong Liu (School of Information Science and Engineering, Shandong Normal University)
Yi Cao (School of Information Science and Engineering, University of Jinan)
Cheng Liang (School of Information Science and Engineering, Shandong Normal University)
Yuehui Chen (School of Information Science and Engineering, University of Jinan)
title Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention
topic Protein–protein interaction
Hybrid feature
Self-attention
Integration framework
url https://doi.org/10.1186/s12859-023-05592-7