Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein–protein interactions

Abstract. Background: Protein–protein interactions (PPIs) are heavily involved in many biological processes. Consequently, the identification of PPIs in the model plant Arabidopsis is of great significance to deeply understand plant growth and development, and then to promote the basic research of cro...


Bibliographic Details
Main Authors: Kewei Zhou, Chenping Lei, Jingyan Zheng, Yan Huang, Ziding Zhang
Format: Article
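Language: English
Published: BMC 2023-12-01
Series: Plant Methods
Subjects: Arabidopsis; Protein–protein interactions; Machine learning; Pre-trained language model; Natural language processing
Online Access: https://doi.org/10.1186/s13007-023-01119-6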
author Kewei Zhou
Chenping Lei
Jingyan Zheng
Yan Huang
Ziding Zhang
collection DOAJ
description Abstract. Background: Protein–protein interactions (PPIs) are involved in many biological processes. Identifying PPIs in the model plant Arabidopsis is therefore important for understanding plant growth and development and for supporting basic research on crop improvement. Although many Arabidopsis PPIs have been determined experimentally, the known Arabidopsis interactome is far from complete. Effective machine learning models that predict unknown Arabidopsis PPIs quickly and conveniently from existing PPI data are therefore still urgently needed. Results: We used a large-scale pre-trained protein language model (pLM), ESM-1b, to convert protein sequences into high-dimensional vectors, which then served as the input to a multilayer perceptron (MLP). To avoid the performance overestimation that frequently occurs in PPI prediction, we trained and evaluated the predictive model on stringent datasets. The combination of ESM-1b and MLP (i.e., ESMAraPPI) was more accurate than predictive models built on other pLMs or on baseline sequence encoding schemes. In particular, ESMAraPPI yielded an AUPR value of 0.810 on an independent test set in which both proteins of each pair are unseen in the training dataset, suggesting strong generalization and extrapolation ability. Moreover, ESMAraPPI performed better than several state-of-the-art generic or plant-specific PPI predictors. Conclusion: Protein sequence embeddings from the pre-trained model ESM-1b contain rich protein semantic information. Combined with the MLP algorithm, ESM-1b showed excellent performance in predicting Arabidopsis PPIs. We anticipate that ESMAraPPI can serve as a very competitive tool to accelerate the identification of the Arabidopsis interactome.
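The evaluation setting described in the abstract (a test set in which both proteins of every pair are absent from training, scored by AUPR, the area under the precision-recall curve) can be illustrated with a small sketch. This is not the authors' code: the function names `strict_split` and `average_precision`, the placeholder protein IDs, and the choice of average precision as the AUPR estimate are all assumptions made for illustration, and the record does not say how the paper built its split or computed AUPR.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# (protein_a, protein_b, label); label 1 = interacting, 0 = non-interacting.
Pair = Tuple[str, str, int]


def strict_split(pairs: Iterable[Pair], held_out: Set[str]) -> Tuple[List[Pair], List[Pair]]:
    """Hypothetical helper: split pairs so no test protein appears in training."""
    train: List[Pair] = []
    test: List[Pair] = []
    for a, b, label in pairs:
        a_out, b_out = a in held_out, b in held_out
        if a_out and b_out:
            test.append((a, b, label))    # both proteins unseen in training
        elif not a_out and not b_out:
            train.append((a, b, label))   # both proteins available for training
        # Mixed pairs (one seen, one unseen) are dropped in this sketch.
    return train, test


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a common step-wise estimate of the area under the PR curve.

    Tied scores keep their input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank          # precision at this recall step
    return total / n_pos


# Placeholder IDs, not real Arabidopsis identifiers.
pairs = [("P1", "P2", 1), ("P1", "P3", 0), ("P4", "P5", 1),
         ("P2", "P4", 0), ("P5", "P6", 0)]
train, test = strict_split(pairs, held_out={"P4", "P5", "P6"})
```

Dropping the mixed pairs is what keeps the test set strictly disjoint from training at the protein level; a split that keeps them corresponds to an easier setting in which one partner has already been seen by the model.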
first_indexed 2024-03-09T01:17:57Z
format Article
id doaj.art-c93c607a091f43d595b2be87a4a34e29
institution Directory Open Access Journal
issn 1746-4811
language English
last_indexed 2024-03-09T01:17:57Z
publishDate 2023-12-01
publisher BMC
record_format Article
series Plant Methods
spelling doaj.art-c93c607a091f43d595b2be87a4a34e29 | 2023-12-10T12:20:25Z | eng | BMC | Plant Methods | 1746-4811 | 2023-12-01 | 191110 | 10.1186/s13007-023-01119-6 | Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein–protein interactions
Kewei Zhou (0), Chenping Lei (1), Jingyan Zheng (2), Yan Huang (3), Ziding Zhang (4)
Affiliation (0–4): State Key Laboratory of Animal Biotech Breeding, College of Biological Sciences, China Agricultural University
https://doi.org/10.1186/s13007-023-01119-6
title Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein–protein interactions
topic Arabidopsis
Protein–protein interactions
Machine learning
Pre-trained language model
Natural language processing
url https://doi.org/10.1186/s13007-023-01119-6