Integrating Embeddings from Multiple Protein Language Models to Improve Protein <i>O</i>-GlcNAc Site Prediction
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2023-11-01 |
Series: | International Journal of Molecular Sciences |
Subjects: | |
Online Access: | https://www.mdpi.com/1422-0067/24/21/16000 |
Summary: | <i>O</i>-linked β-<i>N</i>-acetylglucosamine (<i>O</i>-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. <i>O</i>-GlcNAc modification (i.e., <i>O</i>-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite great progress in experimentally mapping <i>O</i>-GlcNAc sites, there is an unmet need for robust prediction tools that can reliably locate <i>O</i>-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for predicting protein <i>O</i>-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of <i>O</i>-GlcNAc sites, and we also evaluated various ensemble strategies for integrating embeddings from these models. The decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on the individual language models, the other fusion approaches, and other existing predictors on almost all of the performance measures evaluated. Precise prediction of <i>O</i>-GlcNAc sites will facilitate the probing of <i>O</i>-GlcNAc site-specific functions of proteins in physiology and disease. Moreover, these findings indicate the effectiveness of combining multiple protein language models for post-translational modification prediction and open avenues for further research in other downstream protein tasks. LM-OGlcNAc-Site’s web server and source code are publicly available to the community. |
---|---|
ISSN: | 1661-6596 1422-0067 |
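
The decision-level fusion mentioned in the summary can be illustrated with a minimal sketch. The record does not say how LM-OGlcNAc-Site combines the three models' decisions, so this example assumes simple probability averaging (soft voting) with a 0.5 cutoff. The function name `fuse_decisions`, the threshold, and all probability values are hypothetical and are not taken from the paper or its source code.

```python
from statistics import mean
from typing import Dict, List, Sequence


def fuse_decisions(
    per_model_probs: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[bool]:
    """Average each site's probability across models, then apply a cutoff.

    Assumption: soft voting. The actual fusion rule used by
    LM-OGlcNAc-Site is not stated in this record.
    """
    lengths = {len(probs) for probs in per_model_probs.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same candidate sites")
    # zip(*...) groups the scores for one site across all models
    fused = [mean(site_scores) for site_scores in zip(*per_model_probs.values())]
    return [p >= threshold for p in fused]


# Hypothetical per-site probabilities for four candidate S/T residues,
# one list per classifier trained on a single pLM's embeddings.
example = {
    "Ankh": [0.9, 0.2, 0.6, 0.4],
    "ESM-2": [0.8, 0.1, 0.4, 0.7],
    "ProtT5": [0.7, 0.3, 0.3, 0.6],
}
calls = fuse_decisions(example)
print(calls)
```

Here each base classifier is trained separately on one model's embeddings and only the final outputs are combined, which is what distinguishes decision-level fusion from concatenating the embeddings before training a single classifier.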