Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction.

Accurate identification of protein function is critical for elucidating life mechanisms and designing new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from...

Bibliographic Details
Main Authors:	Yi-Heng Zhu, Chengxin Zhang, Dong-Jun Yu, Yang Zhang
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2022-12-01
Series:	PLoS Computational Biology
Online Access:	https://doi.org/10.1371/journal.pcbi.1010793

collection	DOAJ
description	Accurate identification of protein function is critical for elucidating life mechanisms and designing new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase in GO prediction accuracy over state-of-the-art approaches in all three aspects: molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in its use of pre-trained transformer language models, which can extract discriminative functional patterns from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, combining the network scores with complementary homology-based inferences was found to further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotation from sequence alone.
first_indexed	2024-04-10T16:06:00Z
format	Article
id	doaj.art-74f7eeda86db429fa0637f8c7d4cd378
institution	Directory of Open Access Journals
issn	1553-734X 1553-7358
language	English
last_indexed	2024-04-10T16:06:00Z
publishDate	2022-12-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS Computational Biology
title	Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction.
url	https://doi.org/10.1371/journal.pcbi.1010793

Similar Items