Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry

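Abstract: Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.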
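The record's description mentions a combined model that takes the individual model predictions as input. The following is a minimal sketch of that general idea (stacking), not the authors' implementation: it assumes each of four models outputs a probability vector over the 20 amino acids per site, uses synthetic stand-in predictions, and fits a plain softmax regression as the combiner. All function names, the combiner architecture, and the data are hypothetical.

```python
import numpy as np

N_AA = 20  # number of amino-acid classes


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def stack_features(per_model_probs):
    """Concatenate (n_sites, 20) arrays from each model into (n_sites, 20 * n_models)."""
    return np.concatenate(per_model_probs, axis=1)


def fit_combiner(X, y, epochs=200, lr=0.5):
    """Fit a softmax-regression combiner by full-batch gradient descent.

    Assumption: a linear combiner is used here only for illustration.
    Returns the weights, the bias, and the per-epoch cross-entropy loss.
    """
    n, d = X.shape
    W = np.zeros((d, N_AA))
    b = np.zeros(N_AA)
    Y = np.eye(N_AA)[y]  # one-hot targets
    losses = []
    for _ in range(epochs):
        P = softmax(X @ W + b)
        losses.append(-np.mean(np.log(P[np.arange(n), y] + 1e-12)))
        G = (P - Y) / n  # gradient of the loss w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b, losses


def predict(X, W, b):
    """Combined per-site probability distribution over amino acids."""
    return softmax(X @ W + b)


# Synthetic stand-ins for two sequence-based and two structure-based models.
rng = np.random.default_rng(0)
n_sites = 500
y = rng.integers(0, N_AA, size=n_sites)  # hypothetical wild-type residues


def fake_model(labels, strength):
    """Hypothetical model output: random logits nudged toward the true residue."""
    logits = rng.normal(size=(len(labels), N_AA))
    logits[np.arange(len(labels)), labels] += strength
    return softmax(logits)


per_model = [fake_model(y, s) for s in (1.0, 1.0, 1.5, 1.5)]
X = stack_features(per_model)
W, b, losses = fit_combiner(X, y)
combined_accuracy = np.mean(predict(X, W, b).argmax(axis=1) == y)
```

In practice the combiner would be fit on held-out sites rather than on the sites used to evaluate it, and the synthetic predictions would be replaced by the real per-site outputs of each model.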

Bibliographic Details
Main Authors:	Anastasiya V. Kulikova, Daniel J. Diaz, Tianlong Chen, T. Jeffrey Cole, Andrew D. Ellington, Claus O. Wilke
Format:	Article
Language:	English
Published:	Nature Portfolio 2023-08-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-023-40247-w

_version_	1827723338817994752
author	Anastasiya V. Kulikova; Daniel J. Diaz; Tianlong Chen; T. Jeffrey Cole; Andrew D. Ellington; Claus O. Wilke
author_sort	Anastasiya V. Kulikova
collection	DOAJ
description	Abstract Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
first_indexed	2024-03-10T21:59:49Z
format	Article
id	doaj.art-de2a0e4bd981453f829bc025b68f3554
institution	Directory of Open Access Journals
issn	2045-2322
language	English
last_indexed	2024-03-10T21:59:49Z
publishDate	2023-08-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj.art-de2a0e4bd981453f829bc025b68f3554 | 2023-11-19T13:01:04Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2023-08-01 | 13119 | 10.1038/s41598-023-40247-w | Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry | Anastasiya V. Kulikova (Department of Integrative Biology, University of Texas at Austin); Daniel J. Diaz (Department of Chemistry, The University of Texas at Austin); Tianlong Chen (Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin); T. Jeffrey Cole (Department of Integrative Biology, University of Texas at Austin); Andrew D. Ellington (The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin); Claus O. Wilke (Department of Integrative Biology, University of Texas at Austin) | https://doi.org/10.1038/s41598-023-40247-w
title	Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry
url	https://doi.org/10.1038/s41598-023-40247-w

Similar Items