Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment

Abstract: Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that a method called SPOT-1D-LM combines traditional on...

Bibliographic Details
Main Authors: Jaspreet Singh, Kuldip Paliwal, Thomas Litfin, Jaswinder Singh, Yaoqi Zhou
Format: Article
Language: English
Published: Nature Portfolio 2022-05-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-11684-w
collection DOAJ
description Abstract: Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that SPOT-1D-LM, a method that combines traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) as input, yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility, and contact numbers, for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM, and CASP14-FM). More significantly, its performance is comparable to that of profile-based methods for proteins with homologous sequences. For example, the accuracies of three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020), SS3 accuracy is 80.41% by SPOT-1D-LM, which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its fast performance. Moreover, high-accuracy prediction of both secondary and tertiary structural properties, such as backbone angles and solvent accessibility, without sequence alignment suggests that highly accurate prediction of protein structures may be made without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
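The input construction described in the abstract (one-hot encoding joined with embeddings from two language models) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the concatenation order, and the per-residue embedding widths (1024 for ProtTrans, 1280 for ESM-1b) are assumptions made for illustration, and random numbers stand in for real language-model output.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Assumed per-residue embedding widths (not stated in this record).
PROTTRANS_DIM = 1024
ESM1B_DIM = 1280


def one_hot(sequence):
    """Return one 20-dimensional one-hot vector per residue.

    Residues outside the 20 standard amino acids are left as all zeros.
    """
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows


def build_input_features(sequence, esm1b_emb, prottrans_emb):
    """Concatenate one-hot and both embeddings residue by residue."""
    if len(esm1b_emb) != len(sequence) or len(prottrans_emb) != len(sequence):
        raise ValueError("each embedding needs one row per residue")
    return [
        oh + list(e) + list(p)
        for oh, e, p in zip(one_hot(sequence), esm1b_emb, prottrans_emb)
    ]


def placeholder_embedding(length, dim, rng):
    """Random stand-in for a real language-model embedding (hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
seq = "MKTAYIAKQR"  # arbitrary example sequence
esm1b = placeholder_embedding(len(seq), ESM1B_DIM, rng)
prottrans = placeholder_embedding(len(seq), PROTTRANS_DIM, rng)
features = build_input_features(seq, esm1b, prottrans)
```

Under these assumptions each residue is represented by 20 + 1280 + 1024 = 2324 values, and the resulting matrix would be the input to a downstream predictor of secondary structure, torsion angles, or solvent accessibility.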
first_indexed 2024-04-12T16:46:17Z
format Article
id doaj.art-a1de07e4ee8b45c1b1d1fb4f88b98bf4
institution Directory Open Access Journal
issn 2045-2322
spelling doaj.art-a1de07e4ee8b45c1b1d1fb4f88b98bf4
2022-12-22T03:24:33Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2022-05-01
12119
10.1038/s41598-022-11684-w
Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment
Jaspreet Singh (Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University)
Kuldip Paliwal (Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University)
Thomas Litfin (Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University)
Jaswinder Singh (Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University)
Yaoqi Zhou (Institute for Glycomics, Griffith University)
https://doi.org/10.1038/s41598-022-11684-w