Protein embedding based alignment

Abstract. Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have been using substitution matrices since their creation...


Bibliographic Details
Main Authors: Benjamin Giovanni Iovino, Yuzhen Ye
Format: Article
Language: English
Published: BMC 2024-02-01
Series: BMC Bioinformatics
Subjects: Protein embedding; Protein sequence alignment; Smith-Waterman algorithm; Twilight zone
Online Access: https://doi.org/10.1186/s12859-024-05699-5
author Benjamin Giovanni Iovino
Yuzhen Ye
collection DOAJ
description Abstract. Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since their creation in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score for a pair of amino acids is based on the similarity of their embeddings from a protein language model. Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, to show how well PEbA aligns under different circumstances. Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality depending on the similarity of the sequence pair (over four times as well for pairs of sequences with <10% identity). We also compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods. Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
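The core idea in the abstract, a Smith-Waterman-style dynamic program whose match score comes from embedding similarity instead of a substitution matrix, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' PEbA code: the function names, the use of cosine similarity, the linear gap penalty, and the `scale` parameter are all assumptions, and the per-residue embeddings are taken as precomputed arrays rather than produced by ProtT5 or ESM-2.

```python
import numpy as np


def cosine_similarity_matrix(emb_a, emb_b):
    """Cosine similarity between every row of emb_a (n x d) and emb_b (m x d)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def local_align(seq_a, seq_b, emb_a, emb_b, gap=-1.0, scale=1.0):
    """Local alignment scored by embedding similarity (hypothetical helper).

    emb_a and emb_b hold one embedding vector per residue. The linear gap
    penalty and the scale factor are assumed values for illustration.
    Returns (best score, aligned seq_a segment, aligned seq_b segment).
    """
    sim = scale * cosine_similarity_matrix(emb_a, emb_b)
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))

    # Fill the scoring matrix; the floor at 0 makes the alignment local.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,
                H[i - 1, j - 1] + sim[i - 1, j - 1],  # align residue i with j
                H[i - 1, j] + gap,                    # gap in seq_b
                H[i, j - 1] + gap,                    # gap in seq_a
            )

    # Trace back from the highest-scoring cell until the score drops to 0.
    i, j = np.unravel_index(np.argmax(H), H.shape)
    best = float(H[i, j])
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i, j] > 0:
        if np.isclose(H[i, j], H[i - 1, j - 1] + sim[i - 1, j - 1]):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif np.isclose(H[i, j], H[i - 1, j] + gap):
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With real language-model embeddings, the same residue type can score differently depending on its sequence context, which a fixed substitution matrix such as BLOSUM cannot express. Gap penalties and any rescaling of the similarity values would need tuning against benchmark alignments.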
first_indexed 2024-03-07T14:37:27Z
format Article
id doaj.art-0924ae79c6104d58a44e5fb015f4ef49
institution Directory of Open Access Journals
issn 1471-2105
language English
last_indexed 2024-03-07T14:37:27Z
publishDate 2024-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-0924ae79c6104d58a44e5fb015f4ef49; 2024-03-05T20:31:52Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2024-02-01; 25; 1; 1; 16; 10.1186/s12859-024-05699-5; Protein embedding based alignment; Benjamin Giovanni Iovino (Luddy School of Informatics, Computing and Engineering, Indiana University); Yuzhen Ye (Luddy School of Informatics, Computing and Engineering, Indiana University); https://doi.org/10.1186/s12859-024-05699-5; Protein embedding; Protein sequence alignment; Smith-Waterman algorithm; Twilight zone
title Protein embedding based alignment
topic Protein embedding
Protein sequence alignment
Smith-Waterman algorithm
Twilight zone
url https://doi.org/10.1186/s12859-024-05699-5