Survey of Protein Sequence Embedding Models

Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which vary widely in length and amino acid composition, into fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
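The variant analyses summarized above reduce to comparing a mutant protein's fixed-size embedding with that of the reference protein via cosine similarity. Below is a minimal sketch of that idea in Python, assuming the fair-esm package and the Esm1b checkpoint named in the abstract; the embed() helper, the mean-pooling over residues, and the toy sequences are illustrative assumptions, not the authors' exact pipeline.

import torch
import esm  # fair-esm package (pip install fair-esm); assumed dependency

# Load the pretrained ESM-1b model and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed(name, sequence):
    # Hypothetical helper: mean-pool the final-layer (layer 33) per-residue
    # representations into one fixed-size, 1280-dimensional vector.
    _, _, tokens = batch_converter([(name, sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    # Skip the BOS token at position 0 and anything past the sequence (EOS/padding).
    return reps[0, 1:len(sequence) + 1].mean(dim=0)

# Toy sequences for illustration only, not real TEM-1; substitute the full
# reference and mutant sequences of interest.
reference = embed("ref", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
mutant = embed("mut", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA")

similarity = torch.nn.functional.cosine_similarity(reference, mutant, dim=0)
print(f"cosine similarity to reference: {similarity.item():.4f}")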

Bibliographic Details
Main Authors: Chau Tran, Siddharth Khadkikar, Aleksey Porollo
Format: Article
Language: English
Published: MDPI AG, 2023-02-01
Series: International Journal of Molecular Sciences
Subjects: deep learning; natural language processing; protein annotation; protein language model; protein sequence embedding; survey of embedding models
Online Access: https://www.mdpi.com/1422-0067/24/4/3775
ISSN: 1661-6596; 1422-0067
Author Affiliations: Chau Tran (Department of Computer Science, University of Cincinnati, Cincinnati, OH 45219, USA); Siddharth Khadkikar (Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA); Aleksey Porollo (Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA)
DOI: 10.3390/ijms24043775