Survey of Protein Sequence Embedding Models
Derived from natural language processing (NLP) algorithms, protein language models encode protein sequences, which vary widely in length and amino acid composition, as fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b...
Main Authors: | Chau Tran, Siddharth Khadkikar, Aleksey Porollo |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-02-01 |
Series: | International Journal of Molecular Sciences |
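Subjects: | deep learning; natural language processing; protein annotation; protein language model; protein sequence embedding; survey of embedding models |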
Online Access: | https://www.mdpi.com/1422-0067/24/4/3775 |
_version_ | 1797620470657318912 |
---|---|
author | Chau Tran, Siddharth Khadkikar, Aleksey Porollo |
collection | DOAJ |
description | Derived from natural language processing (NLP) algorithms, protein language models encode protein sequences, which vary widely in length and amino acid composition, as fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), on the following tasks in computational biology: embedding the <i>Saccharomyces cerevisiae</i> proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from <i>Escherichia coli</i> with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances, shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be shorter than 200 amino acids, contain fewer aspartates and glutamates, and are enriched for cysteine. Fewer than half of these proteins can be annotated with GO terms with high confidence. The distributions of cosine similarity scores between the reference human proteins and their benign versus pathogenic variants differ with statistical significance. The differences between the embeddings of the reference TEM-1 and its mutants show low to no correlation with minimal inhibitory concentrations (MIC). |
first_indexed | 2024-03-11T08:41:55Z |
format | Article |
id | doaj.art-6e2769a28be74d83a8cfe48876d9c323 |
institution | Directory Open Access Journal |
issn | 1661-6596 1422-0067 |
language | English |
last_indexed | 2024-03-11T08:41:55Z |
publishDate | 2023-02-01 |
publisher | MDPI AG |
record_format | Article |
series | International Journal of Molecular Sciences |
spelling | doaj.art-6e2769a28be74d83a8cfe48876d9c323; 2023-11-16T21:04:51Z; eng; MDPI AG; International Journal of Molecular Sciences; ISSN 1661-6596, 1422-0067; 2023-02-01; vol. 24, no. 4, art. 3775; doi:10.3390/ijms24043775; Survey of Protein Sequence Embedding Models; Chau Tran (Department of Computer Science, University of Cincinnati, Cincinnati, OH 45219, USA); Siddharth Khadkikar (Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA); Aleksey Porollo (Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA); https://www.mdpi.com/1422-0067/24/4/3775; deep learning; natural language processing; protein annotation; protein language model; protein sequence embedding; survey of embedding models |
title | Survey of Protein Sequence Embedding Models |
topic | deep learning; natural language processing; protein annotation; protein language model; protein sequence embedding; survey of embedding models |
url | https://www.mdpi.com/1422-0067/24/4/3775 |
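
The description above mentions fixed-size embeddings and cosine similarity scores between reference proteins and their variants. The following is a minimal sketch of that kind of comparison, not the authors' pipeline. It assumes per-residue vectors are averaged into one fixed-size vector (mean pooling is an assumption here; the record does not say how the models pool), and the toy numbers and the names `mean_pool` and `cosine_similarity` are hypothetical stand-ins for real model output.

```python
import math
from typing import List, Sequence


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size vector (assumed pooling)."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    n = len(per_residue)
    # zip(*rows) iterates over embedding dimensions (columns).
    return [sum(column) / n for column in zip(*per_residue)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy data: three residues, three embedding dimensions (hypothetical values).
reference = [[0.2, 0.1, -0.3], [0.4, -0.2, 0.1], [0.0, 0.5, 0.3]]
variant = [row[:] for row in reference]
variant[1] = [0.1, 0.3, 0.1]  # pretend a substitution changed residue 2

ref_vec = mean_pool(reference)
var_vec = mean_pool(variant)
score = cosine_similarity(ref_vec, var_vec)
print(f"cosine similarity to reference: {score:.3f}")
```

With real models, the nested lists would be replaced by the per-residue output of the chosen embedding model; because pooling removes the length dimension, sequences of different lengths end up as vectors of the same size and can be compared directly.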