Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification,...

Bibliographic Details
Main Authors: Jin Tao, Kelly A. Brayton, Shira L. Broschat
Format: Article
Language: English
Published: MDPI AG 2020-12-01
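Series: Applied Sciences
Subjects: natural language processing; protein annotation; deep learning; ensemble learning; word embedding
Online Access: https://www.mdpi.com/2076-3417/11/1/24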
_version_ 1827699471766519808
author Jin Tao
Kelly A. Brayton
Shira L. Broschat
author_facet Jin Tao
Kelly A. Brayton
Shira L. Broschat
author_sort Jin Tao
collection DOAJ
description Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
first_indexed 2024-03-10T13:51:26Z
format Article
id doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3
institution Directory of Open Access Journals
issn 2076-3417
language English
last_indexed 2024-03-10T13:51:26Z
publishDate 2020-12-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3
2023-11-21T02:06:06Z
eng
MDPI AG
Applied Sciences
2076-3417
2020-12-01
11
1
24
10.3390/app11010024
Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
Jin Tao
Kelly A. Brayton
Shira L. Broschat
School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA
Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
https://www.mdpi.com/2076-3417/11/1/24
natural language processing
protein annotation
deep learning
ensemble learning
word embedding
spellingShingle Jin Tao
Kelly A. Brayton
Shira L. Broschat
Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
Applied Sciences
natural language processing
protein annotation
deep learning
ensemble learning
word embedding
title Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_full Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_fullStr Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_full_unstemmed Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_short Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_sort automated confirmation of protein annotation using nlp and the uniprotkb database
topic natural language processing
protein annotation
deep learning
ensemble learning
word embedding
url https://www.mdpi.com/2076-3417/11/1/24