Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification,...
Main Authors: | Jin Tao, Kelly A. Brayton, Shira L. Broschat |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2020-12-01 |
Series: | Applied Sciences |
Subjects: | natural language processing; protein annotation; deep learning; ensemble learning; word embedding |
Online Access: | https://www.mdpi.com/2076-3417/11/1/24 |
_version_ | 1827699471766519808 |
---|---|
author | Jin Tao; Kelly A. Brayton; Shira L. Broschat |
author_facet | Jin Tao; Kelly A. Brayton; Shira L. Broschat |
author_sort | Jin Tao |
collection | DOAJ |
description | Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data. |
first_indexed | 2024-03-10T13:51:26Z |
format | Article |
id | doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T13:51:26Z |
publishDate | 2020-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3; 2023-11-21T02:06:06Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2020-12-01; volume 11, issue 1, article 24; 10.3390/app11010024; Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database; Jin Tao, Kelly A. Brayton, Shira L. Broschat (all: School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA); Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.; https://www.mdpi.com/2076-3417/11/1/24; natural language processing; protein annotation; deep learning; ensemble learning; word embedding |
spellingShingle | Jin Tao; Kelly A. Brayton; Shira L. Broschat; Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database; Applied Sciences; natural language processing; protein annotation; deep learning; ensemble learning; word embedding |
title | Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database |
title_full | Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database |
title_fullStr | Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database |
title_full_unstemmed | Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database |
title_short | Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database |
title_sort | automated confirmation of protein annotation using nlp and the uniprotkb database |
topic | natural language processing; protein annotation; deep learning; ensemble learning; word embedding |
url | https://www.mdpi.com/2076-3417/11/1/24 |
work_keys_str_mv | AT jintao automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase AT kellyabrayton automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase AT shiralbroschat automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase |
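
The recall, precision, and F1 values quoted in the description field can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below is only an illustrative consistency check; the function name `f1_score` and the variable names are hypothetical and do not come from the article or from any library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the record's description field, converted to fractions.
reported_precision = 0.6519
reported_recall = 0.9125

# Recompute F1 as a percentage rounded to two decimals for comparison
# with the figure stated in the description.
f1_percent = round(100 * f1_score(reported_precision, reported_recall), 2)
print(f1_percent)
```

The recomputed value can be compared with the F1 score stated in the description. The gap between high recall and lower precision reflects the stated design choice of favoring recall, with a human curator confirming the reported results.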