Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification,...

Bibliographic Details
Main Authors:	Jin Tao, Kelly A. Brayton, Shira L. Broschat
Format:	Article
Language:	English
Published:	MDPI AG 2020-12-01
Series:	Applied Sciences
Subjects:	natural language processing; protein annotation; deep learning; ensemble learning; word embedding
Online Access:	https://www.mdpi.com/2076-3417/11/1/24

_version_	1827699471766519808
author	Jin Tao, Kelly A. Brayton, Shira L. Broschat
author_facet	Jin Tao, Kelly A. Brayton, Shira L. Broschat
author_sort	Jin Tao
collection	DOAJ
description	Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
first_indexed	2024-03-10T13:51:26Z
format	Article
id	doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-10T13:51:26Z
publishDate	2020-12-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-4a2b6992715d4c23a80d1a5dc9c90ed3; 2023-11-21T02:06:06Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2020-12-01; 11; 1; 24; 10.3390/app11010024; Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database; Jin Tao (0); Kelly A. Brayton (1); Shira L. Broschat (2); (0) School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA; (1) School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA; (2) School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA; Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.; https://www.mdpi.com/2076-3417/11/1/24; natural language processing; protein annotation; deep learning; ensemble learning; word embedding
spellingShingle	Jin Tao; Kelly A. Brayton; Shira L. Broschat; Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database; Applied Sciences; natural language processing; protein annotation; deep learning; ensemble learning; word embedding
title	Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_full	Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_fullStr	Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_full_unstemmed	Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_short	Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database
title_sort	automated confirmation of protein annotation using nlp and the uniprotkb database
topic	natural language processing; protein annotation; deep learning; ensemble learning; word embedding
url	https://www.mdpi.com/2076-3417/11/1/24
work_keys_str_mv	AT jintao automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase AT kellyabrayton automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase AT shiralbroschat automatedconfirmationofproteinannotationusingnlpandtheuniprotkbdatabase
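
The recall, precision, and F1 figures quoted in the description field can be cross-checked against one another, since F1 is the harmonic mean of precision and recall. The following is a minimal sketch in Python, not code from the article; the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when the model returns no true positives.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the description field, converted from percentages.
precision = 0.6519
recall = 0.9125

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")
```

Running it prints `F1 = 76.05%`, which agrees with the F1 score reported in the description.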

Similar Items