Extract antibody and antigen names from biomedical literature

Abstract. Background: Antibodies and antigens play indispensable roles in targeted diagnosis, therapy, and biomedical discovery. Moreover, large numbers of new scientific articles about antibodies and/or antigens are published each year, which is a precious knowledge resource but has yet be...

Bibliographic Details
Main Authors: Thuy Trang Dinh, Trang Phuong Vo-Chanh, Chau Nguyen, Viet Quoc Huynh, Nam Vo, Hoang Duc Nguyen
Format: Article
Language: English
Published: BMC 2022-12-01
Series: BMC Bioinformatics
Subjects: Antibody, Antigen, Corpus, Named entity recognition, BioNLP, Semi-automatic annotation
Online Access: https://doi.org/10.1186/s12859-022-04993-4
author Thuy Trang Dinh
Trang Phuong Vo-Chanh
Chau Nguyen
Viet Quoc Huynh
Nam Vo
Hoang Duc Nguyen
collection DOAJ
description Abstract. Background: Antibodies and antigens play indispensable roles in targeted diagnosis, therapy, and biomedical discovery. Moreover, large numbers of new scientific articles about antibodies and/or antigens are published each year; they are a valuable knowledge resource that has yet to be exploited to its full potential. We therefore aim to develop a biomedical natural language processing tool that automatically identifies antibody and antigen entities in articles. Results: We first annotated an antibody-antigen corpus of 3210 relevant PubMed abstracts using a semi-automatic approach. The inter-annotator agreement among the 3 annotators ranges from 91.46% to 94.31%, indicating that the annotations are consistent and the corpus is reliable. We then used the corpus to develop and optimize BiLSTM-CRF-based and BioBERT-based models. The models achieved overall F1 scores of 62.49% and 81.44%, respectively, showing potential for recognizing newly studied entities. The two models served as the foundation for a named entity recognition (NER) tool that automatically recognizes antibody and antigen names in biomedical literature. Conclusions: Our antibody-antigen NER models enable users to automatically extract antibody and antigen names from scientific articles without manually scanning vast amounts of literature. The NER output can be used to automatically populate antibody-antigen databases, support antibody validation, and help researchers find the most appropriate antibodies of interest. The packaged NER model is available at https://github.com/TrangDinh44/ABAG_BioBERT.git.
first_indexed 2024-04-12T03:04:25Z
format Article
id doaj.art-da2e5a068cb747b4bb37b598f94ee0fd
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-12T03:04:25Z
publishDate 2022-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-da2e5a068cb747b4bb37b598f94ee0fd; 2022-12-22T03:50:33Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-12-01; vol. 23, no. 1, pp. 1-21; 10.1186/s12859-022-04993-4
Thuy Trang Dinh, Trang Phuong Vo-Chanh, Chau Nguyen, Viet Quoc Huynh, Nam Vo, Hoang Duc Nguyen (all: Center for Bioscience and Biotechnology, University of Science)
title Extract antibody and antigen names from biomedical literature
topic Antibody
Antigen
Corpus
Named entity recognition
BioNLP
Semi-automatic annotation
url https://doi.org/10.1186/s12859-022-04993-4
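
The description above reports overall F1 scores for the two NER models. As a rough illustration of how such a score is commonly computed, the sketch below calculates entity-level precision, recall, and F1 from BIO-tagged sequences. It is only a sketch under stated assumptions: the record does not say which tagging scheme or matching criterion the authors used, so the BIO scheme, the exact-match rule, the label names Antibody and Antigen, and the helper names bio_to_spans and entity_f1 are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type); label names are assumed
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    An I- tag that does not continue a span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None and start is not None:
                spans.add((start, i, label))
            start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]],
              pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match, entity-level precision, recall, and F1 over sentences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans with identical boundaries and type
        fp += len(p - g)   # predicted spans absent from the gold standard
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: the prediction finds the antibody mention but misses the antigen.
gold = [["B-Antibody", "I-Antibody", "O", "B-Antigen"]]
pred = [["B-Antibody", "I-Antibody", "O", "O"]]
print(entity_f1(gold, pred))
```

Exact-match scoring is strict: a predicted span that overlaps a gold span but differs by one token counts as both a false positive and a false negative. Relaxed matching would yield higher numbers, so scores are only comparable when the matching criterion is the same.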