Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases

The wide availability of clinical natural language documents, such as clinical narratives or diagnoses, calls for automatic systems to process and analyze them. However, the lack of annotated corpora in the biomedical domain, especially in languages other than English, m...

Bibliographic Details
Main Authors: Stefano Silvestri, Francesco Gargiulo, Mario Ciampi
Format: Article
Language: English
Published: MDPI AG 2022-06-01
Series: Applied Sciences
Subjects: biomedical NER; corpus annotation; distant supervision; active learning; deep learning
Online Access: https://www.mdpi.com/2076-3417/12/12/5775
_version_ 1827662773227618304
author Stefano Silvestri
Francesco Gargiulo
Mario Ciampi
collection DOAJ
description The wide availability of clinical natural language documents, such as clinical narratives or diagnoses, calls for automatic systems to process and analyze them. However, the lack of annotated corpora in the biomedical domain, especially in languages other than English, makes it difficult to apply state-of-the-art machine-learning systems to extract information from such documents. As a result, healthcare professionals miss significant opportunities that could arise from the analysis of these data. In this paper, we propose a methodology to reduce the manual effort needed to annotate a biomedical named entity recognition (B-NER) corpus. It combines active learning, based on deep learning models (e.g., Bi-LSTM, word2vec, FastText, ELMo and BERT), with distant supervision, based on biomedical knowledge bases, in order to speed up the annotation task and limit class imbalance issues. We assessed this approach by creating an Italian-language electronic health record corpus annotated with biomedical domain entities in a small fraction of the time required for fully manual annotation. The resulting corpus was used to train a B-NER deep neural network whose performance is comparable with the state of the art, with F1-scores of 0.9661 and 0.8875 on two test sets.
first_indexed 2024-03-10T00:32:39Z
format Article
id doaj.art-91949e0f0e4446a6894a2e7c6276c91d
institution Directory of Open Access Journals
issn 2076-3417
language English
last_indexed 2024-03-10T00:32:39Z
publishDate 2022-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-91949e0f0e4446a6894a2e7c6276c91d
2023-11-23T15:22:07Z
eng
MDPI AG
Applied Sciences
2076-3417
2022-06-01
Volume 12, issue 12, article 5775
DOI 10.3390/app12125775
Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases
Stefano Silvestri
Francesco Gargiulo
Mario Ciampi
Affiliation (all authors): Institute for High Performance Computing and Networking of National Research Council, ICAR-CNR, Via Pietro Castellino 111, 80131 Naples, Italy
title Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases
topic biomedical NER
corpus annotation
distant supervision
active learning
deep learning
url https://www.mdpi.com/2076-3417/12/12/5775