PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition

Named entity recognition (NER) in natural language processing encompasses three primary types: flat, nested, and discontinuous. While the flat type often garners attention from researchers, nested NER poses a significant challenge. Current approaches to addressing nested NER involve sequence labelin...

Bibliographic Details
Main Authors: Hongjian Yang, Qinghao Zhang, Hyuk-Chul Kwon
Format: Article
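Language: English
Published: MDPI AG 2024-02-01
Series: Applied Sciences
Subjects: nested entity; named entity recognition; NER; sequence labeling; text classification; merged label
Online Access: https://www.mdpi.com/2076-3417/14/5/1717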
author Hongjian Yang
Qinghao Zhang
Hyuk-Chul Kwon
collection DOAJ
description Named entity recognition (NER) in natural language processing covers three primary types: flat, nested, and discontinuous. Flat NER receives most of the research attention, while nested NER remains a significant challenge. Current approaches to nested NER include sequence labeling with merged label layers, cascaded models, and methods based on reading comprehension. Among these, sequence labeling with merged label layers stands out for its simplicity and ease of implementation, yet issues persist within this method, and we aim to improve its effectiveness. In this study, we extend the sequence labeling approach with a pipeline model split into a sequence labeling task and a text classification task. Rather than annotating specific entity categories, we merged the types into main and sub-categories for unified treatment. These categories were then embedded as identifiers in the recognition text for the text classification task. We used BERT+BiLSTM+CRF for sequence labeling and a BERT model for text classification. Experiments were conducted on three nested NER datasets, GENIA, CMeEE, and GermEval 2014, whose annotations range from four levels to two. Before model training, we ran separate statistical analyses of nested entities in the medical dataset CMeEE and the everyday-life dataset GermEval 2014. In both datasets, one entity category consistently dominated within nested entities, which suggests that labeling primary and subsidiary entities may be useful for category recognition. Model performance was evaluated by F1 score, counting a prediction as correct only when both the complete entity name and its category were identified. The proposed modifications substantially improved performance over the original method, and the improved model was competitive with existing models: F1 scores on GENIA, CMeEE, and GermEval 2014 reached 79.21, 66.71, and 87.81, respectively. The enhanced model thus preserves the original method's simplicity and ease of implementation while achieving higher performance.
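The evaluation criterion in the description (a prediction counts only when both the full entity span and its category match) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `(start, end, category)` tuple layout, the function name `strict_f1`, and the sample entities are assumptions made for illustration, and the sketch scores a single set of entities rather than aggregating over a corpus.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start offset, end offset, category).
Entity = Tuple[int, int, str]


def strict_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """F1 where a prediction is correct only if span AND category both match."""
    true_pos = len(gold & pred)  # exact tuple equality enforces both conditions
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative nested example: the span (0, 2) sits inside the span (0, 5).
gold = {(0, 2, "protein"), (0, 5, "DNA"), (7, 8, "cell_type")}
# The second prediction has the right span but the wrong category, so it fails.
pred = {(0, 2, "protein"), (0, 5, "RNA"), (7, 8, "cell_type")}

score = strict_f1(gold, pred)
print(round(score, 4))
```

Because entities are compared as whole tuples, a correct span with a wrong category is counted as both a false positive and a false negative, which is what makes the criterion strict.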
first_indexed 2024-04-25T00:35:46Z
format Article
id doaj.art-b8472369efc747d6b2e8758d9abe9aaa
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-04-25T00:35:46Z
publishDate 2024-02-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-b8472369efc747d6b2e8758d9abe9aaa
2024-03-12T16:38:38Z
eng
MDPI AG
Applied Sciences
2076-3417
2024-02-01
Volume 14, Issue 5, Article 1717
10.3390/app14051717
PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition
Hongjian Yang (Center for Artificial Intelligence Research, Pusan National University, Busan 46241, Republic of Korea)
Qinghao Zhang (Center for Artificial Intelligence Research, Pusan National University, Busan 46241, Republic of Korea)
Hyuk-Chul Kwon (Center for Artificial Intelligence Research, Pusan National University, Busan 46241, Republic of Korea)
https://www.mdpi.com/2076-3417/14/5/1717
nested entity
named entity recognition
NER
sequence labeling
text classification
merged label
title PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition
topic nested entity
named entity recognition
NER
sequence labeling
text classification
merged label
url https://www.mdpi.com/2076-3417/14/5/1717