On the data scarcity problem of neural-based named entity recognition

Bibliographic Details
Main Author: Zhou, Ran
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access:https://hdl.handle.net/10356/173481
_version_ 1811693050592755712
author Zhou, Ran
author2 Erik Cambria
author_facet Erik Cambria
Zhou, Ran
author_sort Zhou, Ran
collection NTU
description The data scarcity problem in neural-based Named Entity Recognition (NER) refers to the challenge of limited annotated data being available for training NER models. Collecting and annotating large amounts of labeled data for various languages and domains can be time-consuming, expensive, and sometimes even impractical. This lack of labeled data can hinder the performance of neural-based NER models, as they require a substantial number of annotated examples to learn effectively. With limited training data, neural-based NER models may struggle to generalize and to accurately identify unseen named entities in out-of-domain text or in a different language. They may also be prone to overfitting, where the model becomes too specific to the training data and fails to generalize to new data, reducing overall performance. Addressing the data scarcity problem in neural-based NER involves exploring alternative approaches that mitigate the impact of limited labeled data. Common strategies include data augmentation techniques such as word or entity replacement and synthetic data generation, as well as leveraging external resources such as knowledge bases or dictionaries. Many works focus on the common data-scarce scenario of cross-lingual NER, where training data exists in the source language but few or no annotations are available in the target language. For example, consistency training encourages the model's predictions to agree across different representations of the same input and can be used to improve the robustness and generalization of NER models across languages. Moreover, self-training has been applied to enhance the NER model's knowledge of the target language's linguistic characteristics and entity patterns by exploiting the abundant unlabeled text in the target language. In this thesis, we present our research on addressing the data scarcity problem of neural-based NER. Our contributions are as follows. First, we propose a novel data augmentation framework for low-resource NER that effectively improves entity diversity and alleviates the token-label misalignment problem, and is shown to be effective under monolingual, cross-lingual, and multilingual experimental settings. Second, we present a consistency training method for cross-lingual NER that propagates reliable supervision signals from the source language to the target language, aligns the representation spaces of the two languages, and alleviates overfitting on the source language. Evaluated on a range of cross-lingual transfer pairs, our method outperforms a variety of baseline methods. Finally, we introduce an improved self-training method for cross-lingual NER, where contrastive learning is utilized to facilitate classification and prototype learning is used to iteratively denoise pseudo-labeled target-language data. The proposed self-training method yields significant improvements over existing self-training methods and achieves state-of-the-art performance. In conclusion, we have shown that with effective data augmentation methods, consistency training frameworks, and an improved self-training scheme, the data scarcity problem in neural-based named entity recognition can be largely alleviated.
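To make the entity-replacement idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of entity-replacement augmentation for BIO-tagged NER data. The entity dictionary, function names, and example sentence are illustrative assumptions, not the actual framework proposed in the thesis; the sketch only shows how a replacement entity of a different length forces the label sequence to be rebuilt so that tokens and labels stay aligned (one way token-label misalignment can arise).

```python
import random

# Illustrative replacement pool; in practice this could come from a knowledge
# base or gazetteer, as the abstract's "external resources" suggests.
ENTITY_DICT = {
    "PER": [["Marie", "Curie"], ["Alan", "Turing"]],
    "LOC": [["Singapore"], ["New", "Zealand"]],
}

def find_entity_spans(labels):
    """Return (start, end, type) spans from a BIO label sequence."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        if label.startswith("B-") or label == "O":
            if start is not None:
                spans.append((start, i, labels[start][2:]))
                start = None
        if label.startswith("B-"):
            start = i
    return spans

def replace_entities(tokens, labels, rng=random):
    """Swap each entity for a same-type entry, rebuilding the BIO labels so the
    token and label sequences stay aligned even when the replacement entity
    has a different number of tokens than the original span."""
    new_tokens, new_labels, prev_end = [], [], 0
    for start, end, etype in find_entity_spans(labels):
        new_tokens += tokens[prev_end:start]
        new_labels += labels[prev_end:start]
        candidate = rng.choice(ENTITY_DICT.get(etype, [tokens[start:end]]))
        new_tokens += candidate
        new_labels += [f"B-{etype}"] + [f"I-{etype}"] * (len(candidate) - 1)
        prev_end = end
    new_tokens += tokens[prev_end:]
    new_labels += labels[prev_end:]
    return new_tokens, new_labels

tokens = ["Ran", "Zhou", "studies", "in", "Singapore"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(replace_entities(tokens, labels))
```

Each run produces a new token sequence with freshly generated, length-consistent labels, which is the property a low-resource NER augmenter needs before any of the thesis's more sophisticated machinery is applied.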
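The consistency-training idea can likewise be sketched generically. Below is a hedged PyTorch example of a symmetric KL consistency term between two stochastic forward passes over the same tokenized input (e.g. with different dropout masks); it is a generic consistency-regularization loss under those assumptions, and the tensor shapes and the `lambda_consistency` weight are illustrative, not the specific cross-lingual objective developed in the thesis.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_view_a: torch.Tensor, logits_view_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between token-level label distributions from two
    forward passes over the same input. Both tensors have shape
    (batch, seq_len, num_labels)."""
    log_p = F.log_softmax(logits_view_a, dim=-1)
    log_q = F.log_softmax(logits_view_b, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # where both arguments are log-probabilities.
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: 2 sentences, 5 tokens each, 9 BIO labels.
logits_a = torch.randn(2, 5, 9)
logits_b = logits_a + 0.1 * torch.randn_like(logits_a)
print(consistency_loss(logits_a, logits_b))

# In training, such a term would typically be added to the supervised tagging loss:
# total_loss = supervised_loss + lambda_consistency * consistency_loss(logits_a, logits_b)
```

Encouraging agreement between perturbed views of the same input is one simple instance of the prediction consistency described in the abstract; cross-lingual variants instead compare views built from source-language text and its target-language counterparts.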
first_indexed 2024-10-01T06:45:31Z
format Thesis-Doctor of Philosophy
id ntu-10356/173481
institution Nanyang Technological University
language English
last_indexed 2024-10-01T06:45:31Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/173481 2024-03-07T08:52:06Z On the data scarcity problem of neural-based named entity recognition Zhou, Ran Erik Cambria Miao Chun Yan School of Computer Science and Engineering ASCYMiao@ntu.edu.sg, cambria@ntu.edu.sg Computer and Information Science
Doctor of Philosophy 2024-02-07T05:22:01Z 2023 Thesis-Doctor of Philosophy Zhou, R. (2023). On the data scarcity problem of neural-based named entity recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173481 10.32657/10356/173481 en Alibaba Group through Alibaba Innovative Research (AIR) Program Alibaba-NTU Singapore Joint Research Institute (JRI) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
Zhou, Ran
On the data scarcity problem of neural-based named entity recognition
title On the data scarcity problem of neural-based named entity recognition
title_full On the data scarcity problem of neural-based named entity recognition
title_fullStr On the data scarcity problem of neural-based named entity recognition
title_full_unstemmed On the data scarcity problem of neural-based named entity recognition
title_short On the data scarcity problem of neural-based named entity recognition
title_sort on the data scarcity problem of neural based named entity recognition
topic Computer and Information Science
url https://hdl.handle.net/10356/173481
work_keys_str_mv AT zhouran onthedatascarcityproblemofneuralbasednamedentityrecognition