On the data scarcity problem of neural-based named entity recognition

Bibliographic Details
Main Author: Zhou, Ran
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access:https://hdl.handle.net/10356/173481
_version_ 1811693050592755712
author Zhou, Ran
author2 Erik Cambria
author_facet Erik Cambria
Zhou, Ran
author_sort Zhou, Ran
collection NTU
description The data scarcity problem in neural-based Named Entity Recognition (NER) refers to the challenge of limited annotated data being available for training NER models. Collecting and annotating large amounts of labeled data for various languages and domains can be time-consuming, expensive, and sometimes even impractical. This lack of labeled data can hinder the performance of neural-based NER models, as they require a substantial number of annotated examples to learn effectively. With limited training data, neural-based NER models may struggle to generalize and to accurately identify unseen named entities in out-of-domain text or in a different language. They may also be prone to overfitting, where the model becomes too specific to the training data and fails to generalize to new data, reducing overall performance. Addressing the data scarcity problem in neural-based NER involves exploring alternative approaches that mitigate the impact of limited labeled data. Common strategies include data augmentation techniques such as word or entity replacement and synthetic data generation, as well as leveraging external resources such as knowledge bases or dictionaries. Many works focus on the common data-scarce scenario of cross-lingual NER, where training data exists in the source language but few or no annotations are available in the target language. For example, consistency training encourages the model's predictions to agree across different representations of the same input and can be used to improve the robustness and generalization of NER models across languages. Moreover, self-training has been applied to enhance the NER model's knowledge of the target language's linguistic characteristics and entity patterns by exploiting the abundant unlabeled text in the target language. In this thesis, we present our research on addressing the data scarcity problem of neural-based NER. Our contributions are as follows. First, we propose a novel data augmentation framework for low-resource NER that effectively improves entity diversity and alleviates the token-label misalignment problem, and is shown to be effective under monolingual, cross-lingual, and multilingual experimental settings. Second, we present a consistency training method for cross-lingual NER that propagates reliable supervision signals from the source language to the target language, aligns the representation spaces of the two languages, and alleviates overfitting on the source language. Evaluated on a range of cross-lingual transfer pairs, our method outperforms a variety of baseline methods. Finally, we introduce an improved self-training method for cross-lingual NER, where contrastive learning is utilized to facilitate classification and prototype learning is used to iteratively denoise pseudo-labeled target-language data. The proposed self-training method yields significant improvements over existing self-training methods and achieves state-of-the-art performance. In conclusion, we have shown that with effective data augmentation methods, consistency training frameworks, and an improved self-training scheme, the data scarcity problem in neural-based named entity recognition can be largely alleviated.
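To make the entity-replacement idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of entity-replacement augmentation for BIO-tagged NER data. The entity dictionary, function names, and example sentence are illustrative assumptions, not the actual framework proposed in the thesis; the sketch only shows how a replacement entity of a different length forces the label sequence to be rebuilt so that tokens and labels stay aligned (one way token-label misalignment can arise).

```python
import random

# Illustrative replacement pool; in practice this could come from a knowledge
# base or gazetteer, as the abstract's "external resources" suggests.
ENTITY_DICT = {
    "PER": [["Marie", "Curie"], ["Alan", "Turing"]],
    "LOC": [["Singapore"], ["New", "Zealand"]],
}

def find_entity_spans(labels):
    """Return (start, end, type) spans from a BIO label sequence."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        if label.startswith("B-") or label == "O":
            if start is not None:
                spans.append((start, i, labels[start][2:]))
                start = None
        if label.startswith("B-"):
            start = i
    return spans

def replace_entities(tokens, labels, rng=random):
    """Swap each entity for a same-type entry, rebuilding the BIO labels so the
    token and label sequences stay aligned even when the replacement entity
    has a different number of tokens than the original span."""
    new_tokens, new_labels, prev_end = [], [], 0
    for start, end, etype in find_entity_spans(labels):
        new_tokens += tokens[prev_end:start]
        new_labels += labels[prev_end:start]
        candidate = rng.choice(ENTITY_DICT.get(etype, [tokens[start:end]]))
        new_tokens += candidate
        new_labels += [f"B-{etype}"] + [f"I-{etype}"] * (len(candidate) - 1)
        prev_end = end
    new_tokens += tokens[prev_end:]
    new_labels += labels[prev_end:]
    return new_tokens, new_labels

tokens = ["Ran", "Zhou", "studies", "in", "Singapore"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(replace_entities(tokens, labels))
```

Each run produces a new token sequence with freshly generated, length-consistent labels, which is the property a low-resource NER augmenter needs before any of the thesis's more sophisticated machinery is applied.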
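The consistency-training idea can likewise be sketched generically. Below is a hedged PyTorch example of a symmetric KL consistency term between two stochastic forward passes over the same tokenized input (e.g. with different dropout masks); it is a generic consistency-regularization loss under those assumptions, and the tensor shapes and the `lambda_consistency` weight are illustrative, not the specific cross-lingual objective developed in the thesis.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_view_a: torch.Tensor, logits_view_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between token-level label distributions from two
    forward passes over the same input. Both tensors have shape
    (batch, seq_len, num_labels)."""
    log_p = F.log_softmax(logits_view_a, dim=-1)
    log_q = F.log_softmax(logits_view_b, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # where both arguments are log-probabilities.
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: 2 sentences, 5 tokens each, 9 BIO labels.
logits_a = torch.randn(2, 5, 9)
logits_b = logits_a + 0.1 * torch.randn_like(logits_a)
print(consistency_loss(logits_a, logits_b))

# In training, such a term would typically be added to the supervised tagging loss:
# total_loss = supervised_loss + lambda_consistency * consistency_loss(logits_a, logits_b)
```

Encouraging agreement between perturbed views of the same input is one simple instance of the prediction consistency described in the abstract; cross-lingual variants instead compare views built from source-language text and its target-language counterparts.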
first_indexed 2024-10-01T06:45:31Z
format Thesis-Doctor of Philosophy
id ntu-10356/173481
institution Nanyang Technological University
language English
last_indexed 2024-10-01T06:45:31Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/173481 2024-03-07T08:52:06Z On the data scarcity problem of neural-based named entity recognition Zhou, Ran Erik Cambria Miao Chun Yan School of Computer Science and Engineering ASCYMiao@ntu.edu.sg, cambria@ntu.edu.sg Computer and Information Science
Doctor of Philosophy 2024-02-07T05:22:01Z 2023 Thesis-Doctor of Philosophy Zhou, R. (2023). On the data scarcity problem of neural-based named entity recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173481 10.32657/10356/173481 en Alibaba Group through Alibaba Innovative Research (AIR) Program Alibaba-NTU Singapore Joint Research Institute (JRI) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
Zhou, Ran
On the data scarcity problem of neural-based named entity recognition
title On the data scarcity problem of neural-based named entity recognition
title_full On the data scarcity problem of neural-based named entity recognition
title_fullStr On the data scarcity problem of neural-based named entity recognition
title_full_unstemmed On the data scarcity problem of neural-based named entity recognition
title_short On the data scarcity problem of neural-based named entity recognition
title_sort on the data scarcity problem of neural based named entity recognition
topic Computer and Information Science
url https://hdl.handle.net/10356/173481
work_keys_str_mv AT zhouran onthedatascarcityproblemofneuralbasednamedentityrecognition