Advancing low resource information extraction and dialogue system using data efficient methods

This thesis presents an extensive study aimed at improving the efficacy of language models in low-resource settings, a prevalent challenge in natural language processing (NLP). The research emphasizes the development and refinement of data-efficient methods, which are essential for making language models robust and functional when training data is scarce.

At the heart of the thesis is the investigation of novel data augmentation approaches designed to enrich the training dataset. These include the creation of synthetic data through advanced generation algorithms, which produce realistic and varied linguistic examples without requiring manual annotation, as well as techniques for semantic data transformation that modify existing data in semantically meaningful ways, exposing models to a more diverse range of linguistic structures and contexts.

The research also examines how these augmentation methods improve language models' resilience to overfitting, a frequent issue in low-resource settings. By diversifying and enriching the training data, the models generalize better and perform more reliably on new, unseen data.

Further, the thesis explores the integration of these augmentation techniques with current NLP models, highlighting the synergistic benefits of combining data enrichment methods with state-of-the-art language models. This integration not only increases model robustness but also broadens applicability to a wider array of languages and dialects, especially those with sparse data.

Moreover, in the era of Large Language Models (LLMs), the thesis explores algorithms that leverage LLMs' intrinsic ability to comprehend and generate contextually appropriate augmentations, enriching training data while maintaining its quality.

The empirical results demonstrate the effectiveness of the proposed techniques, showing substantial gains in accuracy, robustness, and generalization across a range of NLP tasks, including sentiment analysis, named entity recognition, part-of-speech tagging, relation extraction, and task-oriented dialogue systems.

In summary, the thesis contributes data-efficient methods that strengthen language models in low-resource scenarios. The findings and methodologies pave the way for future work on language-model robustness and extend the reach of NLP technologies to a broader spectrum of languages and applications. The thesis concludes by identifying several promising avenues for future research in this domain.
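
As a purely illustrative sketch of the kind of label-preserving transformation described above (and not of the specific algorithms developed in the thesis), the Python snippet below swaps non-entity tokens of a NER-style example for synonyms from a small hand-written table, so the original annotations stay valid. The synonym table, the replacement probability, and the BIO tags are assumptions made for this example only.

# Toy, label-preserving augmentation for token-classification data (e.g. NER).
# The synonym table and replacement policy are illustrative stand-ins only.
import random

# A tiny hand-written synonym table; a real system might use a thesaurus,
# embeddings, or a generative model instead.
SYNONYMS = {
    "visited": ["toured", "stopped by"],
    "company": ["firm", "business"],
    "bought": ["purchased", "acquired"],
}

def augment(tokens, labels, p=0.3):
    """Return a new (tokens, labels) pair in which some non-entity tokens
    (label 'O') are replaced by synonyms; entity tokens are left untouched
    so the original annotations remain valid."""
    new_tokens = []
    for token, label in zip(tokens, labels):
        candidates = SYNONYMS.get(token.lower())
        if label == "O" and candidates and random.random() < p:
            new_tokens.append(random.choice(candidates))
        else:
            new_tokens.append(token)
    return new_tokens, list(labels)

tokens = ["Alice", "visited", "the", "company", "in", "Singapore"]
labels = ["B-PER", "O", "O", "O", "O", "B-LOC"]
print(augment(tokens, labels, p=1.0))
# One possible output:
# (['Alice', 'toured', 'the', 'firm', 'in', 'Singapore'],
#  ['B-PER', 'O', 'O', 'O', 'O', 'B-LOC'])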

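The LLM-based augmentation mentioned in the abstract can be pictured in a similarly hedged way: prompt a model for label-preserving paraphrases and add them to the training set. In the sketch below, the prompt wording, the label format, and the complete callback are illustrative assumptions rather than the prompts or pipelines used in the thesis; any real LLM client can be wired in as the callback, and in practice the generated examples would be filtered for quality before training.

# Minimal sketch: prompt an LLM for label-preserving paraphrases of a
# sentiment example. `complete` is a placeholder callback for any LLM client.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Rewrite the sentence below in {n} different ways. Keep the meaning and "
    "the sentiment label '{label}' unchanged. Return one rewrite per line.\n\n"
    "Sentence: {text}"
)

def augment_with_llm(text: str, label: str,
                     complete: Callable[[str], str], n: int = 3) -> List[dict]:
    """Ask the model for n paraphrases and package them as new training examples."""
    reply = complete(PROMPT_TEMPLATE.format(n=n, label=label, text=text))
    paraphrases = [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]
    return [{"text": p, "label": label} for p in paraphrases[:n]]

# Demo with a dummy backend, just to show the shape of the output:
fake_complete = lambda _prompt: "The service was excellent.\nI loved the service."
print(augment_with_llm("The service was great.", "positive", fake_complete, n=2))
# [{'text': 'The service was excellent.', 'label': 'positive'},
#  {'text': 'I loved the service.', 'label': 'positive'}]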

Bibliographic Details
Main Author: Ding, Bosheng
Other Authors: Joty Shafiq Rayhan; Luu Anh Tuan; Miao Chun Yan
School: School of Computer Science and Engineering
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Natural language processing; Large language models; Artificial intelligence; Machine learning
Online Access: https://hdl.handle.net/10356/179560
DOI: 10.32657/10356/179560
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Citation: Ding, B. (2024). Advancing low resource information extraction and dialogue system using data efficient methods. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179560