Advancing low resource information extraction and dialogue system using data efficient methods

This thesis presents an extensive study aimed at improving the efficacy of language models in low-resource settings, a prevalent challenge in natural language processing (NLP). The research emphasizes the development and refinement of data-efficient methods, which are essential for making language models robust and functional when training data is scarce.

At the heart of the thesis is the investigation of novel data augmentation approaches designed to enrich the training dataset. These include the creation of synthetic data through advanced generation algorithms, which produce realistic and varied linguistic examples without requiring manual annotation, as well as techniques for semantic data transformation that modify existing data in semantically meaningful ways, exposing models to a more diverse range of linguistic structures and contexts.

The research also examines how these augmentation methods improve language models' resilience to overfitting, a frequent issue in low-resource settings. By diversifying and enriching the training data, the models generalize better and perform more reliably on new, unseen data.

Further, the thesis explores the integration of these augmentation techniques with current NLP models, highlighting the synergistic benefits of combining data enrichment methods with state-of-the-art language models. This integration not only increases model robustness but also broadens applicability to a wider array of languages and dialects, especially those with sparse data.

Moreover, in the era of Large Language Models (LLMs), the thesis explores algorithms that leverage LLMs' intrinsic ability to comprehend and generate contextually appropriate augmentations, enriching training data while maintaining its quality.

The empirical results demonstrate the effectiveness of the proposed techniques, showing substantial gains in accuracy, robustness, and generalization across a range of NLP tasks, including sentiment analysis, named entity recognition, part-of-speech tagging, relation extraction, and task-oriented dialogue systems.

In summary, the thesis contributes data-efficient methods that strengthen language models in low-resource scenarios. The findings and methodologies pave the way for future work on language-model robustness and extend the reach of NLP technologies to a broader spectrum of languages and applications. The thesis concludes by identifying several promising avenues for future research in this domain.
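
As a purely illustrative sketch of the kind of label-preserving transformation described above (and not of the specific algorithms developed in the thesis), the Python snippet below swaps non-entity tokens of a NER-style example for synonyms from a small hand-written table, so the original annotations stay valid. The synonym table, the replacement probability, and the BIO tags are assumptions made for this example only.

# Toy, label-preserving augmentation for token-classification data (e.g. NER).
# The synonym table and replacement policy are illustrative stand-ins only.
import random

# A tiny hand-written synonym table; a real system might use a thesaurus,
# embeddings, or a generative model instead.
SYNONYMS = {
    "visited": ["toured", "stopped by"],
    "company": ["firm", "business"],
    "bought": ["purchased", "acquired"],
}

def augment(tokens, labels, p=0.3):
    """Return a new (tokens, labels) pair in which some non-entity tokens
    (label 'O') are replaced by synonyms; entity tokens are left untouched
    so the original annotations remain valid."""
    new_tokens = []
    for token, label in zip(tokens, labels):
        candidates = SYNONYMS.get(token.lower())
        if label == "O" and candidates and random.random() < p:
            new_tokens.append(random.choice(candidates))
        else:
            new_tokens.append(token)
    return new_tokens, list(labels)

tokens = ["Alice", "visited", "the", "company", "in", "Singapore"]
labels = ["B-PER", "O", "O", "O", "O", "B-LOC"]
print(augment(tokens, labels, p=1.0))
# One possible output:
# (['Alice', 'toured', 'the', 'firm', 'in', 'Singapore'],
#  ['B-PER', 'O', 'O', 'O', 'O', 'B-LOC'])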

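The LLM-based augmentation mentioned in the abstract can be pictured in a similarly hedged way: prompt a model for label-preserving paraphrases and add them to the training set. In the sketch below, the prompt wording, the label format, and the complete callback are illustrative assumptions rather than the prompts or pipelines used in the thesis; any real LLM client can be wired in as the callback, and in practice the generated examples would be filtered for quality before training.

# Minimal sketch: prompt an LLM for label-preserving paraphrases of a
# sentiment example. `complete` is a placeholder callback for any LLM client.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Rewrite the sentence below in {n} different ways. Keep the meaning and "
    "the sentiment label '{label}' unchanged. Return one rewrite per line.\n\n"
    "Sentence: {text}"
)

def augment_with_llm(text: str, label: str,
                     complete: Callable[[str], str], n: int = 3) -> List[dict]:
    """Ask the model for n paraphrases and package them as new training examples."""
    reply = complete(PROMPT_TEMPLATE.format(n=n, label=label, text=text))
    paraphrases = [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]
    return [{"text": p, "label": label} for p in paraphrases[:n]]

# Demo with a dummy backend, just to show the shape of the output:
fake_complete = lambda _prompt: "The service was excellent.\nI loved the service."
print(augment_with_llm("The service was great.", "positive", fake_complete, n=2))
# [{'text': 'The service was excellent.', 'label': 'positive'},
#  {'text': 'I loved the service.', 'label': 'positive'}]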

Bibliographic Details
Main Author: Ding, Bosheng
Other Authors: Joty Shafiq Rayhan; Luu Anh Tuan; Miao Chun Yan
School: School of Computer Science and Engineering
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Natural language processing; Large language models; Artificial intelligence; Machine learning
Online Access: https://hdl.handle.net/10356/179560
DOI: 10.32657/10356/179560
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Citation: Ding, B. (2024). Advancing low resource information extraction and dialogue system using data efficient methods. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179560