Advancing low resource information extraction and dialogue system using data efficient methods
This thesis presents an extensive study aimed at improving the efficacy of language models in situations characterized by limited data resources, a prevalent challenge in the field of natural language processing (NLP). The research emphasizes the development and refinement of data-efficient methods,...
Main Author: | Ding, Bosheng |
---|---|
Other Authors: | Joty Shafiq Rayhan |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Natural language processing; Large language models; Artificial intelligence; Machine learning |
Online Access: | https://hdl.handle.net/10356/179560 |
_version_ | 1826117813501165568 |
---|---|
author | Ding, Bosheng |
author2 | Joty Shafiq Rayhan |
author_facet | Joty Shafiq Rayhan Ding, Bosheng |
author_sort | Ding, Bosheng |
collection | NTU |
description | This thesis presents an extensive study aimed at improving the efficacy of language models in situations characterized by limited data resources, a prevalent challenge in the field of natural language processing (NLP). The research emphasizes the development and refinement of data-efficient methods, which are essential for enhancing the robustness and functionality of language models in environments with scarce data resources.
At the heart of this thesis is the investigation of novel data augmentation approaches designed to enrich the training dataset. These include the creation of synthetic data through advanced algorithms, which generate realistic and varied linguistic examples to augment the training corpus without necessitating manual data annotation. Additionally, the study introduces techniques for semantic data transformation that modify existing data in semantically meaningful ways, thereby exposing models to a diverse range of linguistic structures and contexts.
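The thesis text itself contains no code; as a minimal illustration of the kind of semantic data transformation described above, the sketch below rewrites labeled sentences by substituting synonyms while keeping the label unchanged. The synonym table, function names, and example data here are all hypothetical, not taken from the thesis.

```python
import random

# Tiny hand-made synonym table (hypothetical; a real system would use
# a learned paraphraser or a lexical resource).
SYNONYMS = {
    "movie": ["film"],
    "great": ["excellent", "superb"],
    "bad": ["terrible", "awful"],
}

def augment(sentence: str, rng: random.Random) -> str:
    """Return a semantically similar variant of `sentence`."""
    out = []
    for tok in sentence.split():
        subs = SYNONYMS.get(tok.lower())
        out.append(rng.choice(subs) if subs else tok)
    return " ".join(out)

rng = random.Random(0)
labeled = [("the movie was great", "positive"),
           ("the movie was bad", "negative")]
# Each original example yields two extra synthetic pairs with the
# same label, enlarging the corpus without manual annotation.
augmented = [(augment(s, rng), y) for s, y in labeled for _ in range(2)]
```

Because only surface forms change, no new annotation is needed: the label of each synthetic pair is inherited from its source example.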
The research also examines how these data augmentation methods improve language models' resilience to overfitting, a frequent issue in low-resource settings. By diversifying and enriching the training data, the models generalize more effectively, yielding improved performance on new, unseen data.
Further, the thesis explores the integration of these data augmentation techniques with current NLP models, highlighting the synergistic advantages of combining innovative data enrichment methods with cutting-edge language models. This integration not only increases model robustness but also broadens the models' applicability to a more diverse array of languages and dialects, especially those with sparse data.
Moreover, in the era of Large Language Models (LLMs), this thesis explores algorithms that leverage LLMs' intrinsic abilities to comprehend and generate contextually appropriate augmentations, thus enriching training data while maintaining its quality.
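One way to picture the LLM-driven augmentation loop described above: prompt a model for paraphrases of a labeled example, then filter the candidates for quality before adding them to the training set. The sketch below is hypothetical; `llm_complete` is a stand-in stub, not any real provider's API, and the prompt and filtering rules are illustrative assumptions.

```python
def llm_complete(prompt: str) -> list[str]:
    # Stubbed responses; a real system would call an LLM API here.
    return ["the film was fantastic", "the movie was fantastic",
            "the movie was great"]  # last one duplicates the input

def augment_with_llm(text: str, label: str) -> list[tuple[str, str]]:
    prompt = (f"Rewrite the sentence below in 3 different ways, "
              f"keeping its '{label}' sentiment:\n{text}")
    candidates = llm_complete(prompt)
    seen = {text}
    kept = []
    for cand in candidates:
        cand = cand.strip().lower()
        # Quality control: drop duplicates and degenerate outputs.
        if cand in seen or len(cand.split()) < 3:
            continue
        seen.add(cand)
        kept.append((cand, label))
    return kept

pairs = augment_with_llm("the movie was great", "positive")
```

The filtering step is what "maintaining its quality" amounts to in practice: generations that merely echo the input, or collapse to trivial strings, are discarded before they reach the training data.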
The empirical results presented in this thesis demonstrate the effectiveness of the proposed data augmentation techniques, showing substantial gains in model accuracy, resilience, and generalization across a range of NLP tasks, including sentiment analysis, named entity recognition, part-of-speech tagging, relation extraction, and task-oriented dialogue systems.
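For sequence-labeling tasks such as named entity recognition, augmentation must keep labels aligned with tokens. A common, simple scheme (the code below is an illustrative sketch, not the thesis's method; the replacement table and examples are hypothetical) swaps an entity mention for another mention of the same type while regenerating the BIO tags for the new span.

```python
# Replacement mentions per entity type (hypothetical).
REPLACEMENTS = {"PER": [["Marie", "Curie"]], "LOC": [["Singapore"]]}

def swap_entities(tokens, tags):
    """Replace each tagged entity span with a same-type mention,
    emitting BIO tags that match the new span's length."""
    out_toks, out_tags, i = [], [], 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1                      # extend over the I- tags
            repl = REPLACEMENTS.get(etype, [tokens[i:j]])[0]
            out_toks += repl
            out_tags += [f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1)
            i = j
        else:
            out_toks.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_toks, out_tags

toks = ["Bosheng", "Ding", "studied", "in", "Nanyang"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
new_toks, new_tags = swap_entities(toks, tags)
```

Because tags are rebuilt from the replacement span's length, token and tag sequences stay the same length and the BIO structure remains valid, which is the key constraint for augmenting NER data.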
In summary, this thesis makes a significant contribution to NLP by introducing innovative data-efficient methods that bolster the resilience of language models in low-resource scenarios. The research findings and methodologies pave the way for future studies in enhancing language model robustness, thereby expanding the reach of NLP technologies to a broader spectrum of languages and applications. The thesis concludes by identifying and discussing several promising avenues for future research in this domain. |
first_indexed | 2024-10-01T04:33:29Z |
format | Thesis-Doctor of Philosophy |
id | ntu-10356/179560 |
institution | Nanyang Technological University |
language | English |
last_indexed | 2024-10-01T04:33:29Z |
publishDate | 2024 |
publisher | Nanyang Technological University |
record_format | dspace |
spelling | ntu-10356/179560 2024-09-04T07:56:36Z Advancing low resource information extraction and dialogue system using data efficient methods Ding, Bosheng Joty Shafiq Rayhan Luu Anh Tuan Miao Chun Yan School of Computer Science and Engineering srjoty@ntu.edu.sg, anhtuan.luu@ntu.edu.sg, ASCYMiao@ntu.edu.sg Computer and Information Science Natural language processing Large language models Artificial intelligence Machine learning Doctor of Philosophy 2024-08-12T04:27:19Z 2024-08-12T04:27:19Z 2024 Thesis-Doctor of Philosophy Ding, B. (2024). Advancing low resource information extraction and dialogue system using data efficient methods. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179560 https://hdl.handle.net/10356/179560 10.32657/10356/179560 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |
spellingShingle | Computer and Information Science Natural language processing Large language models Artificial intelligence Machine learning Ding, Bosheng Advancing low resource information extraction and dialogue system using data efficient methods |
title | Advancing low resource information extraction and dialogue system using data efficient methods |
title_full | Advancing low resource information extraction and dialogue system using data efficient methods |
title_fullStr | Advancing low resource information extraction and dialogue system using data efficient methods |
title_full_unstemmed | Advancing low resource information extraction and dialogue system using data efficient methods |
title_short | Advancing low resource information extraction and dialogue system using data efficient methods |
title_sort | advancing low resource information extraction and dialogue system using data efficient methods |
topic | Computer and Information Science Natural language processing Large language models Artificial intelligence Machine learning |
url | https://hdl.handle.net/10356/179560 |
work_keys_str_mv | AT dingbosheng advancinglowresourceinformationextractionanddialoguesystemusingdataefficientmethods |