Label semantics embedding and hierarchical attentions for text representation learning


Bibliographic Details
Main Author: Min, Fuzhou
Other Authors: Mao, Kezhi
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/165286
Description
Summary: Text classification is one of the most widely used and important NLP (Natural Language Processing) tasks. It aims to assign the most appropriate pre-defined label to a given document or sentence, with applications such as spam detection, topic classification, and sentiment analysis. One of the key steps in text classification is text representation. With the rapid development of machine learning, neural network models such as Convolutional Neural Networks and Recurrent Neural Networks have been widely employed for text representation learning. In most existing text classification models, however, the labels of the classification task are represented as one-hot vectors, independent of the semantics of the text data itself. For example, in a sentiment analysis task, the labels “positive” and “negative” are encoded as [1,0] and [0,1], so the semantic information of the labels is not fully exploited. Yet the semantics of labels are highly related to the text classification task, so the information contained in labels should not be disregarded. In this thesis, we propose a Label Embedding-based Hierarchical Attention Model (LE-HAM) that incorporates the semantic information of labels by jointly embedding the labels and the words. Further, to address a second problem, namely that a single attention mechanism does not achieve satisfactory results on data with weak signals, we introduce a model with a two-level attention framework built on the label semantics embedding. This hierarchical attention structure targets text data with weak signals: it first exploits the label information to select the key sentences, and then uses only these selected sentences, combined with the label information, to build the text representation. In this way, the majority of the noise can be removed.
The main novelty of this method is its sentence-selection mechanism. With it, the model can locate the key sentences even when the text contains considerable noise; keywords can then be located more efficiently, and the accuracy of text classification on such “weak signal” datasets can be improved.
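To make the described pipeline concrete, the following is a minimal numpy sketch of the two-level, label-aware attention idea: words and labels share one embedding space, sentence-level scores against the label embeddings select the key sentences, and word-level attention within only those sentences produces the final text representation. All shapes, the dot-product scoring, and the top-k selection are illustrative assumptions, not the exact architecture of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8              # shared embedding dimension for words and labels (assumed)
n_sent, n_word = 5, 6
n_labels = 2       # e.g. "positive" / "negative"

# Words and labels jointly embedded in the same d-dimensional space
# (random stand-ins for trained embeddings).
words = rng.normal(size=(n_sent, n_word, d))   # word embeddings per sentence
labels = rng.normal(size=(n_labels, d))        # label embeddings

# Level 1: sentence-level attention against the labels -> pick key sentences.
sent_vecs = words.mean(axis=1)                             # crude sentence vectors, (n_sent, d)
sent_scores = softmax((sent_vecs @ labels.T).max(axis=1))  # label relevance per sentence
top_k = 3
keep = np.argsort(sent_scores)[-top_k:]                    # indices of the key sentences

# Level 2: word-level attention within the selected sentences only,
# so noise from the discarded sentences never enters the representation.
doc_parts = []
for i in keep:
    w_scores = softmax((words[i] @ labels.T).max(axis=1))  # word relevance, (n_word,)
    doc_parts.append(w_scores @ words[i])                  # attended sentence vector, (d,)
doc_repr = np.mean(doc_parts, axis=0)                      # final text representation, (d,)

print(doc_repr.shape)
```

In a trained model the sentence vectors and attention scores would come from learned encoders rather than means and raw dot products, but the control flow, selecting sentences by label relevance before attending over their words, is the part the abstract emphasizes.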