Evaluation of semi-supervised classification algorithms with deep contextualizes document representations

Bibliographic Details
Main Author: Yong, Hao
Other Authors: Joty Shafiq Rayhan
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University 2021
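Subjects: Engineering::Computer science and engineering::Information systems::Information storage and retrieval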
Online Access: https://hdl.handle.net/10356/147954
author Yong, Hao
author2 Joty Shafiq Rayhan
collection NTU
description Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems. First, labelled training data are scarce and hard to acquire, while unlabelled data are abundant. Second, unstructured source texts in diverse formats must be embedded into structured, fixed-length vector representations that preserve the semantic relations and high-level concepts between words. Co-training is a prominent solution to the first problem: it replenishes the labelled training data with the most confident predictions. However, co-training requires two sufficient and redundant views of the same training data, which might not be available in real-life cases. In 2005, Zhou et al. proposed tri-training, a semi-supervised learning algorithm that extends co-training, and this work inspired our further investigation. In this project, we therefore conduct a systematic evaluation of tri-training, a semi-supervised text classification algorithm that automatically labels unlabelled data in each training iteration to refine its classifiers and does not assume multiple sufficient and redundant views, in combination with traditional and recent distributed document representations (TF-IDF, doc2vec, BERT, ELMo, Universal Sentence Encoder, SkipThoughts, InferSent, GenSen). In the designed experiments, we compare the performance of tri-training with that of its semi-supervised learning counterparts, self-training and co-training. Then, using these results as the new baseline, we evaluate the performance gain from increasing the redundancy of the training data by providing each tri-training classifier with a different representation. In addition, various conventional classifiers are adopted and evaluated, including Naïve Bayes, Support Vector Machine, Random Forest, Multi-layer Perceptron, and XGBoost.
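The tri-training loop described in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: it is a simplified variant that omits the error-rate checks of the original algorithm by Zhou et al., uses a tiny nearest-centroid classifier as a stand-in for the classifiers named above, and uses toy 2-D vectors in place of real document embeddings. All names (`NearestCentroid`, `tri_train`, `predict_majority`) are hypothetical.

```python
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base classifier (assumption); anything with fit/predict would do."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
        # One mean vector per class.
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict(self, X):
        def nearest(x):
            return min(
                self.centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, self.centroids[c])),
            )
        return [nearest(x) for x in X]


def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    """Simplified tri-training: no error-rate conditions, fixed number of rounds."""
    rng = random.Random(seed)
    n = len(X_lab)
    models = []
    for _ in range(3):
        # Each initial classifier sees a different bootstrap sample of the labelled data.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            NearestCentroid().fit([X_lab[i] for i in idx], [y_lab[i] for i in idx])
        )
    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        refined = []
        for i in range(3):
            j, k = [t for t in range(3) if t != i]
            # Pseudo-label the unlabelled examples on which the other two classifiers agree.
            extra = [(x, pj) for x, pj, pk in zip(X_unlab, preds[j], preds[k]) if pj == pk]
            X_i = list(X_lab) + [x for x, _ in extra]
            y_i = list(y_lab) + [label for _, label in extra]
            refined.append(NearestCentroid().fit(X_i, y_i))
        models = refined
    return models


def predict_majority(models, X):
    """Final prediction: majority vote of the three classifiers."""
    votes = zip(*(m.predict(X) for m in models))
    return [Counter(v).most_common(1)[0][0] for v in votes]


if __name__ == "__main__":
    # Toy vectors standing in for document embeddings (illustrative data only).
    X_lab = [(0, 0), (1, 0), (0, 1), (10, 10), (9, 10), (10, 9)]
    y_lab = [0, 0, 0, 1, 1, 1]
    X_unlab = [(0.5, 0.2), (1, 1), (-0.5, 0.3), (9.5, 10), (10.5, 9.8), (9, 9)]
    models = tri_train(X_lab, y_lab, X_unlab)
    print(predict_majority(models, [(0.2, 0.1), (9.8, 10.1)]))
```

Because every classifier here receives the same vectors, the sketch corresponds to the single-view setting. Giving each classifier a different document representation, as in the second experiment, would mean passing a separate feature matrix per classifier for the same documents.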
first_indexed 2024-10-01T05:24:27Z
format Final Year Project (FYP)
id ntu-10356/147954
institution Nanyang Technological University
language English
last_indexed 2024-10-01T05:24:27Z
publishDate 2021
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/147954 2021-04-20T07:39:08Z Evaluation of semi-supervised classification algorithms with deep contextualizes document representations Yong, Hao Joty Shafiq Rayhan Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg, srjoty@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Information storage and retrieval Bachelor of Engineering (Computer Science) 2021-04-20T07:39:08Z 2021-04-20T07:39:08Z 2021 Final Year Project (FYP) Yong, H. (2021). Evaluation of semi-supervised classification algorithms with deep contextualizes document representations. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147954 https://hdl.handle.net/10356/147954 en SCSE20-0249 application/pdf Nanyang Technological University
title Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
topic Engineering::Computer science and engineering::Information systems::Information storage and retrieval
url https://hdl.handle.net/10356/147954