Evaluation of semi-supervised classification algorithms with deep contextualizes document representations


Bibliographic Details
Main Author:	Yong, Hao
Other Authors:	Joty Shafiq Rayhan
Format:	Final Year Project (FYP)
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	https://hdl.handle.net/10356/147954

_version_	1826120928553074688
author	Yong, Hao
author2	Joty Shafiq Rayhan
author_facet	Joty Shafiq Rayhan Yong, Hao
author_sort	Yong, Hao
collection	NTU
description	Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems. First, labelled training data are insufficient and hard to acquire, while unlabelled data are available in abundance. Second, unstructured source texts in diverse formats must be embedded into structured, fixed-length vector representations while preserving semantic relations and high-level concepts between words. Co-training is a prominent solution to the former problem, as the labelled training data are replenished with the most confident predictions. However, co-training requires two sufficient and redundant views of the same training data, which might not be available in real-life cases. In 2005, Zhou et al. proposed tri-training, a semi-supervised learning algorithm that extends co-training, which inspired us to investigate further. Thus, in this project, we conduct a systematic evaluation of tri-training, a semi-supervised text classification algorithm that automatically labels unlabelled data in each training iteration to refine its classifiers and does not assume multiple sufficient and redundant views, along with traditional and recent distributed document representations (TF-IDF, doc2vec, BERT, ELMo, Universal Sentence Encoder, SkipThoughts, InferSent, GenSen). In the designed experiments, we compare the performance of tri-training with that of its semi-supervised learning counterparts, self-training and co-training. Then, using the results as the new baseline, we evaluate the performance gain from expanding the redundancy of the training data by providing each classifier of tri-training with a different representation. In addition, various conventional classifiers were adopted and evaluated, including Naïve Bayes, Support Vector Machine, Random Forest, Multi-layer Perceptron, and XGBoost.
first_indexed	2024-10-01T05:24:27Z
format	Final Year Project (FYP)
id	ntu-10356/147954
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T05:24:27Z
publishDate	2021
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/147954 2021-04-20T07:39:08Z Evaluation of semi-supervised classification algorithms with deep contextualizes document representations Yong, Hao Joty Shafiq Rayhan Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg, srjoty@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Information storage and retrieval Bachelor of Engineering (Computer Science) 2021-04-20T07:39:08Z 2021-04-20T07:39:08Z 2021 Final Year Project (FYP) Yong, H. (2021). Evaluation of semi-supervised classification algorithms with deep contextualizes document representations. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147954 https://hdl.handle.net/10356/147954 en SCSE20-0249 application/pdf Nanyang Technological University
title	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
topic	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
url	https://hdl.handle.net/10356/147954
work_keys_str_mv	AT yonghao evaluationofsemisupervisedclassificationalgorithmswithdeepcontextualizesdocumentrepresentations


Similar Items