Sentic computing for social good: sentiment analysis on toxic comment

Bibliographic Details
Main Author: Wang Jingtan
Other Authors: Erik Cambria
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University 2022
Subjects: Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/156503
_version_ 1811693808934453248
author Wang Jingtan
author2 Erik Cambria
author_facet Erik Cambria
Wang Jingtan
author_sort Wang Jingtan
collection NTU
description With the neural network revolution and increased computational power, Artificial Intelligence has been applied in many fields to improve life, including concept-level sentiment analysis. We focused on one application of sentiment analysis: toxic comment detection. These inappropriate messages, hidden in massive volumes of data, inflict verbal violence on their recipients. We therefore aimed to detect the toxicity of content from raw textual input, outputting whether it is toxic or not. We selected an open-source multilabel dataset of around 150k samples, in which each sentence is annotated with six categories of toxic behavior, and we set out to predict which of these six labels a text belongs to. To achieve this, we reviewed and experimented with the state-of-the-art methods in this field, namely pre-trained models. We then improved the models based on the main issue we observed during the experiments: imbalanced multilabel data. We reviewed various approaches discussed in papers and journals, such as external knowledge about minority labels, cost-sensitive metrics, and resampling, and compared them to find an effective way to address the imbalance. Note that, due to resource constraints, we sampled only ten percent of the original data for our experiments. Overall, we identified the best-fitting pre-trained model, BERT, and improved it for imbalanced multilabel classification by using focal loss and random oversampling. We hope the reviews, the experiments, and the results can contribute to the toxic comment challenge. We also pointed out the limitations of this project, namely the lack of resources and some unexpected behaviors, as well as possible future directions: active learning and data-augmentation-supported resampling.
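The abstract names focal loss as the remedy for the imbalanced multilabel problem. The record does not include the project's code, so the following is only a minimal illustrative sketch of binary focal loss over six toxicity labels in PyTorch, assuming a classifier (e.g. a BERT head) that outputs one logit per label; the function name and the gamma/alpha defaults are common choices, not values taken from the thesis.

    import torch
    import torch.nn.functional as F

    def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        """Binary focal loss averaged over labels (illustrative sketch).

        logits:  (batch, num_labels) raw model outputs, e.g. from a BERT head
        targets: (batch, num_labels) 0/1 multilabel ground truth
        """
        targets = targets.float()
        # Per-label binary cross-entropy, kept unreduced.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        # p_t: probability the model assigns to the true class of each label.
        probs = torch.sigmoid(logits)
        p_t = targets * probs + (1 - targets) * (1 - probs)
        # Down-weight easy examples by (1 - p_t)^gamma; alpha rebalances pos/neg.
        alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
        loss = alpha_t * (1 - p_t) ** gamma * bce
        return loss.mean()

In a setup like the one the abstract describes, such a loss would replace plain binary cross-entropy during fine-tuning, while random oversampling would be applied to the minority-label examples when building the training set.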
first_indexed 2024-10-01T06:57:34Z
format Final Year Project (FYP)
id ntu-10356/156503
institution Nanyang Technological University
language English
last_indexed 2024-10-01T06:57:34Z
publishDate 2022
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/156503 2022-04-18T08:49:51Z Sentic computing for social good: sentiment analysis on toxic comment Wang Jingtan Erik Cambria School of Computer Science and Engineering cambria@ntu.edu.sg Engineering::Computer science and engineering With the neural network revolution and increased computational power, Artificial Intelligence has been applied in many fields to improve life, including concept-level sentiment analysis. We focused on one application of sentiment analysis: toxic comment detection. These inappropriate messages, hidden in massive volumes of data, inflict verbal violence on their recipients. We therefore aimed to detect the toxicity of content from raw textual input, outputting whether it is toxic or not. We selected an open-source multilabel dataset of around 150k samples, in which each sentence is annotated with six categories of toxic behavior, and we set out to predict which of these six labels a text belongs to. To achieve this, we reviewed and experimented with the state-of-the-art methods in this field, namely pre-trained models. We then improved the models based on the main issue we observed during the experiments: imbalanced multilabel data. We reviewed various approaches discussed in papers and journals, such as external knowledge about minority labels, cost-sensitive metrics, and resampling, and compared them to find an effective way to address the imbalance. Note that, due to resource constraints, we sampled only ten percent of the original data for our experiments. Overall, we identified the best-fitting pre-trained model, BERT, and improved it for imbalanced multilabel classification by using focal loss and random oversampling. We hope the reviews, the experiments, and the results can contribute to the toxic comment challenge. We also pointed out the limitations of this project, namely the lack of resources and some unexpected behaviors, as well as possible future directions: active learning and data-augmentation-supported resampling. Bachelor of Engineering (Computer Science) 2022-04-18T08:49:50Z 2022-04-18T08:49:50Z 2022 Final Year Project (FYP) Wang Jingtan (2022). Sentic computing for social good: sentiment analysis on toxic comment. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156503 https://hdl.handle.net/10356/156503 en SCSE21-0238 application/pdf Nanyang Technological University
spellingShingle Engineering::Computer science and engineering
Wang Jingtan
Sentic computing for social good: sentiment analysis on toxic comment
title Sentic computing for social good: sentiment analysis on toxic comment
title_full Sentic computing for social good: sentiment analysis on toxic comment
title_fullStr Sentic computing for social good: sentiment analysis on toxic comment
title_full_unstemmed Sentic computing for social good: sentiment analysis on toxic comment
title_short Sentic computing for social good: sentiment analysis on toxic comment
title_sort sentic computing for social good sentiment analysis on toxic comment
topic Engineering::Computer science and engineering
url https://hdl.handle.net/10356/156503
work_keys_str_mv AT wangjingtan senticcomputingforsocialgoodsentimentanalysisontoxiccomment