Multi-Class Imbalance in Text Classification: A Feature Engineering Approach to Detect Cyberbullying in Twitter

Twitter enables millions of active users to send and read concise messages on the internet every day. Yet some people use Twitter to propagate violent and threatening messages resulting in cyberbullying. Previous research has focused on whether cyberbullying behavior exists or not in a tweet (binary...

Bibliographic Details
Main Authors: Bandeh Ali Talpur, Declan O’Sullivan
Format: Article
Language: English
Published: MDPI AG 2020-11-01
Series: Informatics
Subjects: cyberbullying, Twitter, social networks, algorithms
Online Access: https://www.mdpi.com/2227-9709/7/4/52
_version_ 1797547844384587776
author Bandeh Ali Talpur
Declan O’Sullivan
collection DOAJ
description Twitter enables millions of active users to send and read concise messages on the internet every day. Yet some people use Twitter to propagate violent and threatening messages, resulting in cyberbullying. Previous research has focused on whether or not cyberbullying behavior exists in a tweet (binary classification). In this research, we developed a model for detecting the severity of cyberbullying in a tweet. The developed model is a feature-based model that uses features from the content of a tweet to develop a machine learning classifier that labels tweets as non-cyberbullied or as low-, medium-, or high-level cyberbullied. In this study, we introduced pointwise semantic orientation as a new input feature, along with predicted features (gender, age, and personality type) and Twitter API features. Results from experiments with our proposed framework in a multi-class setting are promising with respect to the Kappa (84%), classifier accuracy (93%), and F-measure (92%) metrics. Overall, 40% of the classifiers increased performance in comparison with baseline approaches. Our analysis shows that the features with the highest odds ratios are: for low-level severity, the 19–22 age group and users whose Twitter account was activated less than one year ago; for medium-level severity, neuroticism, the 23–29 age group, and having been a Twitter user for one to two years; and for high-level severity, neuroticism, extraversion, and the number of times a tweet has been favorited by other users. We believe that this research, using a multi-class classification approach, provides a step forward in identifying severity at different levels (low, medium, high) when the content of a tweet is classified as cyberbullied. Lastly, the current study focused only on the Twitter platform; other social network platforms can be investigated using the same approach to detect cyberbullying severity patterns.
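The description reports Kappa, classifier accuracy, and F-measure for a four-class severity classifier. As a minimal sketch of how such metrics are computed in a multi-class setting, the Python below uses invented toy labels, hypothetical class names, and macro averaging for the F-measure; none of these come from the article, which does not state its averaging scheme in this record.

```python
from collections import Counter

# Hypothetical class names for the four severity levels (an assumption).
LABELS = ["none", "low", "medium", "high"]

def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the two marginal class distributions.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in LABELS) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for c in LABELS:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, deliberately imbalanced data (invented for illustration only).
y_true = ["none"] * 6 + ["low"] * 2 + ["medium", "high"]
y_pred = ["none"] * 5 + ["low"] + ["low", "low"] + ["medium", "medium"]

print("accuracy:", accuracy(y_true, y_pred))
print("kappa:   ", cohen_kappa(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

On imbalanced data such as this, accuracy can stay high while Kappa and macro F-measure drop, because the rare "high" class is never predicted correctly; this is one reason a multi-class imbalance study would report all three.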
first_indexed 2024-03-10T14:50:00Z
format Article
id doaj.art-aafd51f9165d4138a9c096814cd53f6f
institution Directory Open Access Journal
issn 2227-9709
language English
last_indexed 2024-03-10T14:50:00Z
publishDate 2020-11-01
publisher MDPI AG
record_format Article
series Informatics
spelling doaj.art-aafd51f9165d4138a9c096814cd53f6f; 2023-11-20T21:02:48Z; eng; MDPI AG; Informatics; 2227-9709; 2020-11-01; 7; 4; 52; 10.3390/informatics7040052; Multi-Class Imbalance in Text Classification: A Feature Engineering Approach to Detect Cyberbullying in Twitter; Bandeh Ali Talpur (School of Computer Science and Statistics, Trinity College Dublin, D02 PN40 Dublin, Ireland); Declan O’Sullivan (School of Computer Science and Statistics, Trinity College Dublin, D02 PN40 Dublin, Ireland); https://www.mdpi.com/2227-9709/7/4/52; cyberbullying; Twitter; social networks; algorithms
title Multi-Class Imbalance in Text Classification: A Feature Engineering Approach to Detect Cyberbullying in Twitter
topic cyberbullying
Twitter
social networks
algorithms
url https://www.mdpi.com/2227-9709/7/4/52