Detection of Cyberbullying Patterns in Low Resource Colloquial Roman Urdu Microtext using Natural Language Processing, Machine Learning, and Ensemble Techniques

Social media platforms have become a substratum for people to enunciate their opinions and ideas across the globe. Due to anonymity preservation and freedom of expression, it is possible to humiliate individuals and groups, disregarding social etiquette online, inevitably proliferating and diversify...


Bibliographic Details
Main Authors:	Amirita Dewani, Mohsin Ali Memon, Sania Bhatti, Adel Sulaiman, Mohammed Hamdi, Hani Alshahrani, Abdullah Alghamdi, Asadullah Shaikh
Format:	Article
Language:	English
Published:	MDPI AG 2023-02-01
Series:	Applied Sciences
Subjects:	natural language processing; low resource Roman Urdu language; cyberbullying detection; machine learning; ensemble learning; hate speech
Online Access:	https://www.mdpi.com/2076-3417/13/4/2062

collection	DOAJ
description	Social media platforms have become a substratum for people to enunciate their opinions and ideas across the globe. Due to anonymity preservation and freedom of expression, it is possible to humiliate individuals and groups, disregarding social etiquette online, inevitably proliferating and diversifying the incidents of cyberbullying and cyber hate speech. This intimidating problem has recently attracted the attention of researchers and scholars worldwide. Still, the current practices to sift the online content and offset the spread of hatred do not go far enough. Contributing factors include the recent prevalence of regional languages in social media and the dearth of language resources and of flexible detection approaches, specifically for low-resource languages. In this context, most existing studies are oriented towards traditional resource-rich languages and highlight a huge gap in recently embraced resource-poor languages. One such language, currently adopted worldwide and more typically by South Asian users for textual communication on social networks, is Roman Urdu. It is derived from Urdu and written left-to-right in Roman script. This language poses numerous computational challenges for natural language preprocessing tasks due to its inflections, derivations, lexical variations, and morphological richness. To alleviate this problem, this research proposes a cyberbullying detection approach for analyzing textual data in the Roman Urdu language based on advanced preprocessing methods, voting-based ensemble techniques, and machine learning algorithms. The study has extracted a vast number of features, including statistical features, word n-grams, combined n-grams, and a bag-of-words (BOW) model with TF-IDF weighting, in different experimental settings using GridSearchCV and cross-validation techniques. The detection approach has been designed to tackle users’ textual input by considering user-specific writing styles on social media in a colloquial and non-standard form. The experimental results show that SVM with embedded hybrid n-gram features produced the highest average accuracy of around 83%. Among the ensemble voting-based techniques, XGBoost achieved the highest accuracy, 79%. Both implicit and explicit Roman Urdu instances were evaluated, and the categorization of severity based on prediction probabilities was performed. Time complexity is also analyzed in terms of execution time, indicating that LR, using different parameters and feature combinations, is the fastest algorithm. The results are promising with respect to standard assessment metrics and indicate the feasibility of the proposed approach in cyberbullying detection for the Roman Urdu language.
institution	Directory Open Access Journal
issn	2076-3417
spelling	doaj.art-404e21cb461c4e02b74bc09eae6a6e2c; 2023-11-16T18:50:22Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-02-01; vol. 13, no. 4, art. 2062; DOI 10.3390/app13042062
	Amirita Dewani, Mohsin Ali Memon, Sania Bhatti: Department of Software Engineering, Institute of Information and Communication Technologies (IICT), Mehran University of Engineering and Technology, Jamshoro 76062, Pakistan
	Adel Sulaiman, Mohammed Hamdi, Hani Alshahrani: Department of Computer Science, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
	Abdullah Alghamdi, Asadullah Shaikh: Department of Information Systems, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia


Similar Items