Arabic Toxic Tweet Classification: Leveraging the AraBERT Model

Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant effort...


Bibliographic Details
Main Authors: Amr Mohamed El Koshiry, Entesar Hamed I. Eliwa, Tarek Abd El-Hafeez, Ahmed Omar
Format: Article
Language: English
Published: MDPI AG 2023-10-01
Series: Big Data and Cognitive Computing
Subjects: Arabic toxic; toxic classification; Arabic NLP; BERT
Online Access: https://www.mdpi.com/2504-2289/7/4/170
author Amr Mohamed El Koshiry
Entesar Hamed I. Eliwa
Tarek Abd El-Hafeez
Ahmed Omar
collection DOAJ
description Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google’s Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures.
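For readers unfamiliar with the metric quoted in the description, accuracy is the share of items whose predicted label matches the reference label. The following is a minimal sketch in Python; the function name `accuracy` and the toy labels are illustrative assumptions and are not taken from the article or its dataset.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical, not from the article): 1 = toxic, 0 = non-toxic.
reference = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 0]
score = accuracy(reference, predicted)  # 7 of 8 labels match
print(f"accuracy = {score:.4f}")
```

On this scale, the reported value of 0.9960 corresponds to 996 correct labels per 1000 evaluated tweets.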
first_indexed 2024-03-08T20:59:53Z
format Article
id doaj.art-9d344e028cb441e0a79b6daa91fc1801
institution Directory Open Access Journal
issn 2504-2289
language English
last_indexed 2024-03-08T20:59:53Z
publishDate 2023-10-01
publisher MDPI AG
record_format Article
series Big Data and Cognitive Computing
spelling doaj.art-9d344e028cb441e0a79b6daa91fc1801 2023-12-22T13:53:32Z
eng
MDPI AG
Big Data and Cognitive Computing, ISSN 2504-2289
2023-10-01, vol. 7, no. 4, article 170
DOI 10.3390/bdcc7040170
Arabic Toxic Tweet Classification: Leveraging the AraBERT Model
Amr Mohamed El Koshiry (0), Entesar Hamed I. Eliwa (1), Tarek Abd El-Hafeez (2), Ahmed Omar (3)
(0) Department of Curricula and Teaching Methods, College of Education, King Faisal University, P.O. Box 400, Al-Ahsa 31982, Saudi Arabia
(1) Department of Mathematics and Statistics, College of Science, King Faisal University, P.O. Box 400, Al-Ahsa 31982, Saudi Arabia
(2) Department of Computer Science, Faculty of Science, Minia University, Minia 61519, Egypt
(3) Department of Computer Science, Faculty of Science, Minia University, Minia 61519, Egypt
https://www.mdpi.com/2504-2289/7/4/170
Arabic toxic; toxic classification; Arabic NLP; BERT
title Arabic Toxic Tweet Classification: Leveraging the AraBERT Model
topic Arabic toxic
toxic classification
Arabic NLP
BERT
url https://www.mdpi.com/2504-2289/7/4/170