Arabic Toxic Tweet Classification: Leveraging the AraBERT Model
Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant effort...
Main Authors: | Amr Mohamed El Koshiry, Entesar Hamed I. Eliwa, Tarek Abd El-Hafeez, Ahmed Omar |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-10-01 |
Series: | Big Data and Cognitive Computing |
Subjects: | Arabic toxic; toxic classification; Arabic NLP; BERT |
Online Access: | https://www.mdpi.com/2504-2289/7/4/170 |
_version_ | 1827575638748299264 |
---|---|
author | Amr Mohamed El Koshiry Entesar Hamed I. Eliwa Tarek Abd El-Hafeez Ahmed Omar |
author_facet | Amr Mohamed El Koshiry Entesar Hamed I. Eliwa Tarek Abd El-Hafeez Ahmed Omar |
author_sort | Amr Mohamed El Koshiry |
collection | DOAJ |
description | Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google’s Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures. |
first_indexed | 2024-03-08T20:59:53Z |
format | Article |
id | doaj.art-9d344e028cb441e0a79b6daa91fc1801 |
institution | Directory of Open Access Journals |
issn | 2504-2289 |
language | English |
last_indexed | 2024-03-08T20:59:53Z |
publishDate | 2023-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Big Data and Cognitive Computing |
spelling | doaj.art-9d344e028cb441e0a79b6daa91fc1801; 2023-12-22T13:53:32Z; eng; MDPI AG; Big Data and Cognitive Computing; 2504-2289; 2023-10-01; vol. 7, no. 4, art. 170; 10.3390/bdcc7040170; Arabic Toxic Tweet Classification: Leveraging the AraBERT Model; Amr Mohamed El Koshiry (Department of Curricula and Teaching Methods, College of Education, King Faisal University, P.O. Box 400, Al-Ahsa 31982, Saudi Arabia); Entesar Hamed I. Eliwa (Department of Mathematics and Statistics, College of Science, King Faisal University, P.O. Box 400, Al-Ahsa 31982, Saudi Arabia); Tarek Abd El-Hafeez (Department of Computer Science, Faculty of Science, Minia University, Minia 61519, Egypt); Ahmed Omar (Department of Computer Science, Faculty of Science, Minia University, Minia 61519, Egypt); https://www.mdpi.com/2504-2289/7/4/170; Arabic toxic; toxic classification; Arabic NLP; BERT |
title | Arabic Toxic Tweet Classification: Leveraging the AraBERT Model |
title_sort | arabic toxic tweet classification leveraging the arabert model |
topic | Arabic toxic toxic classification Arabic NLP BERT |
url | https://www.mdpi.com/2504-2289/7/4/170 |