L-Boost: Identifying Offensive Texts From Social Media Post in Bengali

Due to the significant increase in Internet activity since the COVID-19 epidemic, much informal, unstructured, offensive, and even misspelled textual content has been used for online communication through various social media. The Bengali and Banglish (Bengali words written in English format) offensi...

Bibliographic Details
Main Authors: M. F. Mridha, Md. Anwar Hussen Wadud, Md. Abdul Hamid, Muhammad Mostafa Monowar, M. Abdullah-Al-Wadud, Atif Alamri
Format: Article
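Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Offensive text; social media harassment; natural language processing; ensemble learning; BERT model
Online Access: https://ieeexplore.ieee.org/document/9642973/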
_version_ 1811209834774659072
author M. F. Mridha
Md. Anwar Hussen Wadud
Md. Abdul Hamid
Muhammad Mostafa Monowar
M. Abdullah-Al-Wadud
Atif Alamri
author_sort M. F. Mridha
collection DOAJ
description Internet activity has increased significantly since the COVID-19 epidemic, and much of the resulting communication on social media consists of informal, unstructured, offensive, and even misspelled text. Offensive texts in Bengali and Banglish (Bengali words written in English letters) have recently been widely used to harass and criticize people on various social media. Our review of existing work reveals that little has been done to identify offensive Bengali texts. In this study, we have engineered a detection mechanism that uses natural language processing to identify offensive Bengali and Banglish messages on social media that could abuse other people. First, different classifiers are employed as baselines to classify offensive text from real-life datasets. Then, boosting algorithms are applied on top of the baseline classifiers. AdaBoost (adaptive boosting) is among the most effective ensemble methods and enhances the outcomes of the classifiers. The long short-term memory (LSTM) model mitigates long-term dependency problems when classifying text, but it is prone to overfitting. AdaBoost has strong predictive ability and does not overfit easily. Combining these two powerful and diverse models, we propose L-Boost, a modified AdaBoost algorithm that uses bidirectional encoder representations from transformers (BERT) with LSTM models. We tested the L-Boost model, including the BERT pre-trained word-embedding vector model, on three separate datasets. The proposed L-Boost outperforms all the baseline classification algorithms, reaching an accuracy of 95.11%.
first_indexed 2024-04-12T04:45:44Z
format Article
id doaj.art-54468f8cfc684c3fb58f1fba6159cd76
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-12T04:45:44Z
publishDate 2021-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-54468f8cfc684c3fb58f1fba6159cd76
2022-12-22T03:47:30Z
eng
IEEE
IEEE Access
ISSN 2169-3536
2021-01-01
Vol. 9, pp. 164681-164699
DOI 10.1109/ACCESS.2021.3134154
9642973
L-Boost: Identifying Offensive Texts From Social Media Post in Bengali
M. F. Mridha [0] https://orcid.org/0000-0001-5738-1631
Md. Anwar Hussen Wadud [1] https://orcid.org/0000-0002-7344-0838
Md. Abdul Hamid [2] https://orcid.org/0000-0001-9698-4726
Muhammad Mostafa Monowar [3] https://orcid.org/0000-0003-2822-2572
M. Abdullah-Al-Wadud [4] https://orcid.org/0000-0001-6767-3574
Atif Alamri [5] https://orcid.org/0000-0002-1887-5193
[0] Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh
[1] Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh
[2] Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
[3] Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
[4] Research Chair of Pervasive and Mobile Computing, King Saud University, Riyadh, Saudi Arabia
[5] Research Chair of Pervasive and Mobile Computing, King Saud University, Riyadh, Saudi Arabia
https://ieeexplore.ieee.org/document/9642973/
Offensive text
social media harassment
natural language processing
ensemble learning
BERT model
title L-Boost: Identifying Offensive Texts From Social Media Post in Bengali
title_sort l boost identifying offensive texts from social media post in bengali
topic Offensive text
social media harassment
natural language processing
ensemble learning
BERT model
url https://ieeexplore.ieee.org/document/9642973/
work_keys_str_mv AT mfmridha lboostidentifyingoffensivetextsfromsocialmediapostinbengali
AT mdanwarhussenwadud lboostidentifyingoffensivetextsfromsocialmediapostinbengali
AT mdabdulhamid lboostidentifyingoffensivetextsfromsocialmediapostinbengali
AT muhammadmostafamonowar lboostidentifyingoffensivetextsfromsocialmediapostinbengali
AT mabdullahalwadud lboostidentifyingoffensivetextsfromsocialmediapostinbengali
AT atifalamri lboostidentifyingoffensivetextsfromsocialmediapostinbengali
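
The description in this record mentions applying AdaBoost on top of baseline classifiers. As a rough illustration of that general idea only, the sketch below runs textbook AdaBoost with one-threshold decision stumps on a made-up one-dimensional toy dataset. It is not the L-Boost algorithm from the article: it involves no BERT embeddings or LSTM models, and the function names (`train_stump`, `stump_predict`, `adaboost`, `predict`) and the data are hypothetical choices made for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: answer `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Textbook AdaBoost for labels in {-1, +1}; returns weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak learner
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump can label this toy data correctly, but the weighted vote of three stumps can, which is the effect boosting relies on. A text classifier would replace the stumps with stronger base learners and the integers with feature vectors.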