ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment

Toxic language on social media is a newly emerging virtual disorder of human society. Detecting toxic language is an NLP task that requires a dataset of utterances [1]. For the Bangla language, very few datasets have been developed on toxicity or similar concepts [2]. A dataset has been developed us...


Bibliographic Details
Main Author: Mohammad Mamun Or Rashid
Format: Article
Language: English
Published: Elsevier 2022-08-01
Series: Data in Brief
Subjects: Cyberbullying; Online hate; Facebook Comments; Bengali slang
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340922006138
collection DOAJ
description Toxic language on social media is a newly emerging virtual disorder of human society. Detecting toxic language is an NLP task that requires a dataset of utterances [1]. For the Bangla language, very few datasets have been developed on toxicity or similar concepts [2]. This dataset was developed from user-generated content on Facebook and is intended to cover the demographic and thematic distribution of Bangla toxic language generated on the web. In total, 2,207,590 comments were collected and annotated, and about 1,959 unique bigrams were extracted from them as utterances, which serve as the base entries of the toxic language dataset. The core derivatives of the dataset are bigram-based wordlists, which were annotated inductively and divided into eight thematic classes that give some idea of the variations in toxicity found in the Bengali community. These thematic classes are dominated by political hate speech [3] and misogynistic bullying. The thematic labels can serve as class labels for text classification with machine learning. In addition to the thematic labels, the dataset includes further features: imprecise meanings in English, IPA transliteration, real occurrences in the source pages, spelling standards, and degree of toxicity. Because it is a dataset of utterances, its entries are de-identified and anonymous, and it poses no difficulties for public disclosure. We therefore regard this dataset as a toxic lexicon (Toxlex): an exhaustive wordlist that is essentially a curated, value-added, and analyzed dataset which can be used as classifier material to detect toxicity in social media.
first_indexed 2024-04-13T09:58:52Z
format Article
id doaj.art-c281b8adb7d344c7800ba16d541f8c9b
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-04-13T09:58:52Z
publishDate 2022-08-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-c281b8adb7d344c7800ba16d541f8c9b; 2022-12-22T02:51:17Z; eng; Elsevier; Data in Brief; 2352-3409; 2022-08-01; 43; 108416; ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment; Mohammad Mamun Or Rashid (Bangla Language Technology Specialist, Bangladesh Computer Council & Assistant Professor, Jahangirnagar University, Dhaka, Bangladesh); http://www.sciencedirect.com/science/article/pii/S2352340922006138; Cyberbullying; Online hate; Facebook Comments; Bengali slang
title ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment
topic Cyberbullying
Online hate
Facebook Comments
Bengali slang