Dataset of Arabic spam and ham tweets

This data article provides a dataset of 132,421 posts, with their associated information, collected from Twitter. The data has two classes, ham and spam, where ham denotes clean, non-spam tweets. The main target of this dataset is to study a way to classify whether a post is a spam or no...

Bibliographic Details
Main Authors: Sanaa Kaddoura, Safaa Henno
Format: Article
Language: English
Published: Elsevier 2024-02-01
Series: Data in Brief
Subjects: Twitter; Labelled data; Classification; Machine learning; Deep learning; Cybersecurity
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340923009472
author Sanaa Kaddoura
Safaa Henno
collection DOAJ
description This data article provides a dataset of 132,421 posts, with their associated information, collected from Twitter. The data has two classes, ham and spam, where ham denotes clean, non-spam tweets. The dataset is intended to support research on automatically classifying a post as spam or not. The data is in Arabic only, which makes it valuable to researchers in Arabic natural language processing (NLP), given the scarcity of resources for this language. It is publicly available so that researchers can use it as a benchmark for their work in Arabic NLP. The dataset was collected through the Twitter REST API between January 27, 2021, and March 10, 2021, using an ad hoc crawler written in Python. Researchers in cybersecurity, NLP, data science, and social network analysis can benefit from this dataset.
first_indexed 2024-03-08T03:30:14Z
format Article
id doaj.art-5542bfcc0aae4d998cdd9a264876563a
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-03-08T03:30:14Z
publishDate 2024-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-5542bfcc0aae4d998cdd9a264876563a | 2024-02-11T05:10:30Z | eng | Elsevier | Data in Brief | 2352-3409 | 2024-02-01 | 52 | 109904 | Dataset of Arabic spam and ham tweets | Sanaa Kaddoura (corresponding author), Zayed University, Abu Dhabi, UAE | Safaa Henno, Zayed University, Abu Dhabi, UAE | http://www.sciencedirect.com/science/article/pii/S2352340923009472 | Twitter | Labelled data | Classification | Machine learning | Deep learning | Cybersecurity
title Dataset of Arabic spam and ham tweets
topic Twitter
Labelled data
Classification
Machine learning
Deep learning
Cybersecurity
url http://www.sciencedirect.com/science/article/pii/S2352340923009472