AraCust: a Saudi Telecom Tweets corpus for sentiment analysis

Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper prese...

Full description

Bibliographic Details
Main Authors:	Latifah Almuqren, Alexandra Cristea
Format:	Article
Language:	English
Published:	PeerJ Inc. 2021-05-01
Series:	PeerJ Computer Science
Subjects:	Sentiment analysis; Arabic; Gold Standard Corpus; Supervised approach
Online Access:	https://peerj.com/articles/cs-510.pdf

_version_	1818939705569312768
author	Latifah Almuqren, Alexandra Cristea
author_facet	Latifah Almuqren, Alexandra Cristea
author_sort	Latifah Almuqren
collection	DOAJ
description	Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated our 20,000-tweet Gold Standard Corpus (GSC) AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k = 0.60). In addition, we illustrate AraCust's power by performing an exploratory data analysis of the features that arise from the nature of our corpus, to assist with choosing the right ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first ran a simple experiment, using a supervised classifier, to offer benchmark outcomes for forthcoming works. We then applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results show that the classifier performs better on AraCust than on ASTD, reaching 91% accuracy and an 89% average F1 score (F1avg). The AraCust corpus will be released, together with code useful for its exploration, via GitHub as a part of this submission.
first_indexed	2024-12-20T06:28:00Z
format	Article
id	doaj.art-45f0000b1e9c4080b565e70f7b091661
institution	Directory Open Access Journal
issn	2376-5992
language	English
last_indexed	2024-12-20T06:28:00Z
publishDate	2021-05-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ Computer Science
spelling	doaj.art-45f0000b1e9c4080b565e70f7b091661; 2022-12-21T19:50:14Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2021-05-01; 7; e510; 10.7717/peerj-cs.510; AraCust: a Saudi Telecom Tweets corpus for sentiment analysis; Latifah Almuqren (Department of Computer Science, Durham University, Durham, United Kingdom); Alexandra Cristea (Department of Computer Science, Durham University, Durham, United Kingdom); https://peerj.com/articles/cs-510.pdf; Sentiment analysis; Arabic; Gold Standard Corpus; Supervised approach
spellingShingle	Latifah Almuqren Alexandra Cristea AraCust: a Saudi Telecom Tweets corpus for sentiment analysis PeerJ Computer Science Sentiment analysis Arabic Gold Standard Corpus Supervised approach
title	AraCust: a Saudi Telecom Tweets corpus for sentiment analysis
title_sort	aracust a saudi telecom tweets corpus for sentiment analysis
topic	Sentiment analysis; Arabic; Gold Standard Corpus; Supervised approach
url	https://peerj.com/articles/cs-510.pdf
work_keys_str_mv	AT latifahalmuqren aracustasauditelecomtweetscorpusforsentimentanalysis AT alexandracristea aracustasauditelecomtweetscorpusforsentimentanalysis

Similar Items