Threatening Language Detection and Target Identification in Urdu Tweets

Bibliographic Details
Main Authors: Maaz Amjad, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, Alexander Gelbukh
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9536729/
Description
Summary: Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based, where the text is represented using either character $n$-gram counts or word $n$-gram counts as feature vectors, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with the combination of word $n$-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
ISSN: 2169-3536
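
Illustration: The summary mentions count-based character and word n-gram representations and a two-step classification scheme. The following is a minimal Python sketch of those two ideas, not the authors' implementation. The function names, the whitespace tokenization, the label strings, and the stand-in classifiers passed to `two_step_classify` are all assumptions made for illustration; the paper's actual preprocessing, feature weighting, and models (MLP, SVM) are not reproduced here.

```python
from collections import Counter
from typing import Callable


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (empty if the text is shorter than n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams; tokens are split on whitespace (an assumption)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def two_step_classify(
    tweet: str,
    is_threatening: Callable[[str], bool],
    target_of: Callable[[str], str],
) -> str:
    """Step (i): threatening vs. non-threatening.
    Step (ii): only for threatening tweets, decide individual vs. group.
    Both classifiers are hypothetical callables supplied by the caller."""
    if not is_threatening(tweet):
        return "non-threatening"
    return "threatening/" + target_of(tweet)
```

In practice the two callables would wrap trained models operating on the n-gram counts or on word embeddings; the point of the sketch is only that the target classifier is consulted exclusively for tweets that the first step flags as threatening.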