Evaluating Machine Learning Techniques for Detecting Offensive and Hate Speech in South African Tweets

In recent times, South Africa has witnessed an upsurge of offensive and hate speech along racial and ethnic lines on Twitter. English is popular among the South African languages used. Although machine learning has been used successfully to detect offensive and hate speech in several E...

Bibliographic Details
Main Authors: Oluwafemi Oriola, Eduan Kotze
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Machine learning; South Africa; Twitter; hate speech; offensive speech
Online Access: https://ieeexplore.ieee.org/document/8963960/
_version_ 1818875658622730240
author Oluwafemi Oriola
Eduan Kotze
author_facet Oluwafemi Oriola
Eduan Kotze
author_sort Oluwafemi Oriola
collection DOAJ
description In recent times, South Africa has witnessed an upsurge of offensive and hate speech along racial and ethnic lines on Twitter. English is popular among the South African languages used. Although machine learning has been used successfully to detect offensive and hate speech in several English contexts, the distinctiveness of South African tweets and the similarities among offensive, hate and free speech require a domain-specific English corpus and techniques to detect offensive and hate speech. Thus, we developed an English corpus from South African tweets and evaluated different machine learning techniques to detect offensive and hate speech. Character n-gram, word n-gram, negative sentiment and syntactic-based features, and their hybrid, were extracted and analyzed using hyper-parameter optimization, ensemble and multi-tier meta-learning models of support vector machine, logistic regression, random forest and gradient boosting algorithms. The results showed that the optimized support vector machine with character n-grams performed best in detecting hate speech, with a true positive rate of 0.894, while optimized gradient boosting with word n-grams performed best in detecting hate speech, with a true positive rate of 0.867. However, their performance in detecting the other threatening classes was poor. Multi-tier meta-learning models achieved the most consistent and balanced classification performance, with true positive rates of 0.858 and 0.887 for hate speech and offensive speech, respectively, a true positive rate of 0.646 for free speech and an overall accuracy of 0.671. The error analysis showed that the multi-tier meta-learning model could reduce the misclassification error rate of the optimized models by 34.26%.
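The description reports per-class true positive rates and an overall accuracy, and names character n-grams as one feature type. The following is a minimal illustrative sketch of how such quantities can be computed, not the authors' implementation. The function names, the n-gram length of 3 and the toy labels are assumptions made for illustration only and do not come from the article or its corpus.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string (n=3 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def per_class_tpr(y_true, y_pred):
    """True positive rate per class: correctly predicted items / actual items of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def accuracy(y_true, y_pred):
    """Overall accuracy: share of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, NOT data from the study.
y_true = ["hate", "hate", "offensive", "offensive", "free", "free", "free", "free"]
y_pred = ["hate", "offensive", "offensive", "offensive", "free", "hate", "free", "free"]

tpr = per_class_tpr(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Reporting a true positive rate for each class alongside accuracy, as the description does, shows whether a model favours one class at the expense of the others, which a single accuracy figure would hide.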
first_indexed 2024-12-19T13:30:00Z
format Article
id doaj.art-2e4b5c71b94c468588c5f9b440bd21a8
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-19T13:30:00Z
publishDate 2020-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-2e4b5c71b94c468588c5f9b440bd21a8; 2022-12-21T20:19:26Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 21496; 21509; 10.1109/ACCESS.2020.2968173; 8963960; Evaluating Machine Learning Techniques for Detecting Offensive and Hate Speech in South African Tweets; Oluwafemi Oriola (0), https://orcid.org/0000-0003-0255-6160; Eduan Kotze (1); Department of Computer Science and Informatics, University of the Free State, Bloemfontein, South Africa; Department of Computer Science and Informatics, University of the Free State, Bloemfontein, South Africa; https://ieeexplore.ieee.org/document/8963960/; Machine learning; South Africa; Twitter; hate speech; offensive speech
spellingShingle Oluwafemi Oriola
Eduan Kotze
Evaluating Machine Learning Techniques for Detecting Offensive and Hate Speech in South African Tweets
IEEE Access
Machine learning
South Africa
Twitter
hate speech
offensive speech
title Evaluating Machine Learning Techniques for Detecting Offensive and Hate Speech in South African Tweets
topic Machine learning
South Africa
Twitter
hate speech
offensive speech
url https://ieeexplore.ieee.org/document/8963960/
work_keys_str_mv AT oluwafemioriola evaluatingmachinelearningtechniquesfordetectingoffensiveandhatespeechinsouthafricantweets
AT eduankotze evaluatingmachinelearningtechniquesfordetectingoffensiveandhatespeechinsouthafricantweets