HateCheck: functional tests for hate speech detection models

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.

Bibliographic Details
Main Authors: Röttger, P, Vidgen, B, Dong, N, Waseem, Z, Margetts, H, Pierrehumbert, JB
Format: Conference item
Language: English
Published: Association for Computational Linguistics 2021
Institution: University of Oxford
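To make the abstract's evaluation idea concrete, here is a minimal sketch, not the authors' released code, of how a HateCheck-style functional evaluation might be run: score a classifier on labelled test cases grouped by functionality and report per-functionality accuracy. The CSV file name, the column names (functionality, test_case, label_gold), and the model identifier are assumptions made for illustration.

```python
# Sketch of a HateCheck-style functional evaluation (assumptions noted below).
import pandas as pd
from transformers import pipeline

# Hypothetical local copy of the test suite, assumed to have columns
# "functionality", "test_case", and "label_gold".
cases = pd.read_csv("hatecheck_test_cases.csv")

# Any binary hate / non-hate classifier can be slotted in here;
# the model name is purely illustrative.
classifier = pipeline("text-classification", model="some-hate-speech-model")

def predict(text: str) -> str:
    """Map the classifier's raw label onto a 'hateful' / 'non-hateful' scheme."""
    label = classifier(text)[0]["label"]
    return "hateful" if label.lower() in {"hate", "hateful", "label_1"} else "non-hateful"

cases["pred"] = cases["test_case"].apply(predict)
cases["correct"] = cases["pred"] == cases["label_gold"]

# Per-functionality accuracy surfaces specific weaknesses that a single
# held-out-test-set F1 score would hide.
print(cases.groupby("functionality")["correct"].mean().sort_values())
```

The grouping step is the point of the exercise: instead of one aggregate metric, each of the 29 functionalities gets its own score, so systematic failure modes show up directly.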