Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark

Abstract
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.

Bibliographic Details
Main Authors: Nouha Dziri (University of Alberta, Canada), Hannah Rashkin (Google Research, USA), Tal Linzen (Google Research, USA), David Reitter (Google Research, USA)
Format: Article
Language: English
Published: The MIT Press, 2022-01-01
Series: Transactions of the Association for Computational Linguistics, volume 10, pages 1066-1083
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00506
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00506/113023/Evaluating-Attribution-in-Dialogue-Systems-The
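
The benchmark described in the abstract is distributed via the GitHub link given there. As a minimal sketch (not part of the catalog record), the snippet below shows how one might load the human attribution annotations and tally the label distribution; the file name and column names are assumptions for illustration only and should be checked against the repository's README.

    # Minimal sketch: load BEGIN annotations and summarize the attribution labels.
    # Assumptions (hypothetical, not taken from the record): the repository ships a
    # tab-separated file, here called "begin_dev.tsv", with one annotated dialogue
    # turn per row, including a "response" column and a "begin_label" column that
    # holds the human attribution judgment.
    import pandas as pd

    df = pd.read_csv("begin_dev.tsv", sep="\t")             # hypothetical file name
    print(df.columns.tolist())                              # inspect the actual schema
    print(df["begin_label"].value_counts(normalize=True))   # share of each label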