Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
Abstract: Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its...
Main Authors: | Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter |
---|---|
Format: | Article |
Language: | English |
Published: | The MIT Press, 2022-01-01 |
Series: | Transactions of the Association for Computational Linguistics |
Online Access: | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00506/113023/Evaluating-Attribution-in-Dialogue-Systems-The |
_version_ | 1811244665540706304 |
---|---|
author | Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter |
author_facet | Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter |
author_sort | Nouha Dziri |
collection | DOAJ |
description | Abstract: Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset. |
first_indexed | 2024-04-12T14:29:13Z |
format | Article |
id | doaj.art-f1dfe216285941edacd890f813e1ae44 |
institution | Directory Open Access Journal |
issn | 2307-387X |
language | English |
last_indexed | 2024-04-12T14:29:13Z |
publishDate | 2022-01-01 |
publisher | The MIT Press |
record_format | Article |
series | Transactions of the Association for Computational Linguistics |
spelling | doaj.art-f1dfe216285941edacd890f813e1ae44; 2022-12-22T03:29:22Z; eng; The MIT Press; Transactions of the Association for Computational Linguistics; 2307-387X; 2022-01-01; vol. 10, pp. 1066-1083; 10.1162/tacl_a_00506; Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark; Nouha Dziri (University of Alberta, Canada. dziri@cs.ualberta.ca); Hannah Rashkin (Google Research, USA. hrashkin@google.com); Tal Linzen (Google Research, USA. linzen@google.com); David Reitter (Google Research, USA. reitter@google.com) |
spellingShingle | Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter; Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark; Transactions of the Association for Computational Linguistics |
title | Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark |
title_full | Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark |
title_fullStr | Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark |
title_full_unstemmed | Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark |
title_short | Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark |
title_sort | evaluating attribution in dialogue systems the begin benchmark |
url | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00506/113023/Evaluating-Attribution-in-Dialogue-Systems-The |
work_keys_str_mv | AT nouhadziri evaluatingattributionindialoguesystemsthebeginbenchmark AT hannahrashkin evaluatingattributionindialoguesystemsthebeginbenchmark AT tallinzen evaluatingattributionindialoguesystemsthebeginbenchmark AT davidreitter evaluatingattributionindialoguesystemsthebeginbenchmark |