SummEval: Re-evaluating Summarization Evaluation

Abstract
The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/DailyMail dataset, annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.
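The abstract mentions a toolkit that exposes an extensible, unified API for scoring summarization models with many automatic metrics. As a rough illustration of that idea only (the class and method names below are hypothetical and are not the SummEval package's actual interface), here is a minimal Python sketch of such a unified metric API:

```python
# Illustrative sketch only: a unified interface for automatic summarization
# metrics, in the spirit of the toolkit described in the abstract. All class
# and method names here are hypothetical, not the SummEval package's API.
from abc import ABC, abstractmethod
from typing import Dict, List


class SummarizationMetric(ABC):
    """Common interface so every metric can be scored the same way."""

    @abstractmethod
    def evaluate_example(self, summary: str, reference: str) -> Dict[str, float]:
        ...

    def evaluate_batch(self, summaries: List[str], references: List[str]) -> List[Dict[str, float]]:
        # Default batch behavior: score each (summary, reference) pair in turn.
        return [self.evaluate_example(s, r) for s, r in zip(summaries, references)]


class UnigramOverlapMetric(SummarizationMetric):
    """Toy lexical-overlap metric (a stand-in for ROUGE-style measures)."""

    def evaluate_example(self, summary: str, reference: str) -> Dict[str, float]:
        summ_tokens = set(summary.lower().split())
        ref_tokens = set(reference.lower().split())
        overlap = len(summ_tokens & ref_tokens)
        recall = overlap / len(ref_tokens) if ref_tokens else 0.0
        return {"unigram_recall": recall}


if __name__ == "__main__":
    metric = UnigramOverlapMetric()
    print(metric.evaluate_batch(
        ["the cat sat on the mat"],
        ["a cat was sitting on the mat"],
    ))  # e.g. [{'unigram_recall': 0.571...}]
```

The point of the shared base class is that lexical metrics, model-based metrics, and any newly added metric can all be run through the same evaluate_batch call, which is what makes consistent benchmarking of many systems practical.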

Bibliographic Details
Main Authors: Alexander R. Fabbri (Yale University), Wojciech Kryściński (Salesforce Research), Bryan McCann (Salesforce Research), Caiming Xiong (Salesforce Research), Richard Socher (Salesforce Research), Dragomir Radev (Yale University)
Format: Article
Language: English
Published: The MIT Press, 2021-01-01
Series: Transactions of the Association for Computational Linguistics
Volume/Pages: 9, 391–409
DOI: 10.1162/tacl_a_00373
ISSN: 2307-387X
Collection: DOAJ (Directory of Open Access Journals)
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation
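The paper's central meta-evaluation question is how strongly each automatic metric correlates with human judgments. As a toy illustration of that computation only (the scores below are invented, not SummEval data, and the quality dimension named is just an example), one might compute rank and linear correlations like this:

```python
# Illustrative sketch only: measuring how well an automatic metric tracks
# human judgments, as in the meta-evaluation the abstract describes.
# The numbers below are made-up toy values, not SummEval data.
from scipy.stats import kendalltau, pearsonr

# Hypothetical per-summary scores from one automatic metric and one
# human-rated quality dimension (e.g., consistency), aligned by summary.
metric_scores = [0.42, 0.55, 0.31, 0.68, 0.49]
human_ratings = [3.0, 4.0, 2.0, 5.0, 3.5]

tau, tau_p = kendalltau(metric_scores, human_ratings)
r, r_p = pearsonr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f} (p={tau_p:.3f}), Pearson r = {r:.3f} (p={r_p:.3f})")
```

In practice such correlations are computed per quality dimension over many system outputs, separately for expert and crowd-sourced annotations, which is what allows the paper to compare the 14 metrics against each other.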