SummEval: Re-evaluating Summarization Evaluation

Abstract: The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions:...

Bibliographic Details
Main Authors:	Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
Format:	Article
Language:	English
Published:	The MIT Press 2021-01-01
Series:	Transactions of the Association for Computational Linguistics
Online Access:	https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation

_version_	1818261811441434624
author	Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
author_facet	Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
author_sort	Alexander R. Fabbri
collection	DOAJ
description	Abstract: The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/DailyMail dataset annotated by both expert judges and crowd-sourced workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.
first_indexed	2024-12-12T18:53:10Z
format	Article
id	doaj.art-3a815d182a8440e1ac02f25f1d9da002
institution	Directory of Open Access Journals
issn	2307-387X
language	English
last_indexed	2024-12-12T18:53:10Z
publishDate	2021-01-01
publisher	The MIT Press
record_format	Article
series	Transactions of the Association for Computational Linguistics
spelling	doaj.art-3a815d182a8440e1ac02f25f1d9da002; 2022-12-22T00:15:19Z; eng; The MIT Press; Transactions of the Association for Computational Linguistics; 2307-387X; 2021-01-01; vol. 9, pp. 391-409; 10.1162/tacl_a_00373; SummEval: Re-evaluating Summarization Evaluation; Alexander R. Fabbri (Yale University, United States. alexander.fabbri@yale.edu); Wojciech Kryściński (Salesforce Research, United States. kryscinski@salesforce.com); Bryan McCann (Salesforce Research, United States. bryan.mccann.is@gmail.com); Caiming Xiong (Salesforce Research, United States. cxiong@salesforce.com); Richard Socher (Salesforce Research, United States. richard@socher.org); Dragomir Radev (Yale University, United States); https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation
spellingShingle	Alexander R. Fabbri; Wojciech Kryściński; Bryan McCann; Caiming Xiong; Richard Socher; Dragomir Radev; SummEval: Re-evaluating Summarization Evaluation; Transactions of the Association for Computational Linguistics
title	SummEval: Re-evaluating Summarization Evaluation
title_full	SummEval: Re-evaluating Summarization Evaluation
title_fullStr	SummEval: Re-evaluating Summarization Evaluation
title_full_unstemmed	SummEval: Re-evaluating Summarization Evaluation
title_short	SummEval: Re-evaluating Summarization Evaluation
title_sort	summeval re evaluating summarization evaluation
url	https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation
work_keys_str_mv	AT alexanderrfabbri summevalreevaluatingsummarizationevaluation AT wojciechkryscinski summevalreevaluatingsummarizationevaluation AT bryanmccann summevalreevaluatingsummarizationevaluation AT caimingxiong summevalreevaluatingsummarizationevaluation AT richardsocher summevalreevaluatingsummarizationevaluation AT dragomirradev summevalreevaluatingsummarizationevaluation

Similar Items