Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales


Bibliographic Details
Main Authors: Marco Ferrante, Nicola Ferro, Norbert Fuhr
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Information retrieval; measurement; software metrics; statistical analysis; experimental evaluation; retrieval effectiveness
Online Access:https://ieeexplore.ieee.org/document/9552926/
collection DOAJ
description Information Retrieval (IR) is a discipline deeply rooted in evaluation since its inception. Indeed, experimentally measuring and statistically validating the performance of IR systems are the only possible ways to compare systems and understand which are better than others and, ultimately, more effective and useful for end-users. Since the seminal paper by Stevens (1946), it has been known that the properties of a measurement scale determine the operations you should or should not perform with values from that scale. For example, Stevens suggested that you can compute means and variances only when you are working with, at least, interval scales. It was recently shown that the most popular evaluation measures in IR are not interval-scaled. However, so far, there has been little or no investigation in IR of the impact and consequences of departing from scale assumptions. Taken to the extreme, this might even mean that decades of experimental IR research used potentially improper methods, which may have produced results needing further validation. However, it was unclear whether and to what extent these findings apply to actual evaluations; this opened a debate in the community, with researchers taking opposing positions about whether, and to what extent, this should be considered an issue. In this paper, we first give an introduction to representational measurement theory, explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of <italic>meaningfulness</italic>, specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic.
Then we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between the original measures and the interval-scaled ones. For all the measures considered – namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision, and Reciprocal Rank – we observe substantial effects, both on the ordering of average values and on the outcomes of significance tests. For the latter, some previously significant differences turn out to be insignificant, while some insignificant ones become significant. The effect varies considerably across the tests considered, but on average we observed a 25% change in the decisions about which systems are significantly different and which are not. These experimental findings further support the idea that measurement scales matter and that departing from their assumptions has an impact. This not only suggests that, to the extent possible, it is better to comply with such assumptions, but also urges us to clearly indicate when we depart from them and to carefully point out the limitations of the conclusions we draw and the conditions under which they are drawn.
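As a concrete illustration of some of the measures named in the description, and of why non-interval scales can be treacherous, the following sketch uses standard textbook definitions of Average Precision, Rank-Biased Precision, and Reciprocal Rank (it is not the paper's proposed interval-scale mapping; the systems, topics, relevance judgments, and transformation are all invented for illustration). It shows that a strictly monotone, i.e. order-preserving, transformation of per-topic scores can reverse the ordering of per-system mean scores:

```python
# Illustrative sketch only: textbook definitions of a few evaluation
# measures, plus a toy demonstration that a monotone transformation of a
# measure can flip the order of mean scores across systems.
# All systems, topics, and relevance judgments below are invented.

def average_precision(rels, recall_base):
    """AP: sum of precision at each relevant rank, over the recall base."""
    score, hits = 0.0, 0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i          # precision at rank i
    return score / recall_base

def rbp(rels, p=0.8):
    """Rank-Biased Precision with persistence parameter p."""
    return (1 - p) * sum(r * p ** (i - 1) for i, r in enumerate(rels, start=1))

def reciprocal_rank(rels):
    """1 / rank of the first relevant document (0 if none retrieved)."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

# Two invented systems over two topics; 1 = relevant, 0 = not relevant.
# Each topic is assumed to have a recall base of 2 relevant documents.
sys_a = [[0, 1, 1, 0], [0, 1, 1, 0]]   # consistently middling
sys_b = [[0, 0, 0, 0], [1, 1, 0, 0]]   # one failure, one perfect topic

mean = lambda xs: sum(xs) / len(xs)
mean_a = mean([average_precision(r, 2) for r in sys_a])   # 7/12, about 0.583
mean_b = mean([average_precision(r, 2) for r in sys_b])   # 0.5

# A strictly monotone transformation: it preserves the order of every pair
# of individual scores, yet the order of the *means* flips.
f = lambda x: x ** 4
t_mean_a = mean([f(average_precision(r, 2)) for r in sys_a])  # about 0.116
t_mean_b = mean([f(average_precision(r, 2)) for r in sys_b])  # 0.5

print(mean_a > mean_b)      # True: A beats B on raw mean AP
print(t_mean_a > t_mean_b)  # False: the transformed means disagree
```

The flip is the crux of the meaningfulness argument summarized above: on a merely ordinal scale, any monotone transformation is permissible, so the statement "system A has a higher mean than system B" is not invariant and hence not meaningful.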
id doaj.art-3da290b3e33246528404cb44be2d10ae
institution Directory Open Access Journal
issn 2169-3536
doi 10.1109/ACCESS.2021.3116857
volume 9
pages 136182–136216
affiliation Marco Ferrante (ORCID 0000-0002-0894-4175): Department of Mathematics “Tullio Levi-Civita”, University of Padua, Padua, Italy
affiliation Nicola Ferro (ORCID 0000-0001-9219-6239): Department of Information Engineering, University of Padua, Padua, Italy
affiliation Norbert Fuhr (ORCID 0000-0002-0441-6949): Faculty of Engineering, University of Duisburg–Essen, Duisburg, Germany
topic Information retrieval
measurement
software metrics
statistical analysis
experimental evaluation
retrieval effectiveness