Reliability in evaluator-based tests: using simulation-constructed models to determine contextually relevant agreement thresholds

Abstract

Background: Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary “rules of thumb” or simply not addressed at all. Our goal for this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability.

Methods: Three expert human evaluators completed a video analysis task, and their results were averaged together to create a reference dataset of 300 time measurements. We simulated unique combinations of systematic error and random error onto the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic and random errors made by the hypothetical evaluator population were approximated as the mean and variance of a normally distributed error signal. Calculating the error (using percent error) and inter-evaluator agreement (using Krippendorff’s alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and value envelope of the worst possible percent error for any given amount of agreement.

Results: We used the relationship between inter-evaluator agreement and error to make an informed judgment of an acceptable threshold for Krippendorff’s alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff’s alpha between the reference dataset and a new cohort of trained human evaluators and used our contextually derived Krippendorff’s alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared to the rule of thumb (0.8), our agreement threshold permitted evaluators with low error while rejecting one evaluator with relatively high error.

Conclusions: We found that our approach established threshold values of reliability, within the context of our evaluation criteria, that were far less permissive than the typically accepted “rule of thumb” cutoff for Krippendorff’s alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.
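The simulation described in the Methods can be sketched compactly. The code below is not the authors' code; it is a minimal illustration under stated assumptions: a synthetic stand-in for the 300-measurement reference dataset, a 70 × 70 grid of systematic-error (mean) and random-error (spread) values to produce the 4900 hypothetical evaluators, mean absolute percent error as the "percent error" measure, and a simplified two-rater, no-missing-data implementation of Krippendorff's alpha for interval data. The error ranges and the uniform reference values are illustrative only.

```python
import numpy as np

def krippendorff_alpha_interval(a, b):
    """Krippendorff's alpha for two raters, interval data, no missing values."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    d_obs = np.mean((a - b) ** 2)                     # observed disagreement within units
    pooled = np.concatenate([a, b])
    n = pooled.size
    d_exp = np.sum((pooled[:, None] - pooled[None, :]) ** 2) / (n * (n - 1))  # expected disagreement
    return 1.0 - d_obs / d_exp                        # alpha = 1 - D_o / D_e

rng = np.random.default_rng(0)
reference = rng.uniform(1.0, 10.0, size=300)          # stand-in for the 300 reference time measurements

results = []
for bias in np.linspace(-2.0, 2.0, 70):               # systematic error: mean of the error signal
    for spread in np.linspace(0.0, 2.0, 70):          # random error: spread of the error signal
        hypothetical = reference + rng.normal(bias, spread, size=reference.size)
        pct_err = 100.0 * np.mean(np.abs(hypothetical - reference) / reference)
        alpha = krippendorff_alpha_interval(reference, hypothetical)
        results.append((alpha, pct_err))

# "Envelope": worst percent error observed at or above a candidate agreement threshold.
for threshold in (0.80, 0.90, 0.95):
    worst = max(err for a, err in results if a >= threshold)
    print(f"alpha >= {threshold:.2f}: worst-case percent error = {worst:.1f}%")
```

Where the paper fits a mathematical model to the agreement/error relationship, this sketch simply reads the worst-case error off the simulated grid for a few candidate cutoffs, which is enough to show how a contextually derived threshold can end up far stricter than the conventional 0.8 rule of thumb.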

Bibliographic Details
Main Authors: Dylan T. Beckler, Zachary C. Thumser, Jonathon S. Schofield, Paul D. Marasco
Format: Article
Language: English
Published: BMC, 2018-11-01
Series: BMC Medical Research Methodology
ISSN: 1471-2288
DOI: 10.1186/s12874-018-0606-7
Author Affiliation: Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic
Subjects: Inter-rater; Inter-evaluator; Reliability; Agreement; Krippendorff’s alpha; Index of reliability
Online Access: http://link.springer.com/article/10.1186/s12874-018-0606-7