The assessment of fundus image quality labeling reliability among graders with different backgrounds.

Purpose: For the training of machine learning (ML) algorithms, correctly labeled ground-truth data are indispensable. In this pilot study, we assessed the performance of graders with different backgrounds in labeling retinal fundus image quality.

Methods: Color fundus photographs were labeled with a Python-based tool using four image-quality categories: excellent (E), good (G), adequate (A), and insufficient for grading (I). We enrolled 8 subjects (4 with and 4 without a medical background, groups M and NM, respectively), to whom a tutorial on image quality requirements was presented. We randomly selected 200 images from a pool of 18,145 expert-labeled images (50 E, 50 G, 50 A, 50 I). Grading was timed and agreement was assessed. An additional grading round was performed with 14 labels for a more objective analysis.
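The study's labeling tool itself is not included in this record. As a rough illustration only, the sketch below shows one way such a Python-based grader could be built with matplotlib: each image is displayed, a keystroke (e/g/a/i) records the category, and the labels are written to a CSV file. The function name grade_images, the keybindings, and the file layout are assumptions, not the authors' implementation.

# Hypothetical, minimal fundus-image grader: shows each image,
# waits for a keystroke (e/g/a/i), and saves the labels to CSV.
import csv
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

CATEGORIES = {"e": "excellent", "g": "good",
              "a": "adequate", "i": "insufficient"}

# 'g' toggles the grid in matplotlib's default keymap; free it up.
plt.rcParams["keymap.grid"] = []

def grade_images(image_dir, out_csv):
    labels = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        fig, ax = plt.subplots()
        ax.imshow(mpimg.imread(path))
        ax.set_title(f"{path.name}: press e/g/a/i")
        ax.axis("off")
        choice = {}

        def on_key(event):
            if event.key in CATEGORIES:
                choice["label"] = CATEGORIES[event.key]
                plt.close(fig)

        fig.canvas.mpl_connect("key_press_event", on_key)
        plt.show()  # blocks until a valid keystroke closes the window
        labels.append((path.name, choice.get("label", "skipped")))
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows([("image", "label"), *labels])

grade_images("fundus_images", "labels.csv")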

Results: The median time (interquartile range) for the labeling task with four categories was 987.8 sec (418.6) for all graders, and 872.9 sec (621.0) vs. 1019.8 sec (479.5) in the M and NM groups, respectively. Cohen's weighted kappa showed moderate agreement (0.564) with four categories, which increased to substantial agreement (0.637) with three categories after merging E and G. With 14 labels, the weighted kappa values were 0.594 and 0.667 for four and three categories, respectively.

Conclusion: Image grading with a Python-based tool appears to be a simple yet potentially efficient solution for labeling fundus images by quality, one that does not necessarily require a medical background. Such grading is subject to variability but could still serve the robust identification of images of insufficient quality. This underscores the opportunity to democratize ML applications among people with both medical and non-medical backgrounds. However, the simplicity of the grading system is key to successful categorization.
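As a minimal sketch of the agreement analysis, Cohen's weighted kappa can be computed with scikit-learn's cohen_kappa_score. The ratings below are placeholder values, the ordinal encoding (I=0, A=1, G=2, E=3) is assumed, and the abstract does not state whether linear or quadratic weights were used, so linear weights are an assumption here.

from sklearn.metrics import cohen_kappa_score

# Placeholder ratings for two graders; the ordinal encoding
# I=0, A=1, G=2, E=3 is an assumption, not from the paper.
grader_m  = [3, 2, 1, 0, 2, 3, 1, 0, 2, 1]
grader_nm = [2, 2, 1, 0, 3, 3, 0, 0, 1, 1]

# Weighted kappa on the four-category ordinal scale. Linear weights
# are assumed; the abstract does not specify the weighting scheme.
kappa_4 = cohen_kappa_score(grader_m, grader_nm, weights="linear")

# Merge E (3) and G (2) into a single category to reproduce the
# three-category analysis (E+G, A, I).
def merge_eg(labels):
    return [min(v, 2) for v in labels]

kappa_3 = cohen_kappa_score(merge_eg(grader_m), merge_eg(grader_nm),
                            weights="linear")

print(f"4 categories: {kappa_4:.3f}")
print(f"3 categories (E+G merged): {kappa_3:.3f}")

Collapsing E and G before computing kappa mirrors the paper's finding that a simpler, three-category scheme yields higher inter-grader agreement.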

Bibliographic Details
Main Authors: Kornélia Lenke Laurik-Feuerstein, Rishav Sapahia, Delia Cabrera DeBuc, Gábor Márk Somfai
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2022-01-01
Series: PLoS ONE
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0271156