The assessment of fundus image quality labeling reliability among graders with different backgrounds.
Purpose: For the training of machine learning (ML) algorithms, correctly labeled ground-truth data are indispensable. In this pilot study, we assessed the performance of graders with different backgrounds in labeling retinal fundus image quality.

Methods: Color fundus photographs were labeled with a Python-based tool into four image-quality categories: excellent (E), good (G), adequate (A), and insufficient for grading (I). We enrolled 8 subjects (4 with and 4 without a medical background, groups M and NM, respectively), to whom a tutorial on image quality requirements was presented. We randomly selected 200 images from a pool of 18,145 expert-labeled images (50 per category). The grading was timed and inter-grader agreement was assessed. An additional grading round was performed with 14 labels for a more objective analysis.

Results: The median (interquartile range) time for the labeling task with 4 categories was 987.8 s (418.6) for all graders, and 872.9 s (621.0) vs. 1019.8 s (479.5) in the M and NM groups, respectively. Cohen's weighted kappa showed moderate agreement (0.564) with four categories, which increased to substantial (0.637) with only three categories, obtained by merging E and G. With 14 labels, the weighted kappa values were 0.594 and 0.667 for four and three categories, respectively.

Conclusion: Image grading with a Python-based tool appears to be a simple yet potentially efficient solution for labeling fundus images by quality, and it does not necessarily require a medical background. Such grading is subject to variability but can still serve the robust identification of images of insufficient quality. This highlights the opportunity to democratize ML applications among persons with and without a medical background. However, the simplicity of the grading system is key to successful categorization.
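The study's labeling tool itself is not included in this record; the sketch below illustrates what such a Python-based grader might look like, assuming matplotlib for display and the paper's four categories. The directory layout, key bindings, and CSV output format are hypothetical, not taken from the paper.

```python
# Minimal sketch of a Python-based fundus image quality grader,
# using the paper's four categories: excellent (E), good (G),
# adequate (A), insufficient for grading (I).
# Paths, key bindings, and output format are hypothetical.
import csv
import sys
from pathlib import Path

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

CATEGORIES = {"e": "excellent", "g": "good", "a": "adequate", "i": "insufficient"}

def label_images(image_dir: str, out_csv: str) -> None:
    """Display each image in turn and record one label per key press."""
    plt.rcParams["keymap.grid"] = []  # free up 'g' (grid toggle by default)
    images = sorted(Path(image_dir).glob("*.jpg"))
    labels: dict[str, str] = {}
    idx = 0

    fig, ax = plt.subplots()

    def show(i: int) -> None:
        ax.clear()
        ax.imshow(mpimg.imread(images[i]))
        ax.set_title(f"{images[i].name} ({i + 1}/{len(images)}) keys: e/g/a/i")
        ax.axis("off")
        fig.canvas.draw_idle()

    def on_key(event) -> None:
        nonlocal idx
        if event.key in CATEGORIES:
            labels[images[idx].name] = CATEGORIES[event.key]
            idx += 1
            if idx == len(images):
                plt.close(fig)
            else:
                show(idx)

    fig.canvas.mpl_connect("key_press_event", on_key)
    show(0)
    plt.show()

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "label"])
        writer.writerows(sorted(labels.items()))

if __name__ == "__main__":
    label_images(sys.argv[1], sys.argv[2])
```

The agreement figures in the Results (weighted kappa rising from 0.564 to 0.637 after merging E and G) correspond to a standard Cohen's weighted kappa computation, which can be sketched with scikit-learn. The paper does not specify the weighting scheme; linear weights are assumed here, and the label lists below are illustrative placeholders, not study data.

```python
# Sketch of the agreement analysis: Cohen's weighted kappa on four
# ordinal categories, then on three after merging E and G, as in the
# paper's three-category analysis. Linear weights are an assumption;
# the label lists are illustrative placeholders, not study data.
from sklearn.metrics import cohen_kappa_score

# Ordinal encoding: excellent=0, good=1, adequate=2, insufficient=3.
grader = [0, 1, 1, 2, 3, 0, 2, 3, 1, 0]
expert = [0, 0, 1, 2, 3, 1, 2, 2, 1, 0]

kappa4 = cohen_kappa_score(grader, expert, weights="linear")

def merge_e_g(y):
    """Collapse excellent (0) and good (1) into a single category."""
    return [0 if v <= 1 else v - 1 for v in y]

kappa3 = cohen_kappa_score(merge_e_g(grader), merge_e_g(expert), weights="linear")

print(f"weighted kappa, 4 categories: {kappa4:.3f}")
print(f"weighted kappa, 3 categories (E+G merged): {kappa3:.3f}")
```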
Main Authors: | Kornélia Lenke Laurik-Feuerstein, Rishav Sapahia, Delia Cabrera DeBuc, Gábor Márk Somfai |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2022-01-01 |
Series: | PLoS ONE |
ISSN: | 1932-6203 |
Online Access: | https://doi.org/10.1371/journal.pone.0271156 |