Characteristics of a Large, Labeled Data Set for the Training of Artificial Intelligence for Glaucoma Screening with Fundus Photographs

Bibliographic Details
Main Authors: Hans G. Lemij, MD, PhD, Coen de Vente, MSc, Clara I. Sánchez, PhD, Koen A. Vermeer, PhD
Format: Article
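Language: English
Published: Elsevier, 2023-09-01
Series: Ophthalmology Science
Subjects: Artificial intelligence; Clinical features; color fundus photographs; glaucoma screening; labeled data set
Online Access: http://www.sciencedirect.com/science/article/pii/S2666914523000325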
author Hans G. Lemij, MD, PhD
Coen de Vente, MSc
Clara I. Sánchez, PhD
Koen A. Vermeer, PhD
collection DOAJ
description Purpose: Significant visual impairment due to glaucoma is largely caused by the disease being detected too late.
Objective: To build a labeled data set for training artificial intelligence (AI) algorithms for glaucoma screening by fundus photography, to assess the accuracy of the graders, and to characterize the features of all eyes with referable glaucoma (RG).
Design: Cross-sectional study.
Subjects: Color fundus photographs (CFPs) of 113 893 eyes of 60 357 individuals were obtained from EyePACS, California, United States, from a population screening program for diabetic retinopathy.
Methods: Carefully selected graders (ophthalmologists and optometrists) graded the images. To qualify, they had to pass the European Optic Disc Assessment Trial optic disc assessment with ≥ 85% accuracy and 92% specificity. Of 90 candidates, 30 passed. Each image of the EyePACS set was then scored by varying random pairs of graders as “RG,” “no referable glaucoma (NRG),” or “ungradable (UG).” In case of disagreement, a glaucoma specialist made the final grading. Referable glaucoma was scored if visual field damage was expected. In case of RG, graders were instructed to mark up to 10 relevant glaucomatous features.
Main Outcome Measures: Qualitative features in eyes with RG.
Results: The performance of each grader was monitored, with the final grade serving as reference; if a grader's sensitivity and specificity dropped below 80% and 95%, respectively, that grader exited the study and their gradings were redone by other graders. In all, 20 graders qualified; their mean (standard deviation [SD]) sensitivity and specificity were 85.6% (5.7) and 96.1% (2.8), respectively. The 2 graders in each pair agreed on 92.45% of the images (Gwet’s AC2, expressing the inter-rater reliability, was 0.917). Across all gradings, the sensitivity and specificity (95% confidence interval) were 86.0% (85.2%–86.7%) and 96.4% (96.3%–96.5%), respectively. Of all gradable eyes (n = 111 183; 97.62%), the prevalence of RG was 4.38%. The most common features of RG were the appearance of the neuroretinal rim (NRR) inferiorly and superiorly.
Conclusions: A large data set of CFPs was assembled, of sufficient quality to develop AI screening solutions for glaucoma. The most common features of RG were the appearance of the NRR inferiorly and superiorly. Disc hemorrhages were a rare feature of RG.
Financial Disclosure(s): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
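The agreement and grader-monitoring figures reported in the Results can be made concrete with a short Python sketch. It is not the authors' code, and several parts are assumptions: the function names (`gwet_ac1`, `sens_spec`, `should_exit`) are hypothetical; the abstract does not say which weights were used for Gwet's AC2, so the sketch uses identity (unweighted) weights, under which AC2 reduces to Gwet's AC1; images graded “UG” by either the grader or the reference are assumed to be left out of the sensitivity and specificity counts; and the exit rule is read as triggering when either metric falls below its threshold.

```python
from collections import Counter
from typing import Sequence, Tuple

# Hypothetical label set mirroring the three grades named in the abstract.
LABELS = ("RG", "NRG", "UG")


def gwet_ac1(rater_a: Sequence[str], rater_b: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Gwet's AC1 for two raters; equals AC2 with identity (unweighted) weights."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    q = len(labels)
    # Observed agreement: share of images on which both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # pi_k: mean of the two raters' marginal proportions for label k.
    pis = [(counts_a[k] + counts_b[k]) / (2 * n) for k in labels]
    # Gwet's chance-agreement term; it never exceeds 1/q, so 1 - p_e > 0.
    p_e = sum(pi * (1 - pi) for pi in pis) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


def _ratio(num: int, den: int) -> float:
    # Returns NaN rather than raising when a class is absent from the sample.
    return num / den if den else float("nan")


def sens_spec(grader: Sequence[str], reference: Sequence[str],
              positive: str = "RG", negative: str = "NRG") -> Tuple[float, float]:
    """Sensitivity and specificity of one grader against the final (adjudicated) grade.

    Assumption: pairs where either grade is not RG/NRG (e.g. UG) are skipped.
    """
    tp = fn = tn = fp = 0
    for g, r in zip(grader, reference):
        if g not in (positive, negative) or r not in (positive, negative):
            continue
        if r == positive:
            if g == positive:
                tp += 1
            else:
                fn += 1
        elif g == negative:
            tn += 1
        else:
            fp += 1
    return _ratio(tp, tp + fn), _ratio(tn, tn + fp)


def should_exit(grader: Sequence[str], reference: Sequence[str],
                min_sens: float = 0.80, min_spec: float = 0.95) -> bool:
    """Hypothetical monitoring rule: exit if either metric falls below its threshold."""
    sens, spec = sens_spec(grader, reference)
    return sens < min_sens or spec < min_spec


if __name__ == "__main__":
    # Toy ratings for illustration only; not data from the study.
    a = ["RG", "NRG", "NRG", "NRG", "RG", "NRG", "UG", "NRG"]
    b = ["RG", "NRG", "NRG", "RG", "RG", "NRG", "UG", "NRG"]
    print(f"AC1 = {gwet_ac1(a, b):.3f}")
    print("sensitivity, specificity =", sens_spec(a, b))
    print("exit:", should_exit(a, b))
```

If the study used non-identity (for example ordinal) weights for AC2, only the observed- and chance-agreement terms in `gwet_ac1` would change; the monitoring functions are independent of that choice.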
first_indexed 2024-03-11T21:15:03Z
format Article
id doaj.art-82d26916e8b447359777092e617ddfba
institution Directory of Open Access Journals
issn 2666-9145
language English
last_indexed 2024-03-11T21:15:03Z
publishDate 2023-09-01
publisher Elsevier
record_format Article
series Ophthalmology Science
affiliation Hans G. Lemij, MD, PhD: The Rotterdam Eye Hospital, Rotterdam, the Netherlands (correspondence)
Coen de Vente, MSc: Quantitative Healthcare Analysis (QurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, the Netherlands; Department of Biomedical Engineering and Physics, Amsterdam UMC, Amsterdam, the Netherlands
Clara I. Sánchez, PhD: Quantitative Healthcare Analysis (QurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, the Netherlands; Department of Biomedical Engineering and Physics, Amsterdam UMC, Amsterdam, the Netherlands
Koen A. Vermeer, PhD: The Rotterdam Ophthalmic Institute, Rotterdam Eye Hospital, Rotterdam, the Netherlands
citation Ophthalmology Science, volume 3, issue 3, article 100300 (2023-09-01)
title Characteristics of a Large, Labeled Data Set for the Training of Artificial Intelligence for Glaucoma Screening with Fundus Photographs
topic Artificial intelligence
Clinical features
color fundus photographs
glaucoma screening
labeled data set
url http://www.sciencedirect.com/science/article/pii/S2666914523000325