Evaluating the Performance of ChatGPT in Ophthalmology

Purpose: Foundation models are a novel type of artificial intelligence algorithm in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.

Design: Evaluation of diagnostic test or technology.

Participants: ChatGPT is a publicly available LLM.

Methods: We tested 2 versions of ChatGPT (January 9 "legacy" and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to decide whether there were meaningful differences between the tested subspecialties.

Main Outcome Measures: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT's outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.

Results: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

Conclusion: ChatGPT showed encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.

Financial Disclosure(s): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
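As context for the Methods, the likelihood ratio (LR) chi-square reported for the examination-section predictor can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the data, the function names (`lr_chisq_for_section`, `bernoulli_loglik`), and the restriction to a single categorical predictor are assumptions made for the sketch. With only a categorical predictor, the fitted probability in each section equals that section's observed accuracy, so the LR statistic against the intercept-only model has a closed form.

```python
import math
from collections import defaultdict

def bernoulli_loglik(p, ys):
    """Log-likelihood of binary outcomes ys under success probability p."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in ys)

def lr_chisq_for_section(records):
    """LR chi-square for a logistic model whose only predictor is the
    (categorical) examination section, vs. an intercept-only model.

    records: list of (section, correct) pairs, correct in {0, 1}.
    Returns (statistic, degrees_of_freedom).
    """
    ys_all = [y for _, y in records]
    by_section = defaultdict(list)
    for section, y in records:
        by_section[section].append(y)
    # Null model: one overall accuracy for every question.
    ll_null = bernoulli_loglik(sum(ys_all) / len(ys_all), ys_all)
    # Full model: each section's MLE probability is its observed accuracy.
    ll_full = sum(bernoulli_loglik(sum(ys) / len(ys), ys)
                  for ys in by_section.values())
    return 2.0 * (ll_full - ll_null), len(by_section) - 1
```

Per-section accuracy in percentage correct is simply the mean of `correct` within each section; comparing the returned statistic against a chi-square distribution with the returned degrees of freedom yields the P value the study reports.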


Bibliographic Details
Main Authors: Fares Antaki, MD, CM, Samir Touma, MD, CM, Daniel Milad, MD, Jonathan El-Khoury, MD, Renaud Duval, MD
Format: Article
Language: English
Published: Elsevier, 2023-12-01
Series: Ophthalmology Science
ISSN: 2666-9145
Collection: Directory of Open Access Journals (DOAJ)
Subjects: Artificial intelligence; ChatGPT; Generative Pretrained Transformer; Medical education; Ophthalmology
Online Access:http://www.sciencedirect.com/science/article/pii/S2666914523000568
Author Affiliations:
Fares Antaki, MD, CM: Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada; Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l’Est-de-l’Île-de-Montréal, Montréal, Quebec, Canada; Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada; The CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada
Samir Touma, MD, CM, Daniel Milad, MD, and Jonathan El-Khoury, MD: Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada; Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l’Est-de-l’Île-de-Montréal, Montréal, Quebec, Canada; Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada
Renaud Duval, MD: Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada; Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l’Est-de-l’Île-de-Montréal, Montréal, Quebec, Canada

Correspondence: Renaud Duval, MD, Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, 5415 Boulevard de l'Assomption, Montréal, Québec, Canada, H1T 2M4.