Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Summary: Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs’ accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic which patients and parents commonly seek i...

Bibliographic Details
Main Authors: Zhi Wei Lim, Krithi Pushpanathan, Samantha Min Er Yew, Yien Lai, Chen-Hsin Sun, Janice Sing Harn Lam, David Ziyou Chen, Jocelyn Hui Lin Goh, Marcus Chun Jin Tan, Bin Sheng, Ching-Yu Cheng, Victor Teck Chang Koh, Yih-Chung Tham
Format: Article
Language: English
Published: Elsevier 2023-09-01
Series: EBioMedicine
Subjects: ChatGPT-4.0, ChatGPT-3.5, Google Bard, Chatbot, Myopia, Large language models
Online Access: http://www.sciencedirect.com/science/article/pii/S2352396423003365
_version_ 1797738020983537664
author Zhi Wei Lim
Krithi Pushpanathan
Samantha Min Er Yew
Yien Lai
Chen-Hsin Sun
Janice Sing Harn Lam
David Ziyou Chen
Jocelyn Hui Lin Goh
Marcus Chun Jin Tan
Bin Sheng
Ching-Yu Cheng
Victor Teck Chang Koh
Yih-Chung Tham
author_facet Zhi Wei Lim
Krithi Pushpanathan
Samantha Min Er Yew
Yien Lai
Chen-Hsin Sun
Janice Sing Harn Lam
David Ziyou Chen
Jocelyn Hui Lin Goh
Marcus Chun Jin Tan
Bin Sheng
Ching-Yu Cheng
Victor Teck Chang Koh
Yih-Chung Tham
author_sort Zhi Wei Lim
collection DOAJ
description Summary: Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs’ accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a topic on which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries. Methods: We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. Responses rated ‘good’ were further evaluated for comprehensiveness on a five-point scale. Conversely, for responses rated ‘poor’, the LLM was prompted to self-correct and the revised response was re-evaluated for accuracy. Findings: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as ‘good’, compared to 61.3% for ChatGPT-3.5 and 54.8% for Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0's, 40% (2 in 5) of ChatGPT-3.5's, and 60% (3 in 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for ‘treatment and prevention’. However, ChatGPT-4.0 still outperformed the other two in this domain, receiving 70% ‘good’ ratings, compared to 40% for ChatGPT-3.5 and 45% for Google Bard (Pearson's chi-squared test, all p ≤ 0.001). Interpretation: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs’ accuracy remain crucial. Funding: Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
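The majority consensus grading and the self-correction percentages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the example grades, and the choice to return None when all three graders disagree are assumptions made for illustration, since the abstract does not say how such cases were resolved. Only the self-correction counts (2 in 3, 2 in 5, 3 in 5) come from the abstract.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_rating(ratings: Sequence[str]) -> Optional[str]:
    """Return the rating chosen by more than half of the graders.

    Returns None when no rating has a strict majority (an assumption;
    the abstract does not describe how such cases were handled).
    """
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * part / whole, 1)


# Hypothetical grades from three graders for three responses (not study data).
example_grades = {
    "Q1": ["good", "good", "borderline"],
    "Q2": ["poor", "poor", "good"],
    "Q3": ["poor", "borderline", "good"],
}
final = {q: majority_rating(g) for q, g in example_grades.items()}

# Self-correction counts taken from the abstract: improved / initially 'poor'.
self_correction = {
    "ChatGPT-4.0": percent(2, 3),
    "ChatGPT-3.5": percent(2, 5),
    "Google Bard": percent(3, 5),
}

print(final)
print(self_correction)
```

With three graders, a strict majority means at least two matching ratings, so the only case without a consensus is one where all three ratings differ.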
first_indexed 2024-03-12T13:36:52Z
format Article
id doaj.art-ef63a79ce7664a54923f3a59600f939a
institution Directory of Open Access Journals
issn 2352-3964
language English
last_indexed 2024-03-12T13:36:52Z
publishDate 2023-09-01
publisher Elsevier
record_format Article
series EBioMedicine
spelling doaj.art-ef63a79ce7664a54923f3a59600f939a
2023-08-24T04:35:16Z
eng
Elsevier
EBioMedicine
2352-3964
2023-09-01
95
104770
Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
Zhi Wei Lim (0)
Krithi Pushpanathan (1)
Samantha Min Er Yew (2)
Yien Lai (3)
Chen-Hsin Sun (4)
Janice Sing Harn Lam (5)
David Ziyou Chen (6)
Jocelyn Hui Lin Goh (7)
Marcus Chun Jin Tan (8)
Bin Sheng (9)
Ching-Yu Cheng (10)
Victor Teck Chang Koh (11)
Yih-Chung Tham (12)
(0) Yong Loo Lin School of Medicine, National University of Singapore, Singapore
(1) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
(2) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
(3) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(4) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(5) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(6) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(7) Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
(8) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(9) Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Endocrinology and Metabolism, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China; MoE Key Lab of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai, China
(10) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore; Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
(11) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Department of Ophthalmology, National University Hospital, Singapore
(12) Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore; Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore; Corresponding author. Yong Loo Lin School of Medicine, National University of Singapore, Level 13, MD1 Tahir Foundation Building, 12 Science Drive 2, 117549, Singapore.
http://www.sciencedirect.com/science/article/pii/S2352396423003365
ChatGPT-4.0
ChatGPT-3.5
Google Bard
Chatbot
Myopia
Large language models
spellingShingle Zhi Wei Lim
Krithi Pushpanathan
Samantha Min Er Yew
Yien Lai
Chen-Hsin Sun
Janice Sing Harn Lam
David Ziyou Chen
Jocelyn Hui Lin Goh
Marcus Chun Jin Tan
Bin Sheng
Ching-Yu Cheng
Victor Teck Chang Koh
Yih-Chung Tham
Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
EBioMedicine
ChatGPT-4.0
ChatGPT-3.5
Google Bard
Chatbot
Myopia
Large language models
title Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
title_full Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
title_fullStr Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
title_full_unstemmed Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
title_short Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
title_sort benchmarking large language models performances for myopia care a comparative analysis of chatgpt 3 5 chatgpt 4 0 and google bard
topic ChatGPT-4.0
ChatGPT-3.5
Google Bard
Chatbot
Myopia
Large language models
url http://www.sciencedirect.com/science/article/pii/S2352396423003365