A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study


Bibliographic Details
Main Authors: Cai Long, Kayle Lowe, Jessica Zhang, André dos Santos, Alaa Alanazi, Daniel O'Brien, Erin D Wright, David Cote
Format: Article
Language: English
Published: JMIR Publications, 2024-01-01
Series: JMIR Medical Education
Online Access: https://mededu.jmir.org/2024/1/e49970
description
Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and other open-ended medical board certification examinations has not been reported.
Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and to propose a novel method for assessing an AI model's performance on open-ended medical board examination questions.
Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada's sample examination and used to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance.
Results: In the open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) and demonstrated higher accuracy with prompts. The model showed high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed.
Conclusions: ChatGPT achieved a passing score on the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation.
issn 2369-3762
doi 10.2196/49970
Cai Long https://orcid.org/0000-0002-5311-7355
Kayle Lowe https://orcid.org/0009-0006-7940-333X
Jessica Zhang https://orcid.org/0000-0002-1578-8529
André dos Santos https://orcid.org/0009-0000-0393-3082
Alaa Alanazi https://orcid.org/0000-0001-8096-9118
Daniel O'Brien https://orcid.org/0000-0002-8394-9902
Erin D Wright https://orcid.org/0000-0001-5601-2754
David Cote https://orcid.org/0000-0001-8971-6969