A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study
Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported.
Main Authors: | Cai Long, Kayle Lowe, Jessica Zhang, André dos Santos, Alaa Alanazi, Daniel O'Brien, Erin D Wright, David Cote |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications, 2024-01-01 |
Series: | JMIR Medical Education |
Online Access: | https://mededu.jmir.org/2024/1/e49970 |
_version_ | 1797353896608268288 |
author | Cai Long Kayle Lowe Jessica Zhang André dos Santos Alaa Alanazi Daniel O'Brien Erin D Wright David Cote |
author_facet | Cai Long Kayle Lowe Jessica Zhang André dos Santos Alaa Alanazi Daniel O'Brien Erin D Wright David Cote |
author_sort | Cai Long |
collection | DOAJ |
description |
Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported.
Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model’s performance on open-ended medical board examination questions.
Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada’s sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance.
Results: In the open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) and demonstrated higher accuracy when prompts were provided. The model demonstrated high concordance (92.06%) and satisfactory validity. While showing considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed.
Conclusions: ChatGPT achieved a passing score on the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers before clinical implementation. |
first_indexed | 2024-03-08T13:37:33Z |
format | Article |
id | doaj.art-925c19aa61de4911b651e933afc650bf |
institution | Directory Open Access Journal |
issn | 2369-3762 |
language | English |
last_indexed | 2024-03-08T13:37:33Z |
publishDate | 2024-01-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Education |
spelling | doaj.art-925c19aa61de4911b651e933afc650bf | 2024-01-16T14:45:53Z | eng | JMIR Publications | JMIR Medical Education | 2369-3762 | 2024-01-01 | vol. 10, e49970 | doi:10.2196/49970 | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study | Cai Long (https://orcid.org/0000-0002-5311-7355), Kayle Lowe (https://orcid.org/0009-0006-7940-333X), Jessica Zhang (https://orcid.org/0000-0002-1578-8529), André dos Santos (https://orcid.org/0009-0000-0393-3082), Alaa Alanazi (https://orcid.org/0000-0001-8096-9118), Daniel O'Brien (https://orcid.org/0000-0002-8394-9902), Erin D Wright (https://orcid.org/0000-0001-5601-2754), David Cote (https://orcid.org/0000-0001-8971-6969) | https://mededu.jmir.org/2024/1/e49970 |
spellingShingle | Cai Long Kayle Lowe Jessica Zhang André dos Santos Alaa Alanazi Daniel O'Brien Erin D Wright David Cote A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study JMIR Medical Education |
title | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study |
title_full | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study |
title_fullStr | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study |
title_full_unstemmed | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study |
title_short | A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study |
title_sort | novel evaluation model for assessing chatgpt on otolaryngology head and neck surgery certification examinations performance study |
url | https://mededu.jmir.org/2024/1/e49970 |
work_keys_str_mv | AT cailong anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT kaylelowe anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT jessicazhang anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT andredossantos anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT alaaalanazi anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT danielobrien anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT erindwright anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT davidcote anovelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT cailong novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT kaylelowe novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT jessicazhang novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT andredossantos novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT alaaalanazi novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT danielobrien novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT erindwright novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy AT davidcote novelevaluationmodelforassessingchatgptonotolaryngologyheadandnecksurgerycertificationexaminationsperformancestudy |