Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Bibliographic Details
Main Authors: Leonard Knoedler, Michael Alfertshofer, Samuel Knoedler, Cosima C Hoch, Paul F Funk, Sebastian Cotofana, Bhagvat Maheta, Konstantin Frank, Vanessa Brébant, Lukas Prantl, Philipp Lamby
Format: Article
Language: English
Published: JMIR Publications, 2024-01-01
Series: JMIR Medical Education
ISSN: 2369-3762
DOI: 10.2196/51148
Online Access: https://mededu.jmir.org/2024/1/e51148

Abstract

Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited.

Objective: This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and to deduce evidence-based strategies to counteract AI cheating.

Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, the remaining 1840 text-based questions were categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT’s answers, as well as its performance across test question categories and difficulty levels, was compared between the two versions.

Results: Overall, ChatGPT 4 demonstrated statistically significantly superior performance compared to ChatGPT 3.5, achieving accuracies of 84.7% (194/229) and 56.9% (1047/1840), respectively. A weak but statistically significant negative correlation was observed between test question length and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003); no such correlation was found for ChatGPT 4 (P=.87). Additionally, test question difficulty, as categorized by AMBOSS hammer ratings, correlated negatively with performance for both versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 at every difficulty level except the 2 highest tiers (4 and 5 hammers), where the difference did not reach statistical significance.

Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency on USMLE Step 3 questions, with an accuracy of 84.7% (194/229), outperforming ChatGPT 3.5’s 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally well, it struggled with questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for developing examination strategies that are resilient to AI and underline the promising role of AI in medical education and diagnostics.
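
The abstract reports only summary statistics, not the tests behind them. Below is a minimal sketch, in Python, of how the two headline analyses could be reproduced: it assumes a chi-square test of independence for the accuracy comparison and Spearman’s rank correlation for the difficulty analysis (the abstract names neither test explicitly), and the per-question hammer ratings and outcomes are hypothetical placeholders, since only aggregate counts are published here.

    # Reproducing the abstract's headline comparisons (sketch; tests assumed).
    from scipy.stats import chi2_contingency, spearmanr

    # Accuracy comparison: correct/incorrect counts taken from the abstract.
    gpt4_correct, gpt4_total = 194, 229       # 84.7% reported
    gpt35_correct, gpt35_total = 1047, 1840   # 56.9% reported

    table = [
        [gpt4_correct, gpt4_total - gpt4_correct],      # ChatGPT 4: [194, 35]
        [gpt35_correct, gpt35_total - gpt35_correct],   # ChatGPT 3.5: [1047, 793]
    ]
    chi2, p, dof, _ = chi2_contingency(table)  # chi-square test of independence
    print(f"accuracy comparison: chi2={chi2:.1f}, dof={dof}, P={p:.3g}")

    # Difficulty vs. performance: Spearman's rho on per-question data.
    # HYPOTHETICAL inputs -- the study's per-question data are not in the abstract.
    hammers = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]  # AMBOSS hammer ratings (1 = easiest)
    correct = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]  # 1 = answered correctly
    rho, p_rho = spearmanr(hammers, correct)
    print(f"difficulty vs. performance: rho={rho:.3f}, P={p_rho:.3f}")

With the reported counts, the chi-square statistic is large and its P value falls well below .001, consistent with the abstract’s claim that ChatGPT 4’s advantage is statistically significant.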