Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany

Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English,...

Bibliographic Details
Main Authors: Jonas Roos, Adnan Kasapovic, Tom Jansen, Robert Kaczmarczyk
Format: Article
Language: English
Published: JMIR Publications 2023-09-01
Series: JMIR Medical Education
Online Access: https://mededu.jmir.org/2023/1/e46482
author Jonas Roos
Adnan Kasapovic
Tom Jansen
Robert Kaczmarczyk
collection DOAJ
description Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation. Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students. Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated. Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty. Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.
first_indexed 2024-03-12T02:37:37Z
format Article
id doaj.art-b98c4be203fb4d94aa63141397ddc2f9
institution Directory Open Access Journal
issn 2369-3762
language English
last_indexed 2024-03-12T02:37:37Z
publishDate 2023-09-01
publisher JMIR Publications
record_format Article
series JMIR Medical Education
spelling doaj.art-b98c4be203fb4d94aa63141397ddc2f9; 2023-09-04T13:15:38Z; eng; JMIR Publications; JMIR Medical Education; ISSN 2369-3762; 2023-09-01; volume 9; e46482; DOI 10.2196/46482
Jonas Roos, https://orcid.org/0000-0001-8843-4695
Adnan Kasapovic, https://orcid.org/0000-0001-6273-207X
Tom Jansen, https://orcid.org/0009-0001-3842-1914
Robert Kaczmarczyk, https://orcid.org/0000-0002-8570-1601
title Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany
url https://mededu.jmir.org/2023/1/e46482