Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging tec...

Bibliographic Details
Main Authors: Arun James Thirunavukarasu, Refaat Hassan, Shathar Mahmood, Rohan Sanghera, Kara Barzangi, Mohanned El Mukashfi, Sachin Shah
Format: Article
Language: English
Published: JMIR Publications 2023-04-01
Series: JMIR Medical Education
Online Access: https://mededu.jmir.org/2023/1/e46599
collection DOAJ
description Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model’s answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners’ reports from 2018 to 2022. Novel explanations from ChatGPT (defined as information provided that was not inputted within the question or multiple answer choices) were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT’s strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT’s performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=–0.241 and –0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert–level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
first_indexed 2024-03-12T12:39:12Z
format Article
id doaj.art-40b47fdcf9f643c88aef93d12bdb0dd2
institution Directory Open Access Journal
issn 2369-3762
language English
last_indexed 2024-03-12T12:39:12Z
publishDate 2023-04-01
publisher JMIR Publications
record_format Article
series JMIR Medical Education
spelling doaj.art-40b47fdcf9f643c88aef93d12bdb0dd2
2023-08-28T23:56:34Z
eng
JMIR Publications
JMIR Medical Education
2369-3762
2023-04-01
9
e46599
10.2196/46599
Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care
Arun James Thirunavukarasu https://orcid.org/0000-0001-8968-4768
Refaat Hassan https://orcid.org/0000-0002-3054-1161
Shathar Mahmood https://orcid.org/0009-0008-4209-1306
Rohan Sanghera https://orcid.org/0000-0001-6370-8426
Kara Barzangi https://orcid.org/0009-0009-0327-1221
Mohanned El Mukashfi https://orcid.org/0009-0001-8158-0216
Sachin Shah https://orcid.org/0009-0008-2470-6143
https://mededu.jmir.org/2023/1/e46599
title Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care
url https://mededu.jmir.org/2023/1/e46599