The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study

Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagn...

Bibliographic Details
Main Authors: Naoki Ito, Sakina Kadomatsu, Mineto Fujisawa, Kiyomitsu Fukaguchi, Ryo Ishizawa, Naoki Kanda, Daisuke Kasugai, Mikio Nakajima, Tadahiro Goto, Yusuke Tsugawa
Format: Article
Language: English
Published: JMIR Publications 2023-11-01
Series: JMIR Medical Education
Online Access: https://mededu.jmir.org/2023/1/e47532
author Naoki Ito
Sakina Kadomatsu
Mineto Fujisawa
Kiyomitsu Fukaguchi
Ryo Ishizawa
Naoki Kanda
Daisuke Kasugai
Mikio Nakajima
Tadahiro Goto
Yusuke Tsugawa
collection DOAJ
description Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear.
Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity.
Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as “correct” or “incorrect.” Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes.
Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients’ race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity.
Conclusions: GPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
first_indexed 2024-03-11T13:40:58Z
format Article
id doaj.art-fcc369dc22e34925936b40cf22331f2c
institution Directory Open Access Journal
issn 2369-3762
language English
last_indexed 2024-03-11T13:40:58Z
publishDate 2023-11-01
publisher JMIR Publications
record_format Article
series JMIR Medical Education
spelling doaj.art-fcc369dc22e34925936b40cf22331f2c
2023-11-02T13:45:33Z
eng
JMIR Publications
JMIR Medical Education
2369-3762
2023-11-01
9
e47532
10.2196/47532
The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study
Naoki Ito https://orcid.org/0009-0005-6135-7868
Sakina Kadomatsu https://orcid.org/0009-0005-4236-216X
Mineto Fujisawa https://orcid.org/0009-0006-5064-1879
Kiyomitsu Fukaguchi https://orcid.org/0000-0003-2262-1898
Ryo Ishizawa https://orcid.org/0000-0002-6324-7399
Naoki Kanda https://orcid.org/0000-0001-8003-534X
Daisuke Kasugai https://orcid.org/0000-0002-8692-3003
Mikio Nakajima https://orcid.org/0000-0002-2903-1092
Tadahiro Goto https://orcid.org/0000-0002-5880-2968
Yusuke Tsugawa https://orcid.org/0000-0002-1937-4833
https://mededu.jmir.org/2023/1/e47532
title The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study
url https://mededu.jmir.org/2023/1/e47532
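
The abstract reports 95% CIs for proportions such as 44/45 (88.2%-99.9%) but this record does not say how they were computed. The sketch below assumes an exact binomial (Clopper-Pearson) interval, which appears consistent with the reported figure for 44/45; the function names `binom_cdf`, `solve_increasing`, and `clopper_pearson` are illustrative and not taken from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(g, iters=200):
    """Bisection on [0, 1] for the root of a function g that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for a binomial proportion x/n (assumed method)."""
    # Lower bound: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# GPT-4 diagnostic accuracy from the abstract: 44 correct out of 45 vignettes.
low, high = clopper_pearson(44, 45)
print(f"{44 / 45:.1%} (95% CI {low:.1%}-{high:.1%})")
```

The same call can be applied to the other counts in the abstract, for example `clopper_pearson(30, 45)` for triage; whether the authors used this interval for every estimate is an assumption.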