Evaluating large language models as agents in the clinic

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.


Bibliographic Details
Main Authors: Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J. Butte, Ahmed Alaa
Format: Article
Language: English
Published: Nature Portfolio 2024-04-01
Series: npj Digital Medicine
Online Access:https://doi.org/10.1038/s41746-024-01083-y
ISSN: 2398-6352
Author Affiliations
Nikita Mehandru: University of California, Berkeley
Brenda Y. Miao: Bakar Computational Health Sciences Institute, University of California San Francisco
Eduardo Rodriguez Almaraz: Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco
Madhumita Sushil: Bakar Computational Health Sciences Institute, University of California San Francisco
Atul J. Butte: Bakar Computational Health Sciences Institute, University of California San Francisco
Ahmed Alaa: University of California, Berkeley