Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study

Background: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subse...


Bibliographic Details
Main Authors:	Eddie Guo, Mehul Gupta, Jiawen Deng, Ye-Jean Park, Michael Paget, Christopher Naugler
Format:	Article
Language:	English
Published:	JMIR Publications 2024-01-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2024/1/e48996

_version_	1797356535031005184
author	Eddie Guo, Mehul Gupta, Jiawen Deng, Ye-Jean Park, Michael Paget, Christopher Naugler
author_facet	Eddie Guo, Mehul Gupta, Jiawen Deng, Ye-Jean Park, Michael Paget, Christopher Naugler
author_sort	Eddie Guo
collection	DOAJ
description	Background: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources. Objective: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and to compare their performance against ground truth labeling by 2 independent human reviewers. Methods: We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts. Results: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence- and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications. Conclusions: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
first_indexed	2024-03-08T14:29:01Z
format	Article
id	doaj.art-63a8085605284fc2a2db1d851ea51dd7
institution	Directory of Open Access Journals
issn	1438-8871
language	English
last_indexed	2024-03-08T14:29:01Z
publishDate	2024-01-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj.art-63a8085605284fc2a2db1d851ea51dd7 2024-01-12T15:00:36Z eng JMIR Publications Journal of Medical Internet Research 1438-8871 2024-01-01 26 e48996 10.2196/48996 Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study Eddie Guo https://orcid.org/0000-0002-7223-0505 Mehul Gupta https://orcid.org/0000-0001-7931-0666 Jiawen Deng https://orcid.org/0000-0002-8274-6468 Ye-Jean Park https://orcid.org/0009-0008-1068-8992 Michael Paget https://orcid.org/0000-0002-3322-7661 Christopher Naugler https://orcid.org/0000-0002-4570-1279 https://www.jmir.org/2024/1/e48996
spellingShingle	Eddie Guo Mehul Gupta Jiawen Deng Ye-Jean Park Michael Paget Christopher Naugler Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study Journal of Medical Internet Research
title	Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
title_full	Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
title_fullStr	Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
title_full_unstemmed	Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
title_short	Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
title_sort	automated paper screening for clinical reviews using large language models data analysis study
url	https://www.jmir.org/2024/1/e48996
work_keys_str_mv	AT eddieguo automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy AT mehulgupta automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy AT jiawendeng automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy AT yejeanpark automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy AT michaelpaget automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy AT christophernaugler automatedpaperscreeningforclinicalreviewsusinglargelanguagemodelsdataanalysisstudy
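
The description field above reports accuracy, macro F1-score, per-class sensitivity, Cohen's κ, and a prevalence- and bias-adjusted κ (PABAK). As a rough illustration of how these quantities relate for a binary include/exclude screen, the sketch below computes them from a 2x2 confusion matrix. The function name `screening_metrics` and the counts in the usage example are hypothetical and are not taken from the study; the formulas are the standard definitions, with PABAK for two categories given by 2 × observed agreement − 1. This is not the authors' script.

```python
def screening_metrics(tp, fp, fn, tn):
    """Agreement metrics for a binary include/exclude screen.

    tp: included by both the model and the human consensus
    fp: included by the model, excluded by humans
    fn: excluded by the model, included by humans
    tn: excluded by both
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n

    # Sensitivity (recall) computed separately for each class.
    sens_included = tp / (tp + fn)
    sens_excluded = tn / (tn + fp)

    def f1(true_pos, false_pos, false_neg):
        return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

    # Macro F1: unweighted mean of the per-class F1 scores.
    f1_included = f1(tp, fp, fn)
    f1_excluded = f1(tn, fn, fp)
    macro_f1 = (f1_included + f1_excluded) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    # PABAK for two categories depends only on observed agreement.
    pabak = 2 * p_observed - 1

    return {
        "accuracy": accuracy,
        "sens_included": sens_included,
        "sens_excluded": sens_excluded,
        "macro_f1": macro_f1,
        "kappa": kappa,
        "pabak": pabak,
    }


if __name__ == "__main__":
    # Hypothetical, imbalanced counts (few included papers), for illustration only.
    for name, value in screening_metrics(tp=30, fp=90, fn=10, tn=870).items():
        print(f"{name}: {value:.3f}")
```

With imbalanced counts such as these, accuracy and PABAK stay high while macro F1 and Cohen's κ are pulled down by the rare "included" class, which is one reason a screening study may report several of these measures side by side.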


Similar Items