Methodological insights into ChatGPT’s screening performance in systematic reviews


Bibliographic Details
Main Authors: Mahbod Issaiy, Hossein Ghanaati, Shahriar Kolahi, Madjid Shakiba, Amir Hossein Jalali, Diana Zarei, Sina Kazemian, Mahsa Alborzi Avanaki, Kavous Firouznia
Format: Article
Language: English
Published: BMC, 2024-03-01
Series: BMC Medical Research Methodology
Subjects: Systematic review; ChatGPT; AI; Large language model; Article screening; Radiology
Online Access: https://doi.org/10.1186/s12874-024-02203-8
Collection: DOAJ
Abstract

Background: The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformer (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.

Methods: A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT’s performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics including sensitivity, specificity, positive and negative predictive values (PPV and NPV), and workload saving were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.

Results: ChatGPT completed the screening process within an hour, while GPs took an average of 7–10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs’ sensitive consensus (i.e., including a record if at least one rater includes it). It also exhibited remarkably low false-negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV than the human raters. The average Kappa agreement between ChatGPT and the other raters was 0.27.

Conclusions: ChatGPT shows promise in automating the article-screening phase of systematic reviews, achieving high sensitivity and substantial workload savings. While not a replacement for human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.
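The metrics named in the abstract (sensitivity, specificity, PPV, NPV, workload saving, and Cohen's kappa) can all be computed from simple screening counts. The sketch below is illustrative only — it is not the study's code, the counts are hypothetical, and "workload saving" is taken here under one common definition (the share of records the tool excludes, which humans then need not review):

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts.

    tp/fn: relevant records correctly included / wrongly excluded;
    tn/fp: irrelevant records correctly excluded / wrongly included.
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # fraction of relevant records caught
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        # One common definition: records excluded by the tool need no review.
        "workload_saving": (tn + fn) / total,
    }


def cohen_kappa(a, b):
    """Cohen's kappa for two raters making binary (0/1) include decisions."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_expected = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts, for illustration only.
m = screening_metrics(tp=95, fp=300, tn=800, fn=5)
```

With these hypothetical counts, sensitivity is 95/(95+5) = 0.95 and NPV is 800/805 ≈ 0.99, matching the shape of the figures reported in the abstract; kappa near 0 with high sensitivity is consistent with a liberal screener that includes generously but rarely misses a relevant record.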
ISSN: 1471-2288
Author affiliations:
Mahbod Issaiy, Hossein Ghanaati, Shahriar Kolahi, Madjid Shakiba, Amir Hossein Jalali, Diana Zarei, Mahsa Alborzi Avanaki, Kavous Firouznia — Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science
Sina Kazemian — Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences