Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis

Abstract Artificial intelligence (AI) algorithms evaluating [supine] chest radiographs ([S]CXRs) have increased remarkably in number in recent years. Since training and validation are often performed on subsets of the same overall dataset, external validation is mandatory to reproduce results and to reveal potential training errors. We applied a multi-cohort benchmarking to the publicly accessible (S)CXR-analyzing AI algorithm CheXNet, comprising three clinically relevant study cohorts that differ in patient positioning ([S]CXRs), in the applied reference standards (CT-/[S]CXR-based) and in the possibility to compare algorithm classification with the reading performance of differently qualified medical experts. The study cohorts comprise [1] 563 CXRs acquired in the emergency unit and evaluated by 9 readers (radiologists and non-radiologists) for 4 common pathologies, [2] 6,248 SCXRs annotated by radiologists for pneumothorax presence, pneumothorax size and inserted thoracic tube material, which allowed for subgroup and confounding-bias analyses, and [3] 166 patients whose SCXRs were evaluated by radiologists for underlying causes of basal lung opacities, every case correlated to a timely acquired computed tomography scan (SCXR and CT less than 90 min apart). CheXNet non-significantly exceeded the radiology resident (RR) consensus in the detection of suspicious lung nodules (cohort [1], AUC AI/RR: 0.851/0.839, p = 0.793) and the radiological readers in the detection of basal pneumonia (cohort [3], AUC AI/reader consensus: 0.825/0.782, p = 0.390) and basal pleural effusion (cohort [3], AUC AI/reader consensus: 0.762/0.710, p = 0.336) in SCXRs, partly with AUC values higher than originally published (“Nodule”: 0.780, “Infiltration”: 0.735, “Effusion”: 0.864). The classifier “Infiltration” proved highly dependent on patient positioning (performing best on CXRs and worst on SCXRs). The pneumothorax SCXR cohort [2] revealed poor algorithm performance in SCXRs without inserted thoracic material and in the detection of small pneumothoraces, which can be explained by a known systematic confounding error in the algorithm's training process. The benefit of clinically relevant external validation is demonstrated by the differences in algorithm performance compared with the original publication. Our multi-cohort benchmarking ultimately enables the consideration of confounders, different reference standards and patient positioning, as well as the comparison of AI performance with that of differently qualified medical readers.
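The abstract's head-to-head comparisons (algorithm vs. reader consensus on identical cases, each reported as an AUC pair with a p-value) can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example of one way such a paired AUC comparison could be run, using scikit-learn for the AUCs and a case-level bootstrap as a stand-in for whatever significance test the authors actually used; all function names, variable names and data are illustrative, not taken from the paper.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def paired_bootstrap_auc_test(y_true, ai_scores, reader_scores, n_boot=10_000):
    # Two-sided paired bootstrap test for the difference between two AUCs
    # computed on the same cases (e.g. AI probabilities vs. a reader-consensus
    # score for "pneumonia present"). Hypothetical stand-in, not the paper's test.
    y_true = np.asarray(y_true)
    ai_scores = np.asarray(ai_scores)
    reader_scores = np.asarray(reader_scores)
    observed = roc_auc_score(y_true, ai_scores) - roc_auc_score(y_true, reader_scores)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)            # resample cases with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                           # an AUC needs both classes present
        diffs.append(roc_auc_score(y_true[idx], ai_scores[idx])
                     - roc_auc_score(y_true[idx], reader_scores[idx]))
    diffs = np.asarray(diffs)
    # Two-sided p-value: how often the resampled difference falls on either
    # side of zero, doubled and capped at 1.
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return observed, p_value

# Usage with synthetic data (166 cases, mirroring the size of cohort [3]):
y = np.array([0, 1] * 83)
auc_diff, p = paired_bootstrap_auc_test(y, rng.random(166), rng.random(166))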

Bibliographic Details
Main Authors: Jan Rudolph, Balthasar Schachtner, Nicola Fink, Vanessa Koliogiannis, Vincent Schwarze, Sophia Goller, Lena Trappmann, Boj F. Hoppe, Nabeel Mansour, Maximilian Fischer, Najib Ben Khaled, Maximilian Jörgens, Julien Dinkel, Wolfgang G. Kunz, Jens Ricke, Michael Ingrisch, Bastian O. Sabel, Johannes Rueckel
Format: Article
Language: English
Published: Nature Portfolio, 2022-07-01
Series: Scientific Reports
ISSN: 2045-2322
Online Access: https://doi.org/10.1038/s41598-022-16514-7
Author Affiliations:
Jan Rudolph, Balthasar Schachtner, Nicola Fink, Vanessa Koliogiannis, Vincent Schwarze, Sophia Goller, Lena Trappmann, Boj F. Hoppe, Nabeel Mansour, Julien Dinkel, Wolfgang G. Kunz, Jens Ricke, Michael Ingrisch, Bastian O. Sabel, Johannes Rueckel: Department of Radiology, University Hospital, LMU Munich
Maximilian Fischer: Department of Medicine I, University Hospital, LMU Munich
Najib Ben Khaled: Department of Medicine II, University Hospital, LMU Munich
Maximilian Jörgens: Department of Orthopaedics and Trauma Surgery, Musculoskeletal University Center Munich (MUM), University Hospital, LMU Munich