Evaluating Bias in Machine Learning-Enabled Radiology Image Classification

Bibliographic Details
Main Author: Atia, Dina
Other Authors: Ghassemi, Marzyeh
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/151662
Description
Summary: As machine learning grows more prevalent in the medical field, it is important to ensure that fairness is considered a central criterion in the evaluation of algorithms and models. Building upon previous work, we study a set of machine learning models used to detect spinal fractures, comparing their performance across various age, sex, and geographic groups. This serves not only as an audit of this particular set of models but also contributes to the development of a meaningful standard for fairness in the space of Machine Learning for Healthcare. We analyze the 10 highest-performing models from a competition hosted by the Radiological Society of North America in 2022, in which teams competed to design and train machine learning models that accurately detect and locate cervical spine fractures, a severe injury with a high mortality rate. We split the data into subgroups by sex, age, and continent, then compare the subgroups across seven performance metrics. We find the models to be fair overall, with similar performance across the given metrics. Additionally, we perform an intersectional analysis, comparing the same metrics with the data split by intersections of the above attributes, and again find fair overall performance. Viewed holistically, the results suggest the models are fair under a variety of comparative metrics. However, future work is needed to determine whether the models we studied would in fact be fair for a more representative population.
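
To make the audit procedure concrete, the sketch below shows one way to compute per-subgroup and intersectional performance metrics for a binary fracture classifier. This is an illustrative example rather than the thesis's actual code: the column names (sex, age_group, continent, label, score), the 0.5 decision threshold, and the three metrics shown (the thesis compares seven) are all assumptions made for the sketch.

import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score

def subgroup_metrics(df: pd.DataFrame, group_cols: list[str]) -> pd.DataFrame:
    """Compute illustrative performance metrics for each subgroup defined by group_cols.

    Assumes every subgroup contains both positive and negative cases;
    otherwise AUROC is undefined for that subgroup.
    """
    rows = []
    for keys, grp in df.groupby(group_cols):
        keys = keys if isinstance(keys, tuple) else (keys,)
        preds = (grp["score"] >= 0.5).astype(int)  # assumed decision threshold
        rows.append({
            **dict(zip(group_cols, keys)),
            "n": len(grp),
            "auroc": roc_auc_score(grp["label"], grp["score"]),
            "accuracy": accuracy_score(grp["label"], preds),
            "sensitivity": recall_score(grp["label"], preds),
        })
    return pd.DataFrame(rows)

# Hypothetical usage with one row per study in a predictions table:
# df = pd.read_csv("model_predictions.csv")
# per_sex = subgroup_metrics(df, ["sex"])                      # single-attribute audit
# intersectional = subgroup_metrics(df, ["sex", "age_group"])  # intersectional audit

The same helper covers both analyses described above: passing a single attribute gives the per-group comparison, while passing several attributes groups the data by their intersections.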