Automated Interpretation of Machine Learning Models

As machine learning (ML) models are increasingly deployed in production, there is a pressing need to ensure their reliability through auditing, debugging, and testing. Interpretability, the subfield that studies how ML models make decisions, aspires to meet this need but traditionally relies on human-led experimentation or on human priors about what the model has learned. In this thesis, I propose that interpretability should evolve alongside ML by adopting automated techniques that use ML models to interpret ML models. This shift towards automation allows for more comprehensive analyses of ML models without requiring human scrutiny at every step, and the effectiveness of these methods should improve as the ML models themselves become more sophisticated. I present three examples of automated interpretability approaches: using a captioning model to label features of other models, manipulating an ML model's internal representations to predict and correct errors, and identifying simple internal circuits by approximating the ML model itself. These examples lay the groundwork for future efforts in automating ML model interpretation.
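The thesis develops each of these directions in depth; the code below (Python, assuming PyTorch is installed) is only a loose, hypothetical sketch of the underlying recipe of using one model to interpret another, in which a small linear probe is trained on a toy classifier's hidden activations to predict when that classifier errs. Every model, name, and dataset in the sketch is an assumption made for illustration and does not reproduce any method from the thesis.

    # Hypothetical illustration only: a small "interpreter" model (a linear
    # probe) is trained on a toy classifier's hidden activations to predict
    # when the classifier will be wrong. Not the thesis's actual method.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Subject model to be interpreted: a tiny MLP classifier.
    subject = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    # Capture the hidden representation after the ReLU with a forward hook.
    hidden = {}
    subject[1].register_forward_hook(
        lambda mod, inp, out: hidden.update(h=out.detach()))

    # Toy data: the label depends only on the sign of the first feature.
    x = torch.randn(2000, 10)
    y = (x[:, 0] > 0).long()

    # Train the subject model only briefly so it still makes some errors.
    opt = torch.optim.Adam(subject.parameters(), lr=1e-2)
    for _ in range(30):
        opt.zero_grad()
        nn.functional.cross_entropy(subject(x), y).backward()
        opt.step()

    # Record, for every input, the hidden state and whether the subject erred.
    with torch.no_grad():
        erred = (subject(x).argmax(dim=-1) != y).float()
    reps = hidden["h"]

    # The interpreter: a linear probe predicting errors from hidden states.
    probe = nn.Linear(32, 1)
    probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(200):
        probe_opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(
            probe(reps).squeeze(-1), erred)
        loss.backward()
        probe_opt.step()

    print(f"error rate: {erred.mean().item():.3f}, "
          f"probe loss: {loss.item():.3f}")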

Bibliographic Details
Main Author: Hernandez, Evan
Other Authors: Andreas, Jacob
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156277
Degree: Ph.D.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Date Issued: 2024-05
ORCID: 0000-0002-8876-1781
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
Rights: Copyright retained by author(s)