Automated Mechanistic Interpretability for Neural Networks

Mechanistic interpretability research aims to deconstruct the underlying algorithms that neural networks use to perform computations, so that we can modify their components to change behavior in predictable and positive ways. This thesis details three novel methods for automating the interpretation of neural networks that are too large to interpret manually. First, we detect inherently multidimensional representations of data; we discover that large language models use circular representations to perform modular addition. Second, we introduce methods that penalize complexity in neural circuitry; we discover the automatic emergence of interpretable properties such as sparsity, weight tying, and circuit duplication. Third, we apply neural network symmetries to put networks into a simplified normal form, for conversion into human-readable Python; we introduce a program synthesis benchmark and successfully convert 32 of its 62 networks.
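
To make the circular-representation result concrete: multiplying unit complex numbers adds their angles, so embedding residues at evenly spaced angles on a circle turns addition mod p into rotation. The sketch below is a minimal illustration of this arithmetic, not code from the thesis; the modulus p = 17 and the nearest-point decoder are choices made for the example.

```python
# Minimal illustration: a circular embedding implements modular addition.
import numpy as np

p = 17  # modulus; arbitrary choice for this example

def embed(a: int) -> complex:
    """Place residue a on the unit circle: e^{2*pi*i*a/p}."""
    return np.exp(2j * np.pi * a / p)

def add_mod_p(a: int, b: int) -> int:
    """Compute (a + b) mod p using only circle geometry."""
    z = embed(a) * embed(b)  # multiplying unit complexes adds angles
    # Decode: pick the residue whose embedding is nearest to z.
    candidates = np.array([embed(c) for c in range(p)])
    return int(np.argmax((candidates.conj() * z).real))

# Angle addition reproduces modular addition exactly.
assert all(add_mod_p(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```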

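The effect of the second method can be illustrated with the simplest possible complexity penalty. The thesis's specific penalties are not reproduced here; the sketch below uses a plain L1 term on a linear regression over synthetic data, and shows sparsity emerging from the penalty alone rather than from manual pruning.

```python
# Sketch: an L1 complexity penalty drives irrelevant weights to zero.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.5]               # only 3 of 20 inputs matter
y = X @ w_true + 0.01 * rng.normal(size=200)

w, lam, lr = np.zeros(20), 0.1, 0.01
for _ in range(2000):
    # Gradient of mean squared error plus L1 subgradient.
    grad = X.T @ (X @ w - y) / len(y) + lam * np.sign(w)
    w -= lr * grad

# Roughly 3 weights survive; the other 17 shrink to near zero.
print(np.sum(np.abs(w) > 0.05))
```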
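
For the third method, the relevant fact is that networks have symmetries: transformations of the weights that leave the computed function unchanged, so many different weight settings denote the same algorithm. The sketch below illustrates one such symmetry for a small ReLU MLP, permutation of hidden units, together with a hypothetical canonicalization rule (sorting units by incoming weight norm); the thesis's actual normal-form procedure is not reproduced here.

```python
# Sketch: permuting hidden units is a symmetry; sorting picks a
# canonical representative without changing the network's function.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hidden layer
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # output layer

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2        # ReLU MLP

def canonicalize(W1, b1, W2):
    """Sort hidden units by incoming weight norm (one possible key)."""
    order = np.argsort(np.linalg.norm(W1, axis=1))
    # Permute rows of W1/b1 and columns of W2 consistently.
    return W1[order], b1[order], W2[:, order]

x = rng.normal(size=4)
W1c, b1c, W2c = canonicalize(W1, b1, W2)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1c, b1c, W2c, b2))
```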

Bibliographic Details
Main Author: Liao, Isaac C.
Other Authors: Tegmark, Max
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s)
Online Access: https://hdl.handle.net/1721.1/156787