Automated Mechanistic Interpretability for Neural Networks

Mechanistic interpretability research aims to deconstruct the underlying algorithms that neural networks use to perform computations, so that we can modify their components to change behavior in predictable and positive ways. This thesis details three novel methods for automating the interpretation of neural networks that are too large to interpret manually. First, we detect inherently multidimensional representations of data; we discover that large language models use circular representations to perform modular addition. Second, we introduce methods that penalize complexity in neural circuitry; we discover the automatic emergence of interpretable properties such as sparsity, weight tying, and circuit duplication. Third, we apply neural network symmetries to put networks into a simplified normal form, for conversion into human-readable Python; we introduce a program synthesis benchmark and successfully convert 32 of its 62 networks.
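
To make the circular-representation result concrete: multiplying unit complex numbers adds their angles, so embedding residues at evenly spaced angles on a circle turns addition mod p into rotation. The sketch below is a minimal illustration of this arithmetic, not code from the thesis; the modulus p = 17 and the nearest-point decoder are choices made for the example.

```python
# Minimal illustration: a circular embedding implements modular addition.
import numpy as np

p = 17  # modulus; arbitrary choice for this example

def embed(a: int) -> complex:
    """Place residue a on the unit circle: e^{2*pi*i*a/p}."""
    return np.exp(2j * np.pi * a / p)

def add_mod_p(a: int, b: int) -> int:
    """Compute (a + b) mod p using only circle geometry."""
    z = embed(a) * embed(b)  # multiplying unit complexes adds angles
    # Decode: pick the residue whose embedding is nearest to z.
    candidates = np.array([embed(c) for c in range(p)])
    return int(np.argmax((candidates.conj() * z).real))

# Angle addition reproduces modular addition exactly.
assert all(add_mod_p(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```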

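The effect of the second method can be illustrated with the simplest possible complexity penalty. The thesis's specific penalties are not reproduced here; the sketch below uses a plain L1 term on a linear regression over synthetic data, and shows sparsity emerging from the penalty alone rather than from manual pruning.

```python
# Sketch: an L1 complexity penalty drives irrelevant weights to zero.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.5]               # only 3 of 20 inputs matter
y = X @ w_true + 0.01 * rng.normal(size=200)

w, lam, lr = np.zeros(20), 0.1, 0.01
for _ in range(2000):
    # Gradient of mean squared error plus L1 subgradient.
    grad = X.T @ (X @ w - y) / len(y) + lam * np.sign(w)
    w -= lr * grad

# Roughly 3 weights survive; the other 17 shrink to near zero.
print(np.sum(np.abs(w) > 0.05))
```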
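
For the third method, the relevant fact is that networks have symmetries: transformations of the weights that leave the computed function unchanged, so many different weight settings denote the same algorithm. The sketch below illustrates one such symmetry for a small ReLU MLP, permutation of hidden units, together with a hypothetical canonicalization rule (sorting units by incoming weight norm); the thesis's actual normal-form procedure is not reproduced here.

```python
# Sketch: permuting hidden units is a symmetry; sorting picks a
# canonical representative without changing the network's function.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hidden layer
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # output layer

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2        # ReLU MLP

def canonicalize(W1, b1, W2):
    """Sort hidden units by incoming weight norm (one possible key)."""
    order = np.argsort(np.linalg.norm(W1, axis=1))
    # Permute rows of W1/b1 and columns of W2 consistently.
    return W1[order], b1[order], W2[:, order]

x = rng.normal(size=4)
W1c, b1c, W2c = canonicalize(W1, b1, W2)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1c, b1c, W2c, b2))
```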

Bibliographic Details
Main Author: Liao, Isaac C.
Other Authors: Tegmark, Max
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s)
Online Access: https://hdl.handle.net/1721.1/156787