Automated Mechanistic Interpretability for Neural Networks
Mechanistic interpretability research aims to deconstruct the underlying algorithms that neural networks use to perform computations, so that we can modify their components and change their behavior in predictable and positive ways. This thesis details three novel methods for automating the interpretation process for neural networks that are too large to interpret manually. First, we detect inherently multidimensional representations of data; we discover that large language models use circular representations to perform modular addition tasks. Second, we introduce methods that penalize complexity in neural circuitry; we discover the automatic emergence of interpretable properties such as sparsity, weight tying, and circuit duplication. Third, we apply neural network symmetries to put networks into a simplified normal form for conversion into human-readable Python; we introduce a program synthesis benchmark for this task and successfully convert 32 of its 62 networks.
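The first result above, that models represent residues as points on a circle so that modular addition becomes rotation, can be illustrated with a small numerical sketch. The snippet below is not the method developed in the thesis; it is a minimal illustration, assuming the common "clock" picture, in which each residue a mod p is embedded as (cos(2πa/p), sin(2πa/p)) and adding b corresponds to rotating by 2πb/p. The modulus p = 7 and all function names are illustrative choices, not taken from the thesis.

```python
import numpy as np

p = 7  # illustrative modulus, not taken from the thesis


def embed(a: int) -> np.ndarray:
    """Place the residue a (mod p) on the unit circle."""
    theta = 2 * np.pi * (a % p) / p
    return np.array([np.cos(theta), np.sin(theta)])


def rotate_by(v: np.ndarray, b: int) -> np.ndarray:
    """Rotate a circular embedding by the angle assigned to b (mod p)."""
    theta = 2 * np.pi * (b % p) / p
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return rotation @ v


# Rotating embed(a) by b lands exactly on embed((a + b) % p),
# so addition mod p is realized as composition of rotations.
for a in range(p):
    for b in range(p):
        assert np.allclose(rotate_by(embed(a), b), embed((a + b) % p))
print(f"circular embedding reproduces addition mod {p}")
```

The point of the sketch is that the representation is genuinely two-dimensional: neither coordinate alone determines the residue, which is the kind of inherently multidimensional (here, circular) structure the abstract refers to.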
Main Author: | Liao, Isaac C. |
---|---|
Other Authors: | Tegmark, Max |
Format: | Thesis |
Degree: | M.Eng. |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Published: | Massachusetts Institute of Technology, 2024 |
Rights: | Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s); https://creativecommons.org/licenses/by-nc-nd/4.0/ |
Online Access: | https://hdl.handle.net/1721.1/156787 |