Automated Mechanistic Interpretability for Neural Networks

Mechanistic interpretability research aims to deconstruct the underlying algorithms that neural networks use to perform computations, such that we can modify their components, causing them to change behavior in predictable and positive ways. This thesis details three novel methods for automating the...

Bibliographic Details
Main Author: Liao, Isaac C.
Other Authors: Tegmark, Max
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156787