Mechanistic Interpretability for Progress Towards Quantitative AI Safety

In this thesis, we conduct a detailed investigation into the dynamics of neural networks, focusing on two key areas: inference stages in large language models (LLMs) and novel program synthesis methods using mechanistic interpretability. We explore the robustness of LLMs through layer-level interventions...


Bibliographic Details
Main Author: Lad, Vedang K.
Other Authors: Tegmark, Max
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156748