Mechanistic Interpretability for Progress Towards Quantitative AI Safety
In this thesis, we conduct a detailed investigation into the dynamics of neural networks, focusing on two key areas: inference stages in large language models (LLMs) and novel program synthesis methods using mechanistic interpretability. We explore the robustness of LLMs through layer-level interven...
| Main Author: | |
|---|---|
| Other Authors: | |
| Format: | Thesis |
| Published: | Massachusetts Institute of Technology, 2024 |
| Online Access: | https://hdl.handle.net/1721.1/156748 |