Mechanistic Interpretability for Progress Towards Quantitative AI Safety
In this thesis, we conduct a detailed investigation into the dynamics of neural networks, focusing on two key areas: inference stages in large language models (LLMs) and novel program synthesis methods using mechanistic interpretability. We explore the robustness of LLMs through layer-level interven...
| Main Author: | |
|---|---|
| Other Authors: | |
| Format: | Thesis |
| Published: | Massachusetts Institute of Technology, 2024 |
| Online Access: | https://hdl.handle.net/1721.1/156748 |