Mechanistic Interpretability for Progress Towards Quantitative AI Safety


Description

Bibliographic Details
Main Author: Lad, Vedang K.
Other Authors: Tegmark, Max
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156748
Physical Description
Summary: In this thesis, we conduct a detailed investigation into the dynamics of neural networks, focusing on two key areas: the stages of inference in large language models (LLMs) and a novel program synthesis method built on mechanistic interpretability. We probe the robustness of LLMs through layer-level interventions such as zero-ablation and layer swapping, revealing that these models maintain high accuracy despite such perturbations. From these results, we hypothesize distinct stages of inference in LLMs. These findings have implications for LLM dataset curation, model optimization, and quantization. We then introduce MIPS, a program synthesis method that distills the operational logic of neural networks into executable Python code. By transforming an RNN into a finite state machine and applying symbolic regression, MIPS solves 32 of 62 algorithmic tasks, including 13 on which it outperforms GPT-4. This work takes a step toward more interpretable and reliable AI systems, with the aim of deepening our understanding and use of current and future AI capabilities. Together, these studies highlight the importance of understanding the inferential behavior of neural networks in order to build more interpretable and efficient AI.
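
To make the layer-level interventions described above concrete, the following is a minimal sketch (not code from the thesis) of zero-ablation and layer swapping on a toy residual stack in PyTorch. The toy architecture and the names ToyBlock and run_with_intervention are assumptions for illustration only; in a residual network, zero-ablating a block's output is equivalent to skipping the block.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer layer: a residual connection around a small MLP."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The block writes an additive update into the residual stream.
        return x + self.mlp(x)

def run_with_intervention(blocks, x, ablate=None, swap=None):
    """Run the stack, optionally zero-ablating one block or swapping two adjacent blocks."""
    order = list(range(len(blocks)))
    if swap is not None:          # e.g. swap=(3, 4) runs block 4 before block 3
        i, j = swap
        order[i], order[j] = order[j], order[i]
    for idx in order:
        if idx == ablate:         # zero-ablation: this block contributes nothing
            continue
        x = blocks[idx](x)
    return x

blocks = nn.ModuleList([ToyBlock() for _ in range(8)])
x = torch.randn(2, 16, 64)        # (batch, sequence, d_model)
baseline = run_with_intervention(blocks, x)
ablated = run_with_intervention(blocks, x, ablate=3)
swapped = run_with_intervention(blocks, x, swap=(3, 4))
print((baseline - ablated).norm().item(), (baseline - swapped).norm().item())
```

Because each block only adds an update to the residual stream, both interventions leave the rest of the forward pass intact, which is what makes before-and-after accuracy comparisons under perturbation meaningful.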
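
The MIPS pipeline is summarized as converting an RNN into a finite state machine before symbolic regression yields Python code. The sketch below illustrates only the flavor of that first step, under strong simplifying assumptions: a hand-built parity "RNN" is run on random bit strings, its hidden states are discretized (here by sign, standing in for clustering of real hidden vectors), and a transition table is read off. Nothing here reproduces the actual MIPS method or its symbolic-regression stage.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def rnn_step(h: float, x: int) -> float:
    """Hand-built recurrent update that tracks the parity of a bit string.
    The hidden state stays near +1 or -1 and flips sign whenever x == 1."""
    return float(np.tanh(3.0 * h * (1 - 2 * x)))

# Collect (hidden state, input, next hidden state) transitions over random strings.
transitions = []
for _ in range(200):
    h = 1.0
    for x in rng.integers(0, 2, size=8):
        h_next = rnn_step(h, int(x))
        transitions.append((h, int(x), h_next))
        h = h_next

# Discretize hidden states by sign (a stand-in for clustering real hidden vectors).
def state(h: float) -> int:
    return 0 if h > 0 else 1

# Read off the finite-state-machine transition table.
table = defaultdict(set)
for h, x, h_next in transitions:
    table[(state(h), x)].add(state(h_next))

for (s, x), nxt in sorted(table.items()):
    print(f"state {s} --{x}--> {sorted(nxt)}")   # each entry is deterministic: a parity automaton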