Mechanistic Interpretability for Progress Towards Quantitative AI Safety

Bibliographic Details
Main Author: Lad, Vedang K.
Other Authors: Tegmark, Max
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s); https://creativecommons.org/licenses/by-nc-nd/4.0/
Online Access: https://hdl.handle.net/1721.1/156748
Description
In this thesis, we conduct a detailed investigation into the dynamics of neural networks, focusing on two key areas: inference stages in large language models (LLMs) and novel program synthesis methods using mechanistic interpretability. We probe the robustness of LLMs through layer-level interventions such as zero-ablation and layer swapping, revealing that these models maintain high accuracy despite such perturbations. From these results, we hypothesize distinct stages of inference in LLMs, with implications for LLM dataset curation, model optimization, and quantization. Subsequently, we introduce MIPS, a method for program synthesis that distills the operational logic of neural networks into executable Python code. By transforming an RNN into a finite state machine and applying symbolic regression, MIPS solves 32 of 62 algorithmic tasks, outperforming GPT-4 on 13 of them. This work takes a step toward enhancing the interpretability and reliability of AI systems, promising advances in our understanding and use of current and future AI capabilities. Together, these studies highlight the importance of understanding the inferential behavior of neural networks to foster more interpretable and efficient AI.
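The layer-level interventions mentioned in the description can be illustrated with a minimal sketch; this is not the thesis's code, and the toy residual stack, module names, and parameters below are assumptions made purely for illustration. On a residual architecture, zero-ablating a block amounts to dropping its contribution to the residual stream, and layer swapping exchanges the order in which two adjacent blocks are applied.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    # Stand-in for a transformer block: a residual MLP update.
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.ff(x)  # residual connection, as in transformer layers

class ToyStack(nn.Module):
    def __init__(self, d=16, n_layers=8):
        super().__init__()
        self.blocks = nn.ModuleList([ToyBlock(d) for _ in range(n_layers)])

    def forward(self, x, ablate=None, swap=None):
        order = list(range(len(self.blocks)))
        if swap is not None:          # layer swapping: apply two layers in exchanged order
            i, j = swap
            order[i], order[j] = order[j], order[i]
        for k in order:
            if k == ablate:           # zero-ablation: drop this block's output entirely,
                continue              # leaving only the residual stream
            x = self.blocks[k](x)
        return x

x = torch.randn(4, 16)
model = ToyStack()
baseline = model(x)
ablated = model(x, ablate=3)          # zero-ablate layer 3
swapped = model(x, swap=(3, 4))       # swap adjacent layers 3 and 4
print((baseline - ablated).norm(), (baseline - swapped).norm())

Similarly, the RNN-to-finite-state-machine step behind MIPS can be sketched, again as a hypothetical illustration rather than the MIPS implementation: cluster the RNN's hidden states into a small set of discrete states and read off a transition table. A running-parity task and all hyperparameters below are assumptions for the example.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def parity_batch(n=256, length=12):
    # Toy task assumed for illustration: running parity of a binary string.
    bits = torch.randint(0, 2, (n, length))
    return bits.float().unsqueeze(-1), bits.cumsum(dim=1) % 2

rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 2)
opt = torch.optim.Adam([*rnn.parameters(), *head.parameters()], lr=1e-2)
for _ in range(500):                  # quick training loop on the toy task
    x, y = parity_batch()
    h, _ = rnn(x)
    loss = nn.functional.cross_entropy(head(h).reshape(-1, 2), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Quantize hidden states into a small number of discrete FSM states.
x, y = parity_batch(512)
with torch.no_grad():
    h, _ = rnn(x)                     # shape (batch, time, hidden)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(h.reshape(-1, 8).numpy())
labels = labels.reshape(h.shape[0], h.shape[1])

# Read off a transition table: (state before symbol, symbol) -> state after symbol.
transitions = {}
for b in range(x.shape[0]):
    for t in range(1, x.shape[1]):
        key = (int(labels[b, t - 1]), int(x[b, t, 0]))
        transitions[key] = int(labels[b, t])   # a real extraction would also check consistency
print(transitions)                    # ideally a two-state parity automaton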