Interpreting and Editing Memory in Large Transformer Language Models

This thesis investigates the mechanisms of factual recall in large language models. We first apply causal interventions to identify neuron activations that are decisive in a model’s factual predictions; surprisingly, we find that factual recall corresponds to a sparse, localizable computation in the MLP weights of the GPT models we study. Harnessing this insight, we then develop methods for efficiently and surgically inserting up to 10,000 new memories into a transformer; these methods perform well in terms of both generalization and specificity. We conclude with some directions for future work.


Bibliographic Details
Main Author: Meng, Kevin
Other Authors: Andreas, Jacob D.
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156794
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Date Issued: 2024-05
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s)
License: https://creativecommons.org/licenses/by-nc-nd/4.0/