Interpreting and Editing Memory in Large Transformer Language Models

This thesis investigates the mechanisms of factual recall in large language models. We first apply causal interventions to identify neuron activations that are decisive in a model’s factual predictions; surprisingly, we find that factual recall corresponds to a sparse, localizable computation in the MLP weights of the GPT models we study. Harnessing this insight, we then develop methods for efficiently and surgically inserting up to 10,000 new memories into a transformer; these methods perform well in terms of both generalization and specificity. We conclude with some directions for future work.


Bibliographic Details
Main Author: Meng, Kevin
Other Authors: Andreas, Jacob D.
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156794
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Date Issued: 2024-05
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s)
License: https://creativecommons.org/licenses/by-nc-nd/4.0/