Interpreting and Editing Memory in Large Transformer Language Models
This thesis investigates the mechanisms of factual recall in large language models. We first apply causal interventions to identify neuron activations that are decisive in a model’s factual predictions; surprisingly, we find that factual recall corresponds to a sparse, localizable computation in the MLP weights of the GPT models we study. Harnessing this insight, we then develop methods for efficiently and surgically inserting up to 10,000 new memories into a transformer; these methods perform well in terms of both generalization and specificity. We conclude with some directions for future work.
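The "surgical insertion" of a memory described above can be illustrated with a minimal rank-one weight edit: choose the smallest update to an MLP projection matrix so that a key vector (a subject representation) maps to a new value vector (the new fact). This is a hypothetical toy sketch of the general idea, not the thesis's actual method or models; all names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for an MLP projection, a subject key, and a desired value.
W = rng.normal(size=(4, 6))   # toy weight matrix (not from any real model)
k = rng.normal(size=6)        # key: representation of the subject
v = rng.normal(size=4)        # value: desired output encoding the new fact

# Minimal-norm rank-one update satisfying W_new @ k == v:
#   W_new = W + ((v - W k) k^T) / (k^T k)
W_new = W + np.outer(v - W @ k, k) / (k @ k)

print(np.allclose(W_new @ k, v))  # → True: the new memory is stored
```

Because the update is rank one and aligned with `k`, inputs nearly orthogonal to the key are barely affected, which is the intuition behind editing methods that aim for specificity alongside generalization.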
Main Author: | Meng, Kevin |
---|---|
Other Authors: | Andreas, Jacob D. |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2024 |
Thesis Date: | May 2024 |
Online Access: | https://hdl.handle.net/1721.1/156794 |
License: | Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s) |