Understanding Concept Representations and their Transformations in Transformer Models
As transformer language models are deployed in an ever wider variety of applications, developing methods to understand their internal reasoning processes becomes increasingly critical. One such category of methods, neuron labeling, identifies salient directions in a model’s internal representation space and asks which features of the input these directions represent and how those features evolve. While research using these methods has focused on finding and automating the labeling process, a prerequisite is identifying which directions are the salient ones in the model’s computation. There are theoretical arguments that the activations of the first layer of the multi-layer perceptrons (MLPs) in transformers form the salient basis for representing the information the model uses in computation, but there are currently no empirical studies comparing these internal representations to others used in prior work. This research addresses that gap by comparing several directions in the internal representation space of transformers in terms of how well they represent basic linguistic concepts we expect the model to use in computation. We find that the empirical evidence supports the theoretical arguments: the first layer of the MLP modules is the most representative basis for these concepts. We further extend this exploration by examining the connections between MLP neurons and developing a method for determining which neurons can communicate information with one another. In the process we discover specialized neurons for erasing and preserving information in the model’s hidden state and characterize this phenomenon.
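The probing comparison described in this abstract can be pictured with a small, self-contained sketch. The code below is not the thesis's code or experimental setup; it is a minimal illustration, assuming GPT-2 small, an arbitrarily chosen layer, a toy template, hand-written singular/plural noun lists, and a logistic-regression probe. It extracts first-layer MLP activations and the residual-stream state at one layer and asks on which basis the toy concept is more linearly decodable.

```python
# Illustrative sketch only (not the thesis's method): compare how well a toy
# linguistic concept (singular vs. plural nouns) can be linearly decoded from
# (a) first-layer MLP activations and (b) the residual stream of GPT-2.
# Assumptions for this example: GPT-2 small, layer 6, the template "I saw the ...",
# and the word lists below. Requires torch, transformers, scikit-learn.
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # arbitrary middle layer, chosen only for illustration

# Toy concept: singular vs. plural nouns, embedded in a fixed template.
singular = ["dog", "car", "idea", "house", "river", "book", "teacher", "song"]
plural = ["dogs", "cars", "ideas", "houses", "rivers", "books", "teachers", "songs"]

captured = {}

def save_first_mlp_layer(module, inputs, output):
    # Output of the MLP's first linear layer (c_fc); applying the GELU below
    # gives the first-layer MLP activations (the "neuron" basis).
    captured["mlp_pre"] = output.detach()

hook = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(save_first_mlp_layer)

def featurize(word):
    # Run the template and read off representations at the final token.
    enc = tokenizer(f"I saw the {word}", return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # tanh-approximate GELU matches GPT-2's "gelu_new" activation
    mlp_act = F.gelu(captured["mlp_pre"], approximate="tanh")[0, -1]
    hidden = out.hidden_states[LAYER + 1][0, -1]  # residual stream after this block
    return mlp_act.numpy(), hidden.numpy()

X_mlp, X_hid, y = [], [], []
for label, words in enumerate([singular, plural]):
    for w in words:
        m, h = featurize(w)
        X_mlp.append(m); X_hid.append(h); y.append(label)
hook.remove()

# Linear probes: on which basis is the concept more linearly decodable?
for name, X in [("MLP first-layer activations", X_mlp), ("residual stream", X_hid)]:
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, X, y, cv=4).mean()
    print(f"{name}: {acc:.2f} probe accuracy")
```

In this narrow, toy sense the basis with higher probe accuracy is the more "representative" one for the concept; a comparison of the kind the abstract describes would span many linguistic concepts and candidate bases rather than a single hand-picked example.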
Main Author: | Kearney, Matthew |
---|---|
Other Authors: | Andreas, Jacob |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/151276 |
author | Kearney, Matthew |
author2 | Andreas, Jacob |
collection | MIT |
format | Thesis |
id | mit-1721.1/151276 |
institution | Massachusetts Institute of Technology |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
degree | M.Eng. |
date issued | 2023-06 |
rights | In Copyright - Educational Use Permitted; Copyright retained by author(s); https://rightsstatements.org/page/InC-EDU/1.0/ |
file format | application/pdf |
url | https://hdl.handle.net/1721.1/151276 |