Understanding Concept Representations and their Transformations in Transformer Models

As transformer language models are adopted in an ever wider range of applications, methods for understanding their internal reasoning processes become increasingly important. One family of such methods, neuron labeling, identifies salient directions in a model's internal representation space and asks which features of the input those directions represent and how they evolve across layers. While research on these methods has focused on finding labels and automating the labeling process, a prerequisite is identifying which directions are the salient ones in the model's computation. There are theoretical arguments that the activations of the first layer of the multi-layer perceptrons (MLPs) in transformers form the salient basis for representing the information the model uses in its computation, but no empirical studies have compared these internal representations with the others used in prior work. This thesis addresses that gap by comparing several directions in the internal representation space of transformers according to how well they represent basic linguistic concepts we expect the model to use in its computation. We find that the empirical evidence supports the theoretical arguments: the first layer of the MLP modules is the most representative basis for these concepts. We further extend this exploration by examining the connections between MLP neurons and developing a method for determining which neurons have the potential to communicate information with one another. In the process, we discover specialized neurons that erase and preserve information in the model's hidden state, and we characterize this phenomenon.

Bibliographic Details
Main Author: Kearney, Matthew
Other Authors: Andreas, Jacob
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/151276
Rights: In Copyright - Educational Use Permitted (https://rightsstatements.org/page/InC-EDU/1.0/)
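To make the abstract's central object concrete, the following is a minimal, illustrative sketch of extracting first-layer MLP activations (the candidate "salient basis" discussed above) from a pretrained transformer. It assumes GPT-2 loaded through the Hugging Face transformers library and hooks each MLP's activation-function module; the model choice, module paths, and example sentence are assumptions for illustration and are not drawn from the thesis itself.

```python
# Illustrative sketch (not the thesis's code): capture the activations of the
# first MLP layer in every transformer block, i.e. the per-neuron values that
# neuron-labeling methods inspect. Assumes GPT-2 via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any GPT-2-style model with blocks in model.h
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

mlp_acts = {}  # layer index -> tensor of shape (batch, seq_len, 4 * hidden_size)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `module` is the MLP's activation function, so `output` holds the
        # post-nonlinearity activations of the first MLP layer.
        mlp_acts[layer_idx] = output.detach()
    return hook

handles = [
    block.mlp.act.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.h)
]

text = "The quick brown fox jumps over the lazy dog"  # arbitrary example input
with torch.no_grad():
    model(**tokenizer(text, return_tensors="pt"))

for handle in handles:
    handle.remove()

print({i: tuple(act.shape) for i, act in mlp_acts.items()})
# e.g. {0: (1, 9, 3072), 1: (1, 9, 3072), ...} for the 12 blocks of GPT-2 small
```

Each captured tensor holds one value per MLP neuron per token; neuron-labeling methods of the kind surveyed in the abstract then ask which features of the input these per-neuron activations track.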