Understanding Concept Representations and their Transformations in Transformer Models

As transformer language models are adopted in an ever wider range of applications, methods for understanding their internal reasoning processes become increasingly important. One family of such methods, neuron labeling, identifies salient directions in a model's internal representation space and asks which features of the input those directions represent and how they evolve across layers. While research on these methods has focused on finding labels and automating the labeling process, a prerequisite is identifying which directions are the salient ones in the model's computation. There are theoretical arguments that the activations of the first layer of the multi-layer perceptrons (MLPs) in transformers form the salient basis for representing the information the model uses in its computation, but no empirical studies have compared these internal representations with the others used in prior work. This thesis addresses that gap by comparing several directions in the internal representation space of transformers according to how well they represent basic linguistic concepts we expect the model to use in its computation. We find that the empirical evidence supports the theoretical arguments: the first layer of the MLP modules is the most representative basis for these concepts. We further extend this exploration by examining the connections between MLP neurons and developing a method for determining which neurons have the potential to communicate information with one another. In the process, we discover specialized neurons that erase and preserve information in the model's hidden state, and we characterize this phenomenon.

Bibliographic Details
Main Author: Kearney, Matthew
Other Authors: Andreas, Jacob
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/151276
Rights: In Copyright - Educational Use Permitted (https://rightsstatements.org/page/InC-EDU/1.0/)
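To make the abstract's central object concrete, the following is a minimal, illustrative sketch of extracting first-layer MLP activations (the candidate "salient basis" discussed above) from a pretrained transformer. It assumes GPT-2 loaded through the Hugging Face transformers library and hooks each MLP's activation-function module; the model choice, module paths, and example sentence are assumptions for illustration and are not drawn from the thesis itself.

```python
# Illustrative sketch (not the thesis's code): capture the activations of the
# first MLP layer in every transformer block, i.e. the per-neuron values that
# neuron-labeling methods inspect. Assumes GPT-2 via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any GPT-2-style model with blocks in model.h
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

mlp_acts = {}  # layer index -> tensor of shape (batch, seq_len, 4 * hidden_size)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `module` is the MLP's activation function, so `output` holds the
        # post-nonlinearity activations of the first MLP layer.
        mlp_acts[layer_idx] = output.detach()
    return hook

handles = [
    block.mlp.act.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.h)
]

text = "The quick brown fox jumps over the lazy dog"  # arbitrary example input
with torch.no_grad():
    model(**tokenizer(text, return_tensors="pt"))

for handle in handles:
    handle.remove()

print({i: tuple(act.shape) for i, act in mlp_acts.items()})
# e.g. {0: (1, 9, 3072), 1: (1, 9, 3072), ...} for the 12 blocks of GPT-2 small
```

Each captured tensor holds one value per MLP neuron per token; neuron-labeling methods of the kind surveyed in the abstract then ask which features of the input these per-neuron activations track.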