Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Full description

This work presents a linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to ‘memorize’ sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
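The analysis summarized above rests on per-word surprisal, -log p(word | context), computed from pre-trained language models and regressed against human reading times. As a rough illustration of the quantity involved (not the authors' pipeline), the following minimal sketch computes per-token surprisal with the Hugging Face transformers API; the model name (assumed to be the smallest GPT-Neo variant on the hub) and the sentence are illustrative.

    # Minimal sketch: per-token surprisal -log2 p(w_t | w_<t) from a causal LM.
    # Assumes Hugging Face transformers; model name and sentence are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "EleutherAI/gpt-neo-125m"  # smallest of the five GPT-Neo variants
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    model.eval()

    text = "The senator who the reporter attacked admitted the error."
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

    # The prediction at position t-1 scores the token at position t,
    # so the first token receives no surprisal estimate.
    log2_probs = torch.log_softmax(logits[0, :-1], dim=-1) / torch.log(torch.tensor(2.0))
    targets = ids[0, 1:]
    surprisals = -log2_probs[torch.arange(targets.numel()), targets]

    for token, s in zip(tokenizer.convert_ids_to_tokens(targets), surprisals):
        print(f"{token!r:>14}  {s.item():6.2f} bits")

In studies of this kind, subword surprisals are typically summed to the word level before entering regressions of reading times, and a model's perplexity on a corpus is the exponentiated average surprisal, which is the sense in which the abstract relates perplexity log-linearly to regression fit.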

Bibliographic Details
Main Authors: Byung-Doh Oh, William Schuler
Format: Article
Language: English
Published: The MIT Press, 2023-01-01
Series: Transactions of the Association for Computational Linguistics, Volume 11, pp. 336–350
DOI: 10.1162/tacl_a_00548
ISSN: 2307-387X
Author Affiliations: Department of Linguistics, The Ohio State University, USA (oh.531@osu.edu; schuler.77@osu.edu)
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00548/115371/Why-Does-Surprisal-From-Larger-Transformer-Based