Long Sequence Transformer Variants on Varying Context Length

Transformers are powerful and effective tools in natural language processing, but their scalability is limited by the quadratic complexity of attention. Several transformer variants that address this problem have recently been proposed, including Moving Average Equipped Gated Attention (Mega). In th...

Bibliographic Details
Main Author: Sun, Melinda
Other Authors: Kim, Yoon
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/152839