LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence

Efficient Transformer models typically employ local and global attention, or hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, when applied to longer sequences, these models sacrifice efficiency, accuracy, or compatibility. To retain the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be applied directly to an existing pre-trained model, in this paper we propose multi-level local attention (Mulla attention): a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooled sequences of different granularity, enabling long-range modeling at linear or log-linear complexity. We apply Mulla attention to LongT5 to obtain the LongT5-Mulla sequence-to-sequence model, introducing no new parameters beyond positional embeddings. Experiments show that the model surpasses all baselines, including the two original LongT5 variants, on long-text summarization with 8-16k-token inputs on the Multi-News, arXiv, and WCEP-10 datasets, improving averaged Rouge scores by at least +0.22, +0.01, and +0.52 percentage points (pp), respectively. At the same time, it effectively processes longer sequences of 16-48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and averaged Rouge scores +0.56 to +1.62 pp higher than LongT5-local. These results demonstrate that LongT5-Mulla can effectively process long sequences, extending the maximum input length for long-text tasks from 16k to 48k tokens while maintaining accuracy and efficiency.
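
The following is a minimal, illustrative sketch of the mechanism the abstract describes: each query block attends to a local window of the original key/value sequence and to local windows of several average-pooled copies of that sequence at coarser granularities, so distant context stays visible at low cost. The function name, pooling factors, and window size below are assumptions chosen for illustration, not the paper's implementation; PyTorch is used only as convenient notation.

import torch
import torch.nn.functional as F

def multi_level_local_attention(q, k, v, window=64, pool_factors=(4, 16)):
    """q, k, v: (batch, seq_len, dim) tensors; returns (batch, seq_len, dim)."""
    _, n, d = q.shape
    # Level 0 is the raw key/value sequence; further levels are average-pooled
    # copies at coarser granularity (hypothetical factors 4 and 16).
    keys, values = [k], [v]
    for p in pool_factors:
        keys.append(F.avg_pool1d(k.transpose(1, 2), p, stride=p).transpose(1, 2))
        values.append(F.avg_pool1d(v.transpose(1, 2), p, stride=p).transpose(1, 2))
    out = torch.zeros_like(q)
    # Block-wise loop for clarity; a practical implementation would use blocked
    # or banded attention kernels instead of a Python loop.
    for start in range(0, n, window):
        end = min(start + window, n)
        q_blk = q[:, start:end]                      # (batch, block, dim)
        k_parts, v_parts = [], []
        for level, (k_lvl, v_lvl) in enumerate(zip(keys, values)):
            scale = 1 if level == 0 else pool_factors[level - 1]
            lo = max(0, start // scale - window // 2)
            hi = min(k_lvl.shape[1], end // scale + window // 2)
            k_parts.append(k_lvl[:, lo:hi])
            v_parts.append(v_lvl[:, lo:hi])
        k_cat = torch.cat(k_parts, dim=1)            # local keys from all levels
        v_cat = torch.cat(v_parts, dim=1)
        attn = torch.softmax(q_blk @ k_cat.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, start:end] = attn @ v_cat
    return out

With a fixed window size and a fixed number of pooling levels, each query block attends to O(window) keys per level, so the total cost grows roughly linearly with sequence length, consistent with the linear or log-linear complexity stated in the abstract; each pooled key summarizes 4 or 16 original tokens, so the receptive field widens without attending to every position.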

Bibliographic Details
Main Author: Le Zhou
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Efficient transformer; long-range modeling; natural language processing; sequence-to-sequence model; text summarization
Online Access: https://ieeexplore.ieee.org/document/10348571/
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3340854
Volume/Pages: IEEE Access, vol. 11, pp. 138433-138444, article 10348571
Author Affiliation: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Author ORCID: https://orcid.org/0009-0005-2869-7844
Collection: DOAJ (Directory of Open Access Journals)