LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence
Efficient Transformer models typically employ local and global attention methods, or utilize hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, these models sacrifice either efficiency, accuracy, or compatibility when applied to longer sequences. To maintain both the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be easily applied to an existing pre-trained model, in this paper we propose multi-level local attention (Mulla attention), a hierarchical local attention that acts on both the input sequence and multiple pooled sequences of different granularity simultaneously, thus performing long-range modeling while maintaining linear or log-linear complexity. We apply Mulla attention to LongT5 and implement our LongT5-Mulla sequence-to-sequence model, without introducing new parameters except for positional embeddings. Experiments show that our model surpasses all baseline models, including two original variants of LongT5, on the 8~16k-input long text summarization task on the Multi-News, arXiv and WCEP-10 datasets, with improvements of at least +0.22, +0.01 and +0.52 percentage points (pp) in averaged Rouge scores respectively, while at the same time effectively processing longer sequences of 16~48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and +0.56~1.62 pp higher averaged Rouge scores than LongT5-local. These results demonstrate that our proposed LongT5-Mulla model can effectively process long sequences and extend the maximum input length for long text tasks from 16k to 48k while maintaining accuracy and efficiency. An illustrative sketch of the Mulla attention pattern is given after the record fields below.
Main Author: | Le Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Efficient transformer; long-range modeling; natural language processing; sequence-to-sequence model; text summarization |
Online Access: | https://ieeexplore.ieee.org/document/10348571/ |
_version_ | 1797376263155875840 |
---|---|
author | Le Zhou |
collection | DOAJ |
description | Efficient Transformer models typically employ local and global attention methods, or utilize hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, these models sacrifice either efficiency, accuracy, or compatibility when applied to longer sequences. To maintain both the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be easily applied to an existing pre-trained model, in this paper we propose multi-level local attention (Mulla attention), a hierarchical local attention that acts on both the input sequence and multiple pooled sequences of different granularity simultaneously, thus performing long-range modeling while maintaining linear or log-linear complexity. We apply Mulla attention to LongT5 and implement our LongT5-Mulla sequence-to-sequence model, without introducing new parameters except for positional embeddings. Experiments show that our model surpasses all baseline models, including two original variants of LongT5, on the 8~16k-input long text summarization task on the Multi-News, arXiv and WCEP-10 datasets, with improvements of at least +0.22, +0.01 and +0.52 percentage points (pp) in averaged Rouge scores respectively, while at the same time effectively processing longer sequences of 16~48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and +0.56~1.62 pp higher averaged Rouge scores than LongT5-local. These results demonstrate that our proposed LongT5-Mulla model can effectively process long sequences and extend the maximum input length for long text tasks from 16k to 48k while maintaining accuracy and efficiency. |
first_indexed | 2024-03-08T19:36:59Z |
format | Article |
id | doaj.art-0c034170c69a4c89afbfb83b7665bd8d |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-08T19:36:59Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Record doaj.art-0c034170c69a4c89afbfb83b7665bd8d (updated 2023-12-26T00:08:39Z). Le Zhou (ORCID: https://orcid.org/0009-0005-2869-7844), Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. "LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence," IEEE Access, vol. 11, pp. 138433-138444, 2023-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2023.3340854 (IEEE document 10348571). Online: https://ieeexplore.ieee.org/document/10348571/. Topics: Efficient transformer; long-range modeling; natural language processing; sequence-to-sequence model; text summarization. |
title | LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence |
topic | Efficient transformer long-range modeling natural language processing sequence-to-sequence model text summarization |
url | https://ieeexplore.ieee.org/document/10348571/ |
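To make the description above concrete, the following is a minimal, illustrative PyTorch sketch of a multi-level local attention pattern in the spirit of the abstract: each query attends to a local window of the raw token sequence and to local windows of mean-pooled copies of the sequence at coarser granularities. The window size, pooling rates, mean pooling, and the dense masked-score formulation are assumptions made here for readability; they are not taken from the paper, whose block-local implementation is what achieves the stated linear or log-linear complexity.

```python
import torch
import torch.nn.functional as F


def mean_pool(x, rate):
    """Mean-pool non-overlapping blocks of `rate` tokens into a coarser sequence."""
    b, n, d = x.shape
    pad = (-n) % rate
    if pad:
        x = F.pad(x, (0, 0, 0, pad))  # pad the sequence dimension on the right
    return x.reshape(b, -1, rate, d).mean(dim=2)


def mulla_attention(q, k, v, window=8, pool_rates=(4, 16)):
    """Each query attends to a local window on the raw key/value sequence (level 0)
    and to local windows on mean-pooled key/value sequences at coarser levels.
    `window` and `pool_rates` are illustrative choices, not the paper's settings."""
    b, n, d = q.shape
    scale = d ** -0.5
    scores, values = [], []
    for rate in (1,) + tuple(pool_rates):
        k_lvl = k if rate == 1 else mean_pool(k, rate)
        v_lvl = v if rate == 1 else mean_pool(v, rate)
        m = k_lvl.shape[1]
        # Token positions of queries and (approximate) centres of pooled key blocks.
        q_pos = torch.arange(n).unsqueeze(1)                      # (n, 1)
        k_pos = torch.arange(m).unsqueeze(0) * rate + rate // 2   # (1, m)
        # Locality mask for this level: coarser levels see a proportionally wider
        # span of the original sequence, which is what provides long range.
        local = (q_pos - k_pos).abs() <= window * rate            # (n, m)
        s = torch.einsum("bnd,bmd->bnm", q, k_lvl) * scale
        s = s.masked_fill(~local.to(s.device), float("-inf"))
        scores.append(s)
        values.append(v_lvl)
    # One joint softmax across all levels, then a weighted sum of the matching values.
    # Note: this sketch builds dense masked score matrices for clarity, which is
    # quadratic; an efficient version would compute only the unmasked blocks.
    attn = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, torch.cat(values, dim=1))


# Toy usage: raw level plus two pooled levels over a 128-token sequence.
x = torch.randn(2, 128, 64)
out = mulla_attention(x, x, x)
print(out.shape)  # torch.Size([2, 128, 64])
```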