LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence

Efficient Transformer models typically employ local and global attention, or hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, when applied to longer sequences, these models sacrifice efficiency, accuracy, or compatibility. To retain the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be applied directly to an existing pre-trained model, in this paper we propose multi-level local attention (Mulla attention): a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooled sequences of different granularity, enabling long-range modeling at linear or log-linear complexity. We apply Mulla attention to LongT5 to obtain the LongT5-Mulla sequence-to-sequence model, introducing no new parameters beyond positional embeddings. Experiments show that the model surpasses all baselines, including the two original LongT5 variants, on long-text summarization with 8-16k-token inputs on the Multi-News, arXiv, and WCEP-10 datasets, improving averaged Rouge scores by at least +0.22, +0.01, and +0.52 percentage points (pp), respectively. At the same time, it effectively processes longer sequences of 16-48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and averaged Rouge scores +0.56 to +1.62 pp higher than LongT5-local. These results demonstrate that LongT5-Mulla can effectively process long sequences, extending the maximum input length for long-text tasks from 16k to 48k tokens while maintaining accuracy and efficiency.
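
The following is a minimal, illustrative sketch of the mechanism the abstract describes: each query block attends to a local window of the original key/value sequence and to local windows of several average-pooled copies of that sequence at coarser granularities, so distant context stays visible at low cost. The function name, pooling factors, and window size below are assumptions chosen for illustration, not the paper's implementation; PyTorch is used only as convenient notation.

import torch
import torch.nn.functional as F

def multi_level_local_attention(q, k, v, window=64, pool_factors=(4, 16)):
    """q, k, v: (batch, seq_len, dim) tensors; returns (batch, seq_len, dim)."""
    _, n, d = q.shape
    # Level 0 is the raw key/value sequence; further levels are average-pooled
    # copies at coarser granularity (hypothetical factors 4 and 16).
    keys, values = [k], [v]
    for p in pool_factors:
        keys.append(F.avg_pool1d(k.transpose(1, 2), p, stride=p).transpose(1, 2))
        values.append(F.avg_pool1d(v.transpose(1, 2), p, stride=p).transpose(1, 2))
    out = torch.zeros_like(q)
    # Block-wise loop for clarity; a practical implementation would use blocked
    # or banded attention kernels instead of a Python loop.
    for start in range(0, n, window):
        end = min(start + window, n)
        q_blk = q[:, start:end]                      # (batch, block, dim)
        k_parts, v_parts = [], []
        for level, (k_lvl, v_lvl) in enumerate(zip(keys, values)):
            scale = 1 if level == 0 else pool_factors[level - 1]
            lo = max(0, start // scale - window // 2)
            hi = min(k_lvl.shape[1], end // scale + window // 2)
            k_parts.append(k_lvl[:, lo:hi])
            v_parts.append(v_lvl[:, lo:hi])
        k_cat = torch.cat(k_parts, dim=1)            # local keys from all levels
        v_cat = torch.cat(v_parts, dim=1)
        attn = torch.softmax(q_blk @ k_cat.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, start:end] = attn @ v_cat
    return out

With a fixed window size and a fixed number of pooling levels, each query block attends to O(window) keys per level, so the total cost grows roughly linearly with sequence length, consistent with the linear or log-linear complexity stated in the abstract; each pooled key summarizes 4 or 16 original tokens, so the receptive field widens without attending to every position.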

Bibliographic Details
Main Author: Le Zhou
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Efficient transformer; long-range modeling; natural language processing; sequence-to-sequence model; text summarization
Online Access: https://ieeexplore.ieee.org/document/10348571/
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3340854
Volume/Pages: IEEE Access, vol. 11, pp. 138433-138444, article 10348571
Author Affiliation: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Author ORCID: https://orcid.org/0009-0005-2869-7844
Collection: DOAJ (Directory of Open Access Journals)