DANet: Temporal Action Localization with Double Attention

Temporal action localization (TAL) aims to predict action instance categories in videos and to identify their start and end times. However, existing Transformer-based backbones focus on either global or local features, resulting in a loss of information. In addition, both global and local self-attention mechanisms tend to average embeddings, thereby reducing the preservation of critical features. To address these two problems, we propose two attention mechanisms, multi-headed local self-attention (MLSA) and max-average pooling attention (MA), to extract local and global features simultaneously. In MA, max-pooling selects the most critical information from local clip embeddings instead of averaging them, while average-pooling aggregates global features. MLSA models the local temporal context. To enhance the collaboration between MA and MLSA, we combine them in the double attention block (DABlock). Finally, we propose the double attention network (DANet), composed of DABlocks and other advanced blocks. To evaluate DANet’s performance, we conduct extensive experiments on the TAL task. The results demonstrate that DANet outperforms other state-of-the-art models on all evaluated datasets. Ablation studies further demonstrate the effectiveness of the proposed MLSA and MA: compared with backbones built on convolution and on a global Transformer, the DABlock consisting of MLSA and MA achieves improvements of 8% and 0.5% in overall average mAP, respectively.
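The abstract describes the MA mechanism only at a high level: max-pooling keeps the most salient activations from local clip embeddings, while average-pooling summarizes global context. As an illustrative sketch (not the paper's actual implementation), the following NumPy snippet shows one plausible way such a max-average pooling step could be wired up; the window size, the sigmoid gate combining the two branches, and the function name are all assumptions introduced here.

```python
import numpy as np

def max_average_pooling_attention(x, local_window=3):
    """Hedged sketch of a max-average pooling attention (MA) step.

    x: (T, D) array of clip embeddings (T time steps, D channels).
    Local branch: sliding-window max-pooling keeps the most critical
    local activations instead of averaging them. Global branch:
    average-pooling over all T steps aggregates global context.
    How DANet actually fuses the branches is not given in the abstract;
    here we simply gate the max-pooled features with a sigmoid of the
    global average (an assumption for illustration).
    """
    T, D = x.shape
    pad = local_window // 2
    # Edge-pad in time so the output keeps length T.
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Local branch: per-step max over the surrounding window.
    local_max = np.stack([xp[t:t + local_window].max(axis=0) for t in range(T)])
    # Global branch: average over the whole sequence.
    global_avg = x.mean(axis=0, keepdims=True)   # shape (1, D)
    # Fuse: global context modulates the local maxima channel-wise.
    gate = 1.0 / (1.0 + np.exp(-global_avg))     # sigmoid, shape (1, D)
    return local_max * gate                      # shape (T, D)
```

For a sequence of 8 clips with 4-dimensional embeddings, `max_average_pooling_attention(x)` returns an array of the same (8, 4) shape, so the block can be stacked or combined with a local self-attention branch as the DABlock design suggests.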

Bibliographic Details
Main Authors: Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Yanchun Liang, Yi Liang, Liupu Wang, You Zhou
Format: Article
Language: English
Published: MDPI AG, 2023-06-01
Series: Applied Sciences
Subjects: temporal action localization, computer vision, artificial intelligence, attention mechanism
Online Access: https://www.mdpi.com/2076-3417/13/12/7176
Collection: Directory of Open Access Journals (DOAJ)
Record ID: doaj.art-26fa63a8976c4b95adf58f1de2a3d6e0
ISSN: 2076-3417
DOI: 10.3390/app13127176
Volume 13, Issue 12, Article 7176
Author Affiliations:
Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Liupu Wang, You Zhou: Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
Yanchun Liang: School of Computer Science, Zhuhai College of Science and Technology, Zhuhai 519041, China
Yi Liang: College of Business and Administration, Jilin University, Changchun 130012, China