DANet: Temporal Action Localization with Double Attention
Temporal action localization (TAL) aims to predict action instance categories in videos and to identify their start and end times. However, existing Transformer-based backbones focus only on global or only on local features, resulting in a loss of information. In addition, both global and local self-attention mechanisms tend to average embeddings, thereby reducing the preservation of critical features.
Main Authors: | Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Yanchun Liang, Yi Liang, Liupu Wang, You Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-06-01 |
Series: | Applied Sciences |
Subjects: | temporal action localization; computer vision; artificial intelligence; attention mechanism |
Online Access: | https://www.mdpi.com/2076-3417/13/12/7176 |
_version_ | 1797596218973487104 |
---|---|
author | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou |
author_facet | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou |
author_sort | Jianing Sun |
collection | DOAJ |
description | Temporal action localization (TAL) aims to predict action instance categories in videos and to identify their start and end times. However, existing Transformer-based backbones focus only on global or only on local features, resulting in a loss of information. In addition, both global and local self-attention mechanisms tend to average embeddings, thereby reducing the preservation of critical features. To address these two problems, we propose two attention mechanisms, multi-headed local self-attention (MLSA) and max-average pooling attention (MA), to extract local and global features simultaneously. In MA, max-pooling selects the most critical information from local clip embeddings instead of averaging embeddings, and average-pooling aggregates global features. We use MLSA to model local temporal context. To enhance collaboration between MA and MLSA, we propose the double attention block (DABlock), comprising MA and MLSA. Building on the DABlock, we propose the double attention network (DANet), composed of DABlocks and other advanced blocks. To evaluate DANet’s performance, we conduct extensive experiments on the TAL task. Experimental results demonstrate that DANet outperforms other state-of-the-art models on all datasets. Finally, ablation studies demonstrate the effectiveness of the proposed MLSA and MA. Compared with backbones built on convolution and on a global Transformer, the DABlock consisting of MLSA and MA performs better, improving the overall average mAP by 8% and 0.5%, respectively. |
first_indexed | 2024-03-11T02:48:30Z |
format | Article |
id | doaj.art-26fa63a8976c4b95adf58f1de2a3d6e0 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T02:48:30Z |
publishDate | 2023-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-26fa63a8976c4b95adf58f1de2a3d6e0; 2023-11-18T09:10:12Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; published 2023-06-01; vol. 13, no. 12, art. 7176; doi:10.3390/app13127176; DANet: Temporal Action Localization with Double Attention. Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Liupu Wang, and You Zhou: Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; Yanchun Liang: School of Computer Science, Zhuhai College of Science and Technology, Zhuhai 519041, China; Yi Liang: College of Business and Administration, Jilin University, Changchun 130012, China. https://www.mdpi.com/2076-3417/13/12/7176. Keywords: temporal action localization; computer vision; artificial intelligence; attention mechanism |
spellingShingle | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou; DANet: Temporal Action Localization with Double Attention; Applied Sciences; temporal action localization; computer vision; artificial intelligence; attention mechanism |
title | DANet: Temporal Action Localization with Double Attention |
title_full | DANet: Temporal Action Localization with Double Attention |
title_fullStr | DANet: Temporal Action Localization with Double Attention |
title_full_unstemmed | DANet: Temporal Action Localization with Double Attention |
title_short | DANet: Temporal Action Localization with Double Attention |
title_sort | danet temporal action localization with double attention |
topic | temporal action localization; computer vision; artificial intelligence; attention mechanism |
url | https://www.mdpi.com/2076-3417/13/12/7176 |
work_keys_str_mv | AT jianingsun danettemporalactionlocalizationwithdoubleattention AT xuanwu danettemporalactionlocalizationwithdoubleattention AT yubinxiao danettemporalactionlocalizationwithdoubleattention AT chunguowu danettemporalactionlocalizationwithdoubleattention AT yanchunliang danettemporalactionlocalizationwithdoubleattention AT yiliang danettemporalactionlocalizationwithdoubleattention AT liupuwang danettemporalactionlocalizationwithdoubleattention AT youzhou danettemporalactionlocalizationwithdoubleattention |
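The abstract above describes the two branches concretely enough to sketch: MA combines max-pooling over local clip embeddings (to keep critical features) with average-pooling over the whole sequence (to aggregate global context), MLSA restricts multi-head self-attention to a local temporal window, and a DABlock runs both on the same input. The following is a minimal PyTorch sketch of that structure, not the paper's implementation: the window size, head count, the gating form of MA, and the sum-plus-residual fusion are all assumptions.

```python
# Minimal sketch of a DABlock as described in the abstract.
# Window size, head count, MA's gating form, and sum fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaxAvgPoolingAttention(nn.Module):
    """MA branch (assumed form): max-pooling picks the most salient feature
    in each local window, average-pooling summarizes the global context,
    and the two are mixed into a per-position attention gate."""

    def __init__(self, dim: int, window: int = 9):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        b, t, d = x.shape
        # Local max-pooling over clip embeddings keeps critical features
        # instead of averaging them away.
        local_max = F.max_pool1d(
            x.transpose(1, 2), self.window, stride=1,
            padding=self.window // 2).transpose(1, 2)
        # Global average-pooling aggregates context over the whole sequence.
        global_avg = x.mean(dim=1, keepdim=True).expand(-1, t, -1)
        gate = torch.sigmoid(self.proj(torch.cat([local_max, global_avg], -1)))
        return x * gate


class MultiHeadLocalSelfAttention(nn.Module):
    """MLSA branch (assumed form): standard multi-head self-attention with
    a band mask so each time step attends only to its local neighborhood."""

    def __init__(self, dim: int, heads: int = 4, window: int = 9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        # True entries are masked out: pairs farther apart than window // 2.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window // 2
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class DABlock(nn.Module):
    """Double attention block: run MLSA and MA on the same normalized input
    and fuse the branch outputs with a residual connection (fusion by
    summation is an assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlsa = MultiHeadLocalSelfAttention(dim)
        self.ma = MaxAvgPoolingAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.mlsa(h) + self.ma(h)


if __name__ == "__main__":
    feats = torch.randn(2, 96, 128)   # 96 clip embeddings of dimension 128
    print(DABlock(128)(feats).shape)  # torch.Size([2, 96, 128])
```

A DANet backbone would stack several such blocks over per-clip video features; the stacking depth and the "other advanced blocks" mentioned in the abstract are not specified here.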