DANet: Temporal Action Localization with Double Attention
Temporal action localization (TAL) aims to predict action instance categories in videos and to identify their start and end times. However, existing Transformer-based backbones focus only on global or only on local features, resulting in a loss of information. In addition, both global and local self-attention mechanisms tend to average embeddings, thereby reducing the preservation of critical features.
Main Authors: | Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Yanchun Liang, Yi Liang, Liupu Wang, You Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-06-01 |
Series: | Applied Sciences |
Subjects: | temporal action localization; computer vision; artificial intelligence; attention mechanism |
Online Access: | https://www.mdpi.com/2076-3417/13/12/7176 |
_version_ | 1797596218973487104 |
---|---|
author | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou |
author_facet | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou |
author_sort | Jianing Sun |
collection | DOAJ |
description | Temporal action localization (TAL) aims to predict action instance categories in videos and to identify their start and end times. However, existing Transformer-based backbones focus only on global or only on local features, resulting in a loss of information. In addition, both global and local self-attention mechanisms tend to average embeddings, thereby reducing the preservation of critical features. To address these two problems, we propose two attention mechanisms, multi-headed local self-attention (MLSA) and max-average pooling attention (MA), to extract local and global features simultaneously. In MA, max-pooling selects the most critical information from local clip embeddings instead of averaging embeddings, and average-pooling aggregates global features. We use MLSA to model local temporal context. To enhance collaboration between MA and MLSA, we propose the double attention block (DABlock), comprising MA and MLSA. Building on the DABlock, we propose the double attention network (DANet), composed of DABlocks and other advanced blocks. To evaluate DANet’s performance, we conduct extensive experiments on the TAL task. Experimental results demonstrate that DANet outperforms other state-of-the-art models on all datasets. Finally, ablation studies demonstrate the effectiveness of the proposed MLSA and MA. Compared with backbones built on convolution and on a global Transformer, the DABlock consisting of MLSA and MA performs better, improving the overall average mAP by 8% and 0.5%, respectively. |
first_indexed | 2024-03-11T02:48:30Z |
format | Article |
id | doaj.art-26fa63a8976c4b95adf58f1de2a3d6e0 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T02:48:30Z |
publishDate | 2023-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-26fa63a8976c4b95adf58f1de2a3d6e0; 2023-11-18T09:10:12Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; published 2023-06-01; vol. 13, no. 12, art. 7176; doi:10.3390/app13127176; DANet: Temporal Action Localization with Double Attention. Jianing Sun, Xuan Wu, Yubin Xiao, Chunguo Wu, Liupu Wang, and You Zhou: Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; Yanchun Liang: School of Computer Science, Zhuhai College of Science and Technology, Zhuhai 519041, China; Yi Liang: College of Business and Administration, Jilin University, Changchun 130012, China. https://www.mdpi.com/2076-3417/13/12/7176. Keywords: temporal action localization; computer vision; artificial intelligence; attention mechanism |
spellingShingle | Jianing Sun; Xuan Wu; Yubin Xiao; Chunguo Wu; Yanchun Liang; Yi Liang; Liupu Wang; You Zhou; DANet: Temporal Action Localization with Double Attention; Applied Sciences; temporal action localization; computer vision; artificial intelligence; attention mechanism |
title | DANet: Temporal Action Localization with Double Attention |
title_full | DANet: Temporal Action Localization with Double Attention |
title_fullStr | DANet: Temporal Action Localization with Double Attention |
title_full_unstemmed | DANet: Temporal Action Localization with Double Attention |
title_short | DANet: Temporal Action Localization with Double Attention |
title_sort | danet temporal action localization with double attention |
topic | temporal action localization; computer vision; artificial intelligence; attention mechanism |
url | https://www.mdpi.com/2076-3417/13/12/7176 |
work_keys_str_mv | AT jianingsun danettemporalactionlocalizationwithdoubleattention AT xuanwu danettemporalactionlocalizationwithdoubleattention AT yubinxiao danettemporalactionlocalizationwithdoubleattention AT chunguowu danettemporalactionlocalizationwithdoubleattention AT yanchunliang danettemporalactionlocalizationwithdoubleattention AT yiliang danettemporalactionlocalizationwithdoubleattention AT liupuwang danettemporalactionlocalizationwithdoubleattention AT youzhou danettemporalactionlocalizationwithdoubleattention |
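The abstract above describes the two branches concretely enough to sketch: MA combines max-pooling over local clip embeddings (to keep critical features) with average-pooling over the whole sequence (to aggregate global context), MLSA restricts multi-head self-attention to a local temporal window, and a DABlock runs both on the same input. The following is a minimal PyTorch sketch of that structure, not the paper's implementation: the window size, head count, the gating form of MA, and the sum-plus-residual fusion are all assumptions.

```python
# Minimal sketch of a DABlock as described in the abstract.
# Window size, head count, MA's gating form, and sum fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaxAvgPoolingAttention(nn.Module):
    """MA branch (assumed form): max-pooling picks the most salient feature
    in each local window, average-pooling summarizes the global context,
    and the two are mixed into a per-position attention gate."""

    def __init__(self, dim: int, window: int = 9):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        b, t, d = x.shape
        # Local max-pooling over clip embeddings keeps critical features
        # instead of averaging them away.
        local_max = F.max_pool1d(
            x.transpose(1, 2), self.window, stride=1,
            padding=self.window // 2).transpose(1, 2)
        # Global average-pooling aggregates context over the whole sequence.
        global_avg = x.mean(dim=1, keepdim=True).expand(-1, t, -1)
        gate = torch.sigmoid(self.proj(torch.cat([local_max, global_avg], -1)))
        return x * gate


class MultiHeadLocalSelfAttention(nn.Module):
    """MLSA branch (assumed form): standard multi-head self-attention with
    a band mask so each time step attends only to its local neighborhood."""

    def __init__(self, dim: int, heads: int = 4, window: int = 9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        # True entries are masked out: pairs farther apart than window // 2.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window // 2
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class DABlock(nn.Module):
    """Double attention block: run MLSA and MA on the same normalized input
    and fuse the branch outputs with a residual connection (fusion by
    summation is an assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlsa = MultiHeadLocalSelfAttention(dim)
        self.ma = MaxAvgPoolingAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.mlsa(h) + self.ma(h)


if __name__ == "__main__":
    feats = torch.randn(2, 96, 128)   # 96 clip embeddings of dimension 128
    print(DABlock(128)(feats).shape)  # torch.Size([2, 96, 128])
```

A DANet backbone would stack several such blocks over per-clip video features; the stacking depth and the "other advanced blocks" mentioned in the abstract are not specified here.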