GLFormer: Global and Local Context Aggregation Network for Temporal Action Detection

As the core component of video analysis, Temporal Action Localization (TAL) has experienced remarkable success. However, some issues remain poorly addressed. First, most existing methods process the local context individually, without explicitly exploiting the relations between features in an action instance as a whole. Second, the duration of different actions varies widely, making it difficult to choose a proper temporal receptive field. To address these issues, this paper proposes a novel network, GLFormer, which can aggregate short, medium, and long temporal contexts. The method consists of three independent branches with different ranges of attention, whose features are then concatenated along the temporal dimension to obtain richer features. The first is multi-scale local convolution (MLC), which applies multiple 1D convolutions with varying kernel sizes to capture multi-scale context information. The second is window self-attention (WSA), which explores the relationships between features within a window. The last is global attention (GA), which establishes long-range dependencies across the full sequence. Moreover, a feature pyramid structure is designed to accommodate action instances of various durations. GLFormer achieves state-of-the-art performance on two challenging video benchmarks, reaching 67.2% AP@0.5 on THUMOS14 and 54.5% AP@0.5 on ActivityNet 1.3.
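The three-branch aggregation described in the abstract can be sketched in NumPy. This is an illustrative toy, not the authors' implementation: real MLC branches use learned 1D convolution kernels and WSA/GA use learned query/key/value projections, which are replaced here by a moving-average stand-in and identity projections; all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlc(x, kernel_sizes=(1, 3, 5)):
    """Multi-scale local convolution branch (moving-average stand-in
    for learned 1D convolutions with varying kernel sizes)."""
    T, C = x.shape
    out = np.zeros_like(x)
    for k in kernel_sizes:
        pad = k // 2
        xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        out += np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])
    return out / len(kernel_sizes)

def self_attention(q, k, v):
    # Scaled dot-product attention with identity projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def wsa(x, window=4):
    """Window self-attention branch: attend only within local windows."""
    out = np.zeros_like(x)
    for start in range(0, x.shape[0], window):
        w = x[start:start + window]
        out[start:start + window] = self_attention(w, w, w)
    return out

def ga(x):
    """Global attention branch: attend across the full sequence."""
    return self_attention(x, x, x)

def glformer_block(x, window=4):
    # Run the three branches and, as the abstract states, concatenate
    # their outputs along the temporal dimension.
    return np.concatenate([mlc(x), wsa(x, window), ga(x)], axis=0)

x = np.random.default_rng(0).normal(size=(16, 8))  # T=16 frames, C=8 features
y = glformer_block(x)
print(y.shape)  # (48, 8): three branches stacked along the temporal axis
```

In the full model this block would be repeated at multiple temporal resolutions to form the feature pyramid that handles actions of widely varying duration.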

Bibliographic Details
Main Authors: Yilong He, Yong Zhong, Lishun Wang, Jiachen Dang
Format: Article
Language: English
Published: MDPI AG, 2022-08-01
Series: Applied Sciences
Subjects: temporal action detection; computer vision; deep learning; artificial intelligence
Online Access: https://www.mdpi.com/2076-3417/12/17/8557
ISSN: 2076-3417
DOI: 10.3390/app12178557
Author affiliations: Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610081, China (all four authors)