Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization
Abstract In the field of human action recognition, effectively characterizing video-level spatio-temporal features is a long-standing challenge. This is attributable in part to the inability of CNNs to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with a self-attention model is developed to aggregate informative deep features across the video according to adaptively selected deep features. Moreover, a fully automatic approach to adaptive video sequences optimization (AVSO) is proposed through shot segmentation and dynamic weighted sampling; AVSO increases the proportion of action-related frames and eliminates redundant intervals. Then, based on the optimized video, a self-attention model is introduced in AST-VLAD to model the intrinsic spatio-temporal relationships of deep features, instead of aggregating frame-level features by average or max pooling. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101. Compared with existing frameworks, results show that the proposed approach matches or exceeds their classification accuracy on both the HMDB51 (73.1%) and UCF101 (96.0%) datasets.
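The two components the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`attention_vlad`, `weighted_frame_sampling`), the hard nearest-center assignment, and the use of averaged attention rows as per-frame importance are all choices made here for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_weights(feats):
    """Scaled dot-product self-attention over frame features.

    feats: (T, D) frame-level deep features. Returns a (T,) importance
    vector obtained by averaging the attention matrix over queries.
    """
    d = feats.shape[1]
    scores = feats @ feats.T / np.sqrt(d)   # (T, T) pairwise similarities
    attn = softmax(scores, axis=-1)         # each row sums to 1
    return attn.mean(axis=0)                # (T,) frame importance, sums to 1

def attention_vlad(feats, centers):
    """Aggregate frame features into a VLAD descriptor, weighting each
    frame's residual by its self-attention importance instead of using
    average or max pooling.

    feats:   (T, D) frame features
    centers: (K, D) codebook centers (assumed learned elsewhere)
    returns: (K*D,) L2-normalized VLAD vector
    """
    w = self_attention_weights(feats)                               # (T,)
    # hard-assign each frame feature to its nearest center
    dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (T, K)
    assign = dists.argmin(axis=1)                                   # (T,)
    K, D = centers.shape
    vlad = np.zeros((K, D))
    for t, k in enumerate(assign):
        vlad[k] += w[t] * (feats[t] - centers[k])  # attention-weighted residual
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

def weighted_frame_sampling(motion_scores, n_samples, rng=None):
    """Dynamic weighted sampling in the spirit of AVSO: draw frames with
    probability proportional to a per-frame action/motion score, so that
    action-rich segments are over-represented and redundant intervals
    are suppressed. Returns sorted frame indices without replacement.
    """
    rng = np.random.default_rng(rng)
    p = motion_scores / motion_scores.sum()
    idx = rng.choice(len(motion_scores), size=n_samples, replace=False, p=p)
    return np.sort(idx)
```

In this sketch the attention weights sum to one, so the VLAD aggregation reduces to ordinary average-pooled residuals when attention is uniform; the learned self-attention in the paper would instead concentrate weight on informative frames.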
Main Authors: | Zhengkui Weng, Xinmin Li, Shoujian Xiong |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2024-10-01 |
Series: | Scientific Reports |
Online Access: | https://doi.org/10.1038/s41598-024-75640-6 |
_version_ | 1826990883200827392 |
author | Zhengkui Weng; Xinmin Li; Shoujian Xiong |
author_facet | Zhengkui Weng; Xinmin Li; Shoujian Xiong |
author_sort | Zhengkui Weng |
collection | DOAJ |
description | Abstract In the field of human action recognition, effectively characterizing video-level spatio-temporal features is a long-standing challenge. This is attributable in part to the inability of CNNs to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with a self-attention model is developed to aggregate informative deep features across the video according to adaptively selected deep features. Moreover, a fully automatic approach to adaptive video sequences optimization (AVSO) is proposed through shot segmentation and dynamic weighted sampling; AVSO increases the proportion of action-related frames and eliminates redundant intervals. Then, based on the optimized video, a self-attention model is introduced in AST-VLAD to model the intrinsic spatio-temporal relationships of deep features, instead of aggregating frame-level features by average or max pooling. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101. Compared with existing frameworks, results show that the proposed approach matches or exceeds their classification accuracy on both the HMDB51 (73.1%) and UCF101 (96.0%) datasets. |
first_indexed | 2025-02-18T08:27:10Z |
format | Article |
id | doaj.art-2f21925b5c6e468d8d85d0f55be81f3c |
institution | Directory Open Access Journal |
issn | 2045-2322 |
language | English |
last_indexed | 2025-02-18T08:27:10Z |
publishDate | 2024-10-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj.art-2f21925b5c6e468d8d85d0f55be81f3c | 2024-11-03T12:25:08Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2024-10-01 | vol. 14, no. 1, pp. 1–17 | 10.1038/s41598-024-75640-6 | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization | Zhengkui Weng (School of Automation, Qingdao University); Xinmin Li (School of Mathematics & Statistics, Qingdao University); Shoujian Xiong (Zhejiang Lancoo Technology Co., Ltd) | https://doi.org/10.1038/s41598-024-75640-6 |
spellingShingle | Zhengkui Weng; Xinmin Li; Shoujian Xiong | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization | Scientific Reports |
title | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization |
title_full | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization |
title_fullStr | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization |
title_full_unstemmed | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization |
title_short | Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization |
title_sort | action recognition using attention based spatio temporal vlad networks and adaptive video sequences optimization |
url | https://doi.org/10.1038/s41598-024-75640-6 |
work_keys_str_mv | AT zhengkuiweng actionrecognitionusingattentionbasedspatiotemporalvladnetworksandadaptivevideosequencesoptimization AT xinminli actionrecognitionusingattentionbasedspatiotemporalvladnetworksandadaptivevideosequencesoptimization AT shoujianxiong actionrecognitionusingattentionbasedspatiotemporalvladnetworksandadaptivevideosequencesoptimization |