Two-Level Attention Module Based on Spurious-3D Residual Networks for Human Action Recognition


Bibliographic Details
Main Authors: Bo Chen, Fangzhou Meng, Hongying Tang, Guanjun Tong
Format: Article
Language: English
Published: MDPI AG, 2023-02-01
Series: Sensors
Subjects: action recognition; attention mechanism; spatiotemporal features; CNNs
Online Access: https://www.mdpi.com/1424-8220/23/3/1707
Collection: DOAJ (Directory of Open Access Journals)
Description: In recent years, deep learning techniques have excelled in video action recognition. However, commonly used video action recognition models underweight the differing importance of individual video frames, and of spatial regions within specific frames, which makes it difficult for them to adequately extract spatiotemporal features from video data. In this paper, an action recognition method based on improved residual convolutional neural networks (CNNs) with video frame and spatial attention modules is proposed to address this problem. Using the video frame attention module and the spatial attention module, the network can guide what and where to emphasize or suppress at essentially negligible computational cost. It employs a two-level attention module to weight feature information along the temporal and spatial dimensions, respectively, highlighting the more important frames in the overall video sequence and the more important spatial regions within specific frames. Specifically, we create the video frame and spatial attention maps by successively applying the video frame attention module and the spatial attention module, which aggregate the spatial and temporal dimensions of the intermediate feature maps of the CNNs into feature descriptors, thus directing the network to focus on important video frames and on the spatial regions that contribute most. The experimental results further show that the network performs well on the UCF-101 and HMDB-51 datasets.
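The two-level mechanism the abstract describes (per-frame weights along the temporal dimension, then a per-frame spatial map) can be sketched roughly as follows. This is an illustrative, parameter-free NumPy sketch of the general idea, not the authors' implementation: attention modules of this family (e.g., CBAM-style) normally pass the pooled descriptors through small learned layers before the sigmoid, which is omitted here, and all function names are invented for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_attention(feat):
    # feat: (T, C, H, W) intermediate feature maps for T video frames.
    # Aggregate the channel and spatial dimensions of each frame into
    # scalar descriptors, then turn them into one weight per frame.
    avg = feat.mean(axis=(1, 2, 3))            # (T,) average-pooled descriptor
    mx = feat.max(axis=(1, 2, 3))              # (T,) max-pooled descriptor
    weights = sigmoid(avg + mx)                # (T,) per-frame weights in (0, 1)
    return feat * weights[:, None, None, None]  # reweight whole frames

def spatial_attention(feat):
    # Aggregate along the channel dimension to produce one H x W map per
    # frame, highlighting the more important spatial regions.
    avg = feat.mean(axis=1, keepdims=True)     # (T, 1, H, W)
    mx = feat.max(axis=1, keepdims=True)       # (T, 1, H, W)
    attn = sigmoid(avg + mx)                   # (T, 1, H, W) spatial map
    return feat * attn                         # reweight spatial positions

def two_level_attention(feat):
    # Apply the frame-level module first, then the spatial module,
    # mirroring the sequential arrangement described in the abstract.
    return spatial_attention(frame_attention(feat))

feats = np.random.rand(8, 64, 14, 14)  # e.g., 8 frames, 64 channels, 14x14 maps
out = two_level_attention(feats)
print(out.shape)  # (8, 64, 14, 14): attention only reweights, shapes are unchanged
```

Because both stages multiply the features by values in (0, 1), the output preserves the input shape while suppressing less informative frames and regions; in a real network the reweighted maps would feed the next residual block.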
Record ID: doaj.art-c7ac5c7ee46d4028b229e73d8889de74
ISSN: 1424-8220
Volume 23, Issue 3, Article 1707
DOI: 10.3390/s23031707
Affiliations: Bo Chen, Fangzhou Meng, Hongying Tang, and Guanjun Tong are with the Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
Keywords: action recognition; attention mechanism; spatiotemporal features; CNNs