Two-Level Attention Module Based on Spurious-3D Residual Networks for Human Action Recognition


Bibliographic Details
Main Authors: Bo Chen, Fangzhou Meng, Hongying Tang, Guanjun Tong
Format: Article
Language: English
Published: MDPI AG, 2023-02-01
Series: Sensors
Subjects: action recognition; attention mechanism; spatiotemporal features; CNNs
Online Access: https://www.mdpi.com/1424-8220/23/3/1707
Collection: DOAJ (Directory of Open Access Journals)
Description: In recent years, deep learning techniques have excelled in video action recognition. However, commonly used video action recognition models underweight the differing importance of individual video frames, and of spatial regions within specific frames, which makes it difficult for them to adequately extract spatiotemporal features from video data. In this paper, an action recognition method based on improved residual convolutional neural networks (CNNs) with video frame and spatial attention modules is proposed to address this problem. Using the video frame attention module and the spatial attention module, the network can guide what and where to emphasize or suppress at essentially negligible computational cost. It employs a two-level attention module to weight feature information along the temporal and spatial dimensions, respectively, highlighting the more important frames in the overall video sequence and the more important spatial regions within specific frames. Specifically, we create the video frame and spatial attention maps by successively applying the video frame attention module and the spatial attention module, which aggregate the spatial and temporal dimensions of the intermediate feature maps of the CNNs into feature descriptors, thus directing the network to focus on important video frames and on the spatial regions that contribute most. The experimental results further show that the network performs well on the UCF-101 and HMDB-51 datasets.
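The two-level mechanism the abstract describes (per-frame weights along the temporal dimension, then a per-frame spatial map) can be sketched roughly as follows. This is an illustrative, parameter-free NumPy sketch of the general idea, not the authors' implementation: attention modules of this family (e.g., CBAM-style) normally pass the pooled descriptors through small learned layers before the sigmoid, which is omitted here, and all function names are invented for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_attention(feat):
    # feat: (T, C, H, W) intermediate feature maps for T video frames.
    # Aggregate the channel and spatial dimensions of each frame into
    # scalar descriptors, then turn them into one weight per frame.
    avg = feat.mean(axis=(1, 2, 3))            # (T,) average-pooled descriptor
    mx = feat.max(axis=(1, 2, 3))              # (T,) max-pooled descriptor
    weights = sigmoid(avg + mx)                # (T,) per-frame weights in (0, 1)
    return feat * weights[:, None, None, None]  # reweight whole frames

def spatial_attention(feat):
    # Aggregate along the channel dimension to produce one H x W map per
    # frame, highlighting the more important spatial regions.
    avg = feat.mean(axis=1, keepdims=True)     # (T, 1, H, W)
    mx = feat.max(axis=1, keepdims=True)       # (T, 1, H, W)
    attn = sigmoid(avg + mx)                   # (T, 1, H, W) spatial map
    return feat * attn                         # reweight spatial positions

def two_level_attention(feat):
    # Apply the frame-level module first, then the spatial module,
    # mirroring the sequential arrangement described in the abstract.
    return spatial_attention(frame_attention(feat))

feats = np.random.rand(8, 64, 14, 14)  # e.g., 8 frames, 64 channels, 14x14 maps
out = two_level_attention(feats)
print(out.shape)  # (8, 64, 14, 14): attention only reweights, shapes are unchanged
```

Because both stages multiply the features by values in (0, 1), the output preserves the input shape while suppressing less informative frames and regions; in a real network the reweighted maps would feed the next residual block.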
Record ID: doaj.art-c7ac5c7ee46d4028b229e73d8889de74
ISSN: 1424-8220
Volume 23, Issue 3, Article 1707
DOI: 10.3390/s23031707
Affiliations: Bo Chen, Fangzhou Meng, Hongying Tang, and Guanjun Tong are with the Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
Keywords: action recognition; attention mechanism; spatiotemporal features; CNNs