STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. The traditional 2D convolution has low computational complexity, but it cannot capture the temporal relationships. Although the 3D convolution can obtain...


Bibliographic Details
Main Authors: Zhaoqilin Yang, Gaoyun An, Ruichen Zhang
Format: Article
Language: English
Published: MDPI AG 2022-09-01
Series: Mathematics
Subjects: spatio-temporal features; shift operation; action recognition; 2D convolution
Online Access: https://www.mdpi.com/2227-7390/10/18/3290
author Zhaoqilin Yang
Gaoyun An
Ruichen Zhang
collection DOAJ
description The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. Traditional 2D convolution has low computational complexity, but it cannot capture temporal relationships. Although 3D convolution can achieve good performance, it incurs both high computational complexity and a large number of parameters. In this paper, we propose a plug-and-play Spatio-Temporal Shift Module (STSM) that is both effective and high-performing. STSM can be easily inserted into other networks to enhance their ability to learn spatio-temporal features, effectively improving performance without increasing the number of parameters or the computational complexity. In particular, when 2D CNNs and STSM are integrated, the new network may learn spatio-temporal features and outperform networks based on 3D convolutions. We revisit the shift operation from the perspective of matrix algebra, i.e., the spatio-temporal shift operation is a convolution operation with a sparse convolution kernel. Furthermore, we extensively evaluate the proposed module on the Kinetics-400 and Something-Something V2 datasets. The experimental results show the effectiveness of the proposed STSM, and the proposed action recognition networks may also achieve state-of-the-art results on the two action recognition benchmarks.
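The description's claim that a shift is a convolution with a sparse kernel can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it covers only a one-dimensional, zero-padded shift along a single axis (for example, the frame axis of one channel), and the function names `shift_zero_pad` and `shift_as_convolution` and the `radius` parameter are hypothetical names chosen for illustration.

```python
import numpy as np


def shift_zero_pad(x, offset):
    """Shift a 1D signal by `offset` positions, filling vacated slots with zeros."""
    out = np.zeros_like(x)
    if offset > 0:
        out[offset:] = x[:-offset]   # move values toward higher indices
    elif offset < 0:
        out[:offset] = x[-offset:]   # move values toward lower indices
    else:
        out[:] = x                   # zero offset leaves the signal unchanged
    return out


def shift_as_convolution(x, offset, radius=1):
    """Express the same shift as a convolution with a one-hot (sparse) kernel.

    Assumes |offset| <= radius and len(x) >= 2 * radius + 1.
    """
    kernel = np.zeros(2 * radius + 1, dtype=x.dtype)
    kernel[radius + offset] = 1      # a single non-zero tap selects the neighbour
    return np.convolve(x, kernel, mode="same")


# Hypothetical per-frame activations of one channel across 8 frames.
frames = np.arange(1.0, 9.0)
for offset in (-1, 0, 1):
    assert np.array_equal(
        shift_zero_pad(frames, offset),
        shift_as_convolution(frames, offset),
    )
```

Because the kernel has a single non-zero entry fixed at 1, this shift has no learnable parameters and needs no multiplications in practice, which is consistent with the description's statement that the module adds neither parameters nor computational complexity.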
first_indexed 2024-03-09T23:16:10Z
format Article
id doaj.art-f294555d85864f068fd21abb87e99c96
institution Directory of Open Access Journals
issn 2227-7390
language English
last_indexed 2024-03-09T23:16:10Z
publishDate 2022-09-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-f294555d85864f068fd21abb87e99c96
updated 2023-11-23T17:36:02Z
citation Mathematics (MDPI AG), ISSN 2227-7390, 2022-09-01, vol. 10, no. 18, art. 3290
doi 10.3390/math10183290
affiliation Zhaoqilin Yang: Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
Gaoyun An: Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
Ruichen Zhang: School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
title STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
topic spatio-temporal features
shift operation
action recognition
2D convolution
url https://www.mdpi.com/2227-7390/10/18/3290