STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. Traditional 2D convolution has low computational complexity but cannot capture temporal relationships. Although 3D convolution can obtain...

Bibliographic Details
Main Authors:	Zhaoqilin Yang, Gaoyun An, Ruichen Zhang
Format:	Article
Language:	English
Published:	MDPI AG 2022-09-01
Series:	Mathematics
Subjects:	spatio-temporal features; shift operation; action recognition; 2D convolution
Online Access:	https://www.mdpi.com/2227-7390/10/18/3290

_version_	1827659196381790208
author	Zhaoqilin Yang; Gaoyun An; Ruichen Zhang
author_facet	Zhaoqilin Yang; Gaoyun An; Ruichen Zhang
author_sort	Zhaoqilin Yang
collection	DOAJ
description	The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. Traditional 2D convolution has low computational complexity but cannot capture temporal relationships. Although 3D convolution can achieve good performance, it comes with both high computational complexity and a large number of parameters. In this paper, we propose a plug-and-play Spatio-Temporal Shift Module (STSM), which is both an effective and a high-performance module. STSM can be easily inserted into other networks to add or enhance the ability of the network to learn spatio-temporal features, effectively improving performance without increasing the number of parameters or the computational complexity. In particular, when 2D CNNs and STSM are integrated, the new network may learn spatio-temporal features and outperform networks based on 3D convolutions. We revisit the shift operation from the perspective of matrix algebra: the spatio-temporal shift operation is a convolution operation with a sparse convolution kernel. Furthermore, we extensively evaluate the proposed module on the Kinetics-400 and Something-Something V2 datasets. The experimental results show the effectiveness of the proposed STSM, and the proposed action recognition networks may also achieve state-of-the-art results on the two action recognition benchmarks.
first_indexed	2024-03-09T23:16:10Z
format	Article
id	doaj.art-f294555d85864f068fd21abb87e99c96
institution	Directory of Open Access Journals
issn	2227-7390
language	English
last_indexed	2024-03-09T23:16:10Z
publishDate	2022-09-01
publisher	MDPI AG
record_format	Article
series	Mathematics
spelling	doaj.art-f294555d85864f068fd21abb87e99c96 | 2023-11-23T17:36:02Z | eng | MDPI AG | Mathematics | ISSN 2227-7390 | 2022-09-01 | vol. 10, no. 18, art. 3290 | DOI 10.3390/math10183290 | STSM: Spatio-Temporal Shift Module for Efficient Action Recognition | Zhaoqilin Yang (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China); Gaoyun An (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China); Ruichen Zhang (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China) | https://www.mdpi.com/2227-7390/10/18/3290 | spatio-temporal features; shift operation; action recognition; 2D convolution
spellingShingle	Zhaoqilin Yang; Gaoyun An; Ruichen Zhang | STSM: Spatio-Temporal Shift Module for Efficient Action Recognition | Mathematics | spatio-temporal features; shift operation; action recognition; 2D convolution
title	STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
title_full	STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
title_fullStr	STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
title_full_unstemmed	STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
title_short	STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
title_sort	stsm spatio temporal shift module for efficient action recognition
topic	spatio-temporal features; shift operation; action recognition; 2D convolution
url	https://www.mdpi.com/2227-7390/10/18/3290
work_keys_str_mv	AT zhaoqilinyang stsmspatiotemporalshiftmoduleforefficientactionrecognition AT gaoyunan stsmspatiotemporalshiftmoduleforefficientactionrecognition AT ruichenzhang stsmspatiotemporalshiftmoduleforefficientactionrecognition

Similar Items