STSM: Spatio-Temporal Shift Module for Efficient Action Recognition
The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. Traditional 2D convolution has low computational complexity, but it cannot capture temporal relationships. Although 3D convolution can obtain...
Main Authors: | Zhaoqilin Yang, Gaoyun An, Ruichen Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-09-01 |
Series: | Mathematics |
Subjects: | spatio-temporal features, shift operation, action recognition, 2D convolution |
Online Access: | https://www.mdpi.com/2227-7390/10/18/3290 |
_version_ | 1827659196381790208 |
---|---|
author | Zhaoqilin Yang; Gaoyun An; Ruichen Zhang |
author_facet | Zhaoqilin Yang; Gaoyun An; Ruichen Zhang |
author_sort | Zhaoqilin Yang |
collection | DOAJ |
description | The modeling, computational complexity, and accuracy of spatio-temporal models are the three major foci in the field of video action recognition. Traditional 2D convolution has low computational complexity but cannot capture temporal relationships. 3D convolution can achieve good performance, but at the cost of high computational complexity and a large number of parameters. In this paper, we propose a plug-and-play Spatio-Temporal Shift Module (STSM) that is both effective and high-performing. STSM can be easily inserted into other networks to add or enhance the ability to learn spatio-temporal features, improving performance without increasing the number of parameters or the computational complexity. In particular, when STSM is integrated into 2D CNNs, the resulting network can learn spatio-temporal features and may outperform networks based on 3D convolutions. We revisit the shift operation from the perspective of matrix algebra: the spatio-temporal shift operation is a convolution operation with a sparse convolution kernel. Furthermore, we extensively evaluate the proposed module on the Kinetics-400 and Something-Something V2 datasets. The experimental results show the effectiveness of the proposed STSM, and the proposed action recognition networks may also achieve state-of-the-art results on the two action recognition benchmarks. |
first_indexed | 2024-03-09T23:16:10Z |
format | Article |
id | doaj.art-f294555d85864f068fd21abb87e99c96 |
institution | Directory of Open Access Journals |
issn | 2227-7390 |
language | English |
last_indexed | 2024-03-09T23:16:10Z |
publishDate | 2022-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Mathematics |
spelling | doaj.art-f294555d85864f068fd21abb87e99c96; 2023-11-23T17:36:02Z; eng; MDPI AG; Mathematics; 2227-7390; 2022-09-01; 10; 18; 3290; 10.3390/math10183290; STSM: Spatio-Temporal Shift Module for Efficient Action Recognition; Zhaoqilin Yang (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China); Gaoyun An (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China); Ruichen Zhang (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China); https://www.mdpi.com/2227-7390/10/18/3290; spatio-temporal features; shift operation; action recognition; 2D convolution |
spellingShingle | Zhaoqilin Yang; Gaoyun An; Ruichen Zhang; STSM: Spatio-Temporal Shift Module for Efficient Action Recognition; Mathematics; spatio-temporal features; shift operation; action recognition; 2D convolution |
title | STSM: Spatio-Temporal Shift Module for Efficient Action Recognition |
title_sort | stsm spatio temporal shift module for efficient action recognition |
topic | spatio-temporal features; shift operation; action recognition; 2D convolution |
url | https://www.mdpi.com/2227-7390/10/18/3290 |
work_keys_str_mv | AT zhaoqilinyang stsmspatiotemporalshiftmoduleforefficientactionrecognition AT gaoyunan stsmspatiotemporalshiftmoduleforefficientactionrecognition AT ruichenzhang stsmspatiotemporalshiftmoduleforefficientactionrecognition |
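
The description states that the shift operation can be viewed as a convolution with a sparse kernel. The following is a minimal one-dimensional sketch of that idea, not the authors' implementation: it assumes a single channel of per-frame values, a shift along the time axis only, and zero padding at the boundaries. The function names (`shift_zero_pad`, `conv1d_same`, `one_hot_kernel`) are hypothetical and chosen for illustration.

```python
def shift_zero_pad(frames, offset):
    """Shift a sequence by `offset` time steps; vacated slots become 0.

    offset > 0 moves values toward later frames, offset < 0 toward earlier ones.
    """
    n = len(frames)
    return [frames[t - offset] if 0 <= t - offset < n else 0.0 for t in range(n)]


def conv1d_same(frames, kernel):
    """Same-size 1-D convolution with zero padding (kernel length must be odd)."""
    n = len(frames)
    radius = len(kernel) // 2
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            src = t + radius - j  # input index paired with kernel tap j
            if 0 <= src < n:
                acc += w * frames[src]
        out.append(acc)
    return out


def one_hot_kernel(offset, radius=1):
    """Sparse kernel: a single 1 at the tap that reproduces the given shift."""
    kernel = [0.0] * (2 * radius + 1)
    kernel[radius + offset] = 1.0
    return kernel


frames = [1.0, 2.0, 3.0, 4.0, 5.0]
for offset in (-1, 0, 1):
    # The direct shift and the sparse-kernel convolution agree element-wise.
    assert shift_zero_pad(frames, offset) == conv1d_same(frames, one_hot_kernel(offset))
```

Because the kernel has exactly one non-zero tap, the convolution reduces to a memory copy with no multiplications or learned weights, which is consistent with the description's claim that the module adds no parameters.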