Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module

At present, mainstream 2D convolutional neural network methods for video action recognition cannot extract the correlations between input frames, which makes it difficult for the network to capture spatio-temporal features across frames and to improve recognition accuracy. To address this, a universal spatio-temporal feature pyramid module (STFPM) is proposed. STFPM consists of a feature pyramid and a dilated convolution pyramid, and can be embedded directly into an existing 2D convolutional network to form a new action recognition network, the spatio-temporal feature pyramid net (STFP-Net). Given a multi-frame input, STFP-Net first extracts the spatial features of each frame individually and records them as the original features. The STFPM then uses matrix operations to construct a feature pyramid from the original features, and spatio-temporal features capturing both temporal and spatial correlation are extracted by applying the dilated convolution pyramid to that feature pyramid. The original features and the spatio-temporal features are fused by a weighted summation and passed on to the deeper layers of the network. Finally, the action in the video is classified by a fully connected layer. Compared with the baseline, STFP-Net introduces negligible additional parameters and computational complexity. Experimental results show that, compared with mainstream methods of recent years, STFP-Net achieves significant improvements in classification accuracy on the standard benchmark datasets UCF101 and HMDB51.
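The abstract describes the module only at a high level. As a rough illustration of the two key ideas, the NumPy sketch below (all shapes, kernel sizes, and function names are assumptions for illustration, not the authors' implementation) applies a pyramid of temporal convolutions with increasing dilation rates to per-frame features, then fuses the result with the original features by a weighted sum.

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation):
    """1D convolution along the frame axis with the given dilation.

    x: (T, C) per-frame feature vectors; w: (K,) temporal kernel shared
    across channels. Zero-padding keeps the output length equal to T.
    """
    T, C = x.shape
    K = len(w)
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += w[k] * xp[t + k * dilation]
    return out

def stfpm_sketch(x, dilations=(1, 2, 4), alpha=0.5):
    """Toy spatio-temporal pyramid: average the responses of several
    dilated temporal convolutions, then fuse with the original features
    by a weighted sum (alpha is an assumed fusion weight)."""
    w = np.array([0.25, 0.5, 0.25])        # fixed 3-tap kernel, illustrative only
    levels = [dilated_temporal_conv(x, w, d) for d in dilations]
    st = np.mean(levels, axis=0)            # combine the pyramid levels
    return alpha * x + (1.0 - alpha) * st   # weighted-sum fusion

x = np.random.rand(8, 16)                   # 8 frames, 16-dim features per frame
y = stfpm_sketch(x)
print(y.shape)                              # (8, 16): same shape as the input
```

Because the output has the same shape as the input, such a module can in principle be dropped between the layers of a 2D backbone without changing the rest of the network, which is the property the paper exploits.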

Bibliographic Details
Main Authors: GONG Suming, CHEN Ying
Format: Article
Language: Chinese (zho)
Published: Journal of Computer Engineering and Applications Beijing Co., Ltd., Science Press, 2022-09-01
Series: Jisuanji kexue yu tansuo
Subjects: action recognition; 2D convolution network; spatio-temporal features; feature pyramid; dilated convolution pyramid
Online Access: http://fcst.ceaj.org/fileup/1673-9418/PDF/2012119.pdf
Citation: GONG Suming, CHEN Ying. Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module. Jisuanji kexue yu tansuo, 2022, 16(9): 2061-2067. ISSN: 1673-9418. DOI: 10.3778/j.issn.1673-9418.2012119
Affiliation: Key Laboratory of Advanced Process Control for Light Industry, Ministry of Education, Jiangnan University, Wuxi, Jiangsu 214122, China