Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks

A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using t...


Bibliographic Details
Main Authors:	Jun-Hwa Kim, Chee Sun Won
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Convolutional neural network (CNN); action recognition; video analysis; two-stream convolutional neural networks
Online Access:	https://ieeexplore.ieee.org/document/9047853/

_version_	1819120765109272576
author	Jun-Hwa Kim; Chee Sun Won
author_facet	Jun-Hwa Kim; Chee Sun Won
author_sort	Jun-Hwa Kim
collection	DOAJ
description	A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames selected at three different times in the video sequence are converted into grayscale images and are assigned to three R(red), G(green), and B(blue) channels, respectively, to form a Stacked Grayscale 3-channel Image (SG3I). Then, the pre-trained 2D CNN is fine-tuned by SG3Is for the temporal stream CNN. Therefore, only pre-trained 2D CNNs are used for both spatial and temporal streams. To learn long-range temporal motions in videos, we can use multiple SG3Is by partitioning the video shot into sub-shots and a single SG3I is generated for each sub-shot. Experimental results show that our two-stream CNN with the proposed SG3Is is about 14.6 times faster than the first version of the two-stream CNN with the optical flow, and yet achieves a similar recognition accuracy for UCF-101 and a 5.7% better result for HMDB-51.
first_indexed	2024-12-22T06:25:52Z
format	Article
id	doaj.art-b02a5074f90e4f158385bb1e8979b6ee
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-22T06:25:52Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-b02a5074f90e4f158385bb1e8979b6ee | 2022-12-21T18:35:51Z | eng | IEEE | IEEE Access | 2169-3536 | 2020-01-01 | 8 | 60179 | 60188 | 10.1109/ACCESS.2020.2983427 | 9047853 | Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks | Jun-Hwa Kim (0) | Chee Sun Won (1) | https://orcid.org/0000-0002-3400-0792 | Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea | Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea | A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames selected at three different times in the video sequence are converted into grayscale images and are assigned to three R(red), G(green), and B(blue) channels, respectively, to form a Stacked Grayscale 3-channel Image (SG3I). Then, the pre-trained 2D CNN is fine-tuned by SG3Is for the temporal stream CNN. Therefore, only pre-trained 2D CNNs are used for both spatial and temporal streams. To learn long-range temporal motions in videos, we can use multiple SG3Is by partitioning the video shot into sub-shots and a single SG3I is generated for each sub-shot. Experimental results show that our two-stream CNN with the proposed SG3Is is about 14.6 times faster than the first version of the two-stream CNN with the optical flow, and yet achieves a similar recognition accuracy for UCF-101 and a 5.7% better result for HMDB-51. | https://ieeexplore.ieee.org/document/9047853/ | Convolutional neural network (CNN) | action recognition | video analysis | two-stream convolutional neural networks
spellingShingle	Jun-Hwa Kim | Chee Sun Won | Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks | IEEE Access | Convolutional neural network (CNN) | action recognition | video analysis | two-stream convolutional neural networks
title	Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks
topic	Convolutional neural network (CNN); action recognition; video analysis; two-stream convolutional neural networks
url	https://ieeexplore.ieee.org/document/9047853/
work_keys_str_mv	AT junhwakim actionrecognitioninvideosusingpretrained2dconvolutionalneuralnetworks AT cheesunwon actionrecognitioninvideosusingpretrained2dconvolutionalneuralnetworks
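
Illustration of the SG3I construction described in the abstract. This is a minimal sketch, not the authors' implementation. It shows how three frames taken at different times could be converted to grayscale and stacked as the R, G, and B channels of one image, and how a video could be partitioned into sub-shots with one SG3I each. The function names (to_gray, make_sg3i, sg3is_for_video), the BT.601 grayscale weights, the equal-length sub-shots, and the choice of the first, middle, and last frame of each sub-shot are all assumptions; the abstract does not specify them.

```python
import numpy as np

# ITU-R BT.601 luma weights. Assumption: the abstract does not say which
# grayscale conversion is used.
LUMA = np.array([0.299, 0.587, 0.114])


def to_gray(frame_rgb):
    """Convert one (H, W, 3) RGB frame to an (H, W) grayscale image."""
    return frame_rgb.astype(np.float64) @ LUMA


def make_sg3i(frames, indices):
    """Build one Stacked Grayscale 3-channel Image (SG3I).

    frames  : array of shape (T, H, W, 3)
    indices : three frame indices, in temporal order
    The grayscale images of the three frames become the R, G, B channels.
    """
    if len(indices) != 3:
        raise ValueError("an SG3I needs exactly three frame indices")
    channels = [to_gray(frames[i]) for i in indices]
    return np.stack(channels, axis=-1)  # shape (H, W, 3)


def sg3is_for_video(frames, num_subshots):
    """Partition the video into sub-shots and build one SG3I per sub-shot.

    Assumption (hypothetical sampling rule): sub-shots have roughly equal
    length, and each SG3I uses the first, middle, and last frame of its
    sub-shot.
    """
    bounds = np.linspace(0, len(frames), num_subshots + 1).astype(int)
    sg3is = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end - start < 3:
            raise ValueError("each sub-shot needs at least three frames")
        mid = (start + end - 1) // 2
        sg3is.append(make_sg3i(frames, (start, mid, end - 1)))
    return sg3is
```

Each resulting array has the same (H, W, 3) layout as an ordinary RGB image, which is what allows it to be passed to a 2D CNN pre-trained on RGB images for fine-tuning, as the abstract proposes for the temporal stream.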
