Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals
This paper proposes a novel deep neural network model for spatio-temporal action detection, which localizes all action regions and classifies the corresponding actions in an untrimmed video. The proposed model uses a spatio-temporal region proposal method to detect multiple action regions effectively. First, in the temporal region proposal, anchor boxes are generated over regions expected to contain actions. Unlike conventional temporal region proposal methods, the proposed method uses a complementary two-stage approach to detect the temporal extents of actions that occur asynchronously. In addition, a spatial region proposal process is used to detect the principal agent performing an action among the people appearing in a video. Furthermore, coarse-level features, which capture comprehensive information about the whole video, have been widely used in previous action-detection studies; however, they cannot provide detailed information about each person performing an action. To overcome this limitation, the proposed model additionally learns fine-level features from the proposed action tubes. Experiments on the LIRIS-HARL and UCF-101 datasets confirm the high performance and effectiveness of the proposed model.
Main Authors: | Yeongtaek Song, Incheol Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2019-03-01 |
Series: | Sensors |
Subjects: | video action detection; region proposal; spatio-temporal action detection; recurrent neural network |
Online Access: | http://www.mdpi.com/1424-8220/19/5/1085 |
author | Yeongtaek Song Incheol Kim |
collection | DOAJ |
description | This paper proposes a novel deep neural network model for spatio-temporal action detection, which localizes all action regions and classifies the corresponding actions in an untrimmed video. The proposed model uses a spatio-temporal region proposal method to detect multiple action regions effectively. First, in the temporal region proposal, anchor boxes are generated over regions expected to contain actions. Unlike conventional temporal region proposal methods, the proposed method uses a complementary two-stage approach to detect the temporal extents of actions that occur asynchronously. In addition, a spatial region proposal process is used to detect the principal agent performing an action among the people appearing in a video. Furthermore, coarse-level features, which capture comprehensive information about the whole video, have been widely used in previous action-detection studies; however, they cannot provide detailed information about each person performing an action. To overcome this limitation, the proposed model additionally learns fine-level features from the proposed action tubes. Experiments on the LIRIS-HARL and UCF-101 datasets confirm the high performance and effectiveness of the proposed model. |
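The temporal region proposal described in the abstract can be illustrated with a minimal, hypothetical sketch. This is not the authors' implementation: the function names, the anchor scales and stride, and the per-frame "actionness" score used for filtering are all assumptions made for illustration.

```python
def generate_temporal_anchors(num_frames, scales=(16, 32, 64), stride=8):
    """Slide fixed-length temporal windows (anchor boxes) over the video
    timeline at several scales, yielding candidate (start, end) intervals."""
    anchors = []
    for scale in scales:
        start = 0
        while start + scale <= num_frames:
            anchors.append((start, start + scale))
            start += stride
    return anchors

def filter_by_actionness(anchors, actionness, threshold=0.5):
    """Keep anchors whose mean per-frame actionness score exceeds a
    threshold, i.e. regions expected to contain an action."""
    kept = []
    for start, end in anchors:
        mean_score = sum(actionness[start:end]) / (end - start)
        if mean_score >= threshold:
            kept.append((start, end))
    return kept

# Example: a 64-frame clip where an action occupies the second half.
anchors = generate_temporal_anchors(64, scales=(16, 32), stride=16)
actionness = [0.0] * 32 + [1.0] * 32
proposals = filter_by_actionness(anchors, actionness, threshold=0.9)
```

In a two-stage scheme like the one the abstract mentions, proposals such as these would then be refined (boundary regression and rescoring) by a second network before spatial proposals pick out the acting person within each interval.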
first_indexed | 2024-04-12T19:36:24Z |
format | Article |
id | doaj.art-6b6514d0ced348239c8107fdff08c3b8 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
last_indexed | 2024-04-12T19:36:24Z |
publishDate | 2019-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-6b6514d0ced348239c8107fdff08c3b8; MDPI AG; Sensors; ISSN 1424-8220; 2019-03-01; vol. 19, no. 5, art. 1085; doi:10.3390/s19051085; Yeongtaek Song (Department of Computer Science, Graduate School, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea); Incheol Kim (Department of Computer Science, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea); http://www.mdpi.com/1424-8220/19/5/1085 |
title | Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals |
topic | video action detection region proposal spatio-temporal action detection recurrent neural network |
url | http://www.mdpi.com/1424-8220/19/5/1085 |