Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals

Bibliographic Details
Main Authors: Yeongtaek Song, Incheol Kim
Author Affiliations: Department of Computer Science, Graduate School, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea; Department of Computer Science, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea
Format: Article
Language: English
Published: MDPI AG, 2019-03-01
Series: Sensors, Vol. 19, Issue 5, Article 1085
ISSN: 1424-8220
DOI: 10.3390/s19051085
Collection: Directory of Open Access Journals (DOAJ)
Subjects: video action detection; region proposal; spatio-temporal action detection; recurrent neural network
Online Access: http://www.mdpi.com/1424-8220/19/5/1085

Description
This paper proposes a novel deep neural network model for spatio-temporal action detection: localizing every action region and classifying the corresponding actions in an untrimmed video. The model uses a spatio-temporal region proposal method to detect multiple action regions effectively. First, the temporal region proposal generates anchor boxes that target regions expected to contain actions. Unlike conventional temporal region proposal methods, it uses a complementary two-stage approach to detect the temporal extents of actions that occur asynchronously. In addition, a spatial region proposal step identifies the principal agent performing an action among the people appearing in the video. Coarse-level features, which summarize the whole video, have been widely used in previous action-detection studies, but they cannot capture detailed information about each person performing an action. To overcome this limitation, the model additionally learns fine-level features from the proposed action tubes in the video. Experiments on the LIRIS-HARL and UCF-101 datasets confirm the high performance and effectiveness of the proposed deep neural network model.
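To make the temporal region proposal step more concrete, below is a minimal sketch of how candidate temporal segments could be generated from multi-scale anchors and matched against ground-truth action intervals by temporal IoU. The anchor scales, stride, positive threshold, and function names are illustrative assumptions for exposition only; they are not taken from the paper, which further refines such candidates with a complementary second stage and adds the spatial proposal and coarse/fine feature learning described above.

```python
# Minimal sketch: multi-scale temporal anchor generation and IoU-based labeling.
# All parameter values (scales, stride, IoU threshold) are illustrative
# assumptions, not settings taken from the paper.
import numpy as np

def generate_temporal_anchors(num_frames, scales=(16, 32, 64), stride=8):
    """Return an array of (start, end) frame intervals tiling the video timeline."""
    anchors = []
    for scale in scales:
        for start in range(0, max(num_frames - scale, 0) + 1, stride):
            anchors.append((start, min(start + scale, num_frames)))
    return np.array(anchors, dtype=np.float32)

def temporal_iou(anchor, gt):
    """Intersection-over-union of two 1-D intervals given as (start, end)."""
    inter = max(0.0, min(anchor[1], gt[1]) - max(anchor[0], gt[0]))
    union = (anchor[1] - anchor[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_intervals, pos_thresh=0.5):
    """Mark anchors that sufficiently overlap any ground-truth action as positives."""
    labels = np.zeros(len(anchors), dtype=np.int64)
    for i, anchor in enumerate(anchors):
        if any(temporal_iou(anchor, gt) >= pos_thresh for gt in gt_intervals):
            labels[i] = 1
    return labels

if __name__ == "__main__":
    anchors = generate_temporal_anchors(num_frames=300)
    labels = label_anchors(anchors, gt_intervals=[(40, 90), (180, 260)])
    print(f"{labels.sum()} of {len(anchors)} anchors marked as action candidates")
```

Using several anchor scales lets short and long actions both be covered by at least one candidate segment, which is the usual motivation for anchor-based temporal proposals; the paper's actual two-stage refinement of these candidates is outside the scope of this sketch.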