Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals
This paper proposes a novel deep neural network model for spatio-temporal action detection, which localizes all action regions and classifies the corresponding actions in an untrimmed video. The proposed model uses a spatio-temporal region proposal method to detect multiple action regions effectively. First, in the temporal region proposal, anchor boxes are generated over regions expected to contain actions. Unlike conventional temporal region proposal methods, the proposed method uses a complementary two-stage approach to detect the temporal extents of actions that occur asynchronously. In addition, a spatial region proposal process is used to detect the principal agent performing an action among the people appearing in a video. Furthermore, coarse-level features, which capture comprehensive information about the whole video, have been widely used in previous action-detection studies; however, they cannot provide detailed information about each person performing an action. To overcome this limitation, the proposed model additionally learns fine-level features from the proposed action tubes. Experiments on the LIRIS-HARL and UCF-101 datasets confirm the high performance and effectiveness of the proposed model.
Main Authors: | Yeongtaek Song, Incheol Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2019-03-01 |
Series: | Sensors |
Subjects: | video action detection; region proposal; spatio-temporal action detection; recurrent neural network |
Online Access: | http://www.mdpi.com/1424-8220/19/5/1085 |
author | Yeongtaek Song Incheol Kim |
collection | DOAJ |
description | This paper proposes a novel deep neural network model for spatio-temporal action detection, which localizes all action regions and classifies the corresponding actions in an untrimmed video. The proposed model uses a spatio-temporal region proposal method to detect multiple action regions effectively. First, in the temporal region proposal, anchor boxes are generated over regions expected to contain actions. Unlike conventional temporal region proposal methods, the proposed method uses a complementary two-stage approach to detect the temporal extents of actions that occur asynchronously. In addition, a spatial region proposal process is used to detect the principal agent performing an action among the people appearing in a video. Furthermore, coarse-level features, which capture comprehensive information about the whole video, have been widely used in previous action-detection studies; however, they cannot provide detailed information about each person performing an action. To overcome this limitation, the proposed model additionally learns fine-level features from the proposed action tubes. Experiments on the LIRIS-HARL and UCF-101 datasets confirm the high performance and effectiveness of the proposed model. |
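The temporal region proposal described in the abstract can be illustrated with a minimal, hypothetical sketch. This is not the authors' implementation: the function names, the anchor scales and stride, and the per-frame "actionness" score used for filtering are all assumptions made for illustration.

```python
def generate_temporal_anchors(num_frames, scales=(16, 32, 64), stride=8):
    """Slide fixed-length temporal windows (anchor boxes) over the video
    timeline at several scales, yielding candidate (start, end) intervals."""
    anchors = []
    for scale in scales:
        start = 0
        while start + scale <= num_frames:
            anchors.append((start, start + scale))
            start += stride
    return anchors

def filter_by_actionness(anchors, actionness, threshold=0.5):
    """Keep anchors whose mean per-frame actionness score exceeds a
    threshold, i.e. regions expected to contain an action."""
    kept = []
    for start, end in anchors:
        mean_score = sum(actionness[start:end]) / (end - start)
        if mean_score >= threshold:
            kept.append((start, end))
    return kept

# Example: a 64-frame clip where an action occupies the second half.
anchors = generate_temporal_anchors(64, scales=(16, 32), stride=16)
actionness = [0.0] * 32 + [1.0] * 32
proposals = filter_by_actionness(anchors, actionness, threshold=0.9)
```

In a two-stage scheme like the one the abstract mentions, proposals such as these would then be refined (boundary regression and rescoring) by a second network before spatial proposals pick out the acting person within each interval.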
first_indexed | 2024-04-12T19:36:24Z |
format | Article |
id | doaj.art-6b6514d0ced348239c8107fdff08c3b8 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
last_indexed | 2024-04-12T19:36:24Z |
publishDate | 2019-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-6b6514d0ced348239c8107fdff08c3b8; MDPI AG; Sensors; ISSN 1424-8220; 2019-03-01; vol. 19, no. 5, art. 1085; doi:10.3390/s19051085; Yeongtaek Song (Department of Computer Science, Graduate School, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea); Incheol Kim (Department of Computer Science, Kyonggi University, 154-42 Gwanggyosan-ro Yeongtong-gu, Suwon-si 16227, Korea); http://www.mdpi.com/1424-8220/19/5/1085 |
title | Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals |
topic | video action detection region proposal spatio-temporal action detection recurrent neural network |
url | http://www.mdpi.com/1424-8220/19/5/1085 |