STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video.


Bibliographic Details
Main Authors: Guoan Yang, Yong Yang, Zhengzhi Lu, Junjie Yang, Deyang Liu, Chuanbo Zhou, Zien Fan
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-01-01
Series: PLoS ONE
Online Access:https://doi.org/10.1371/journal.pone.0265115
collection DOAJ
description Most deep learning-based action recognition models focus only on short-term motion, so they often misjudge actions that are composed of multiple sub-processes, such as the long jump and high jump. Temporal Segment Networks (TSN) enable the network to capture long-term information in a video, but ignore the fact that unrelated frames or regions can also strongly interfere with action recognition. To solve this problem, a soft attention mechanism is introduced into TSN and a Spatial-Temporal Attention Temporal Segment Network (STA-TSN) is proposed, which retains the ability to capture long-term information while enabling the network to adaptively focus on key features in space and time. First, a multi-scale spatial-focus feature enhancement strategy is proposed that fuses the original convolutional features with multi-scale spatial-focus features obtained through a soft attention mechanism with spatial pyramid pooling. Second, a deep learning-based key-frame exploration module is designed, which uses a soft attention mechanism based on Long Short-Term Memory (LSTM) to adaptively learn temporal attention weights. Third, a temporal-attention regularization is developed to guide STA-TSN toward better key-frame exploration. Finally, experimental results show that the proposed STA-TSN outperforms TSN on four public datasets (UCF101, HMDB51, JHMDB, and THUMOS14) and achieves state-of-the-art results.
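As a rough illustration of the temporal soft attention described in the abstract (a minimal sketch, not the paper's exact formulation: in STA-TSN the attention scores come from an LSTM over the segments, whereas the linear scorer `w`, `b` below is a hypothetical stand-in for that learned module):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_soft_attention(segment_features, w, b):
    """Weight T segment features (a T x D array) by soft attention.

    Each segment receives a scalar score, the scores are normalized
    with softmax, and the video-level feature is the attention-weighted
    combination of the segment features.
    """
    scores = segment_features @ w + b      # one scalar score per segment
    weights = softmax(scores)              # soft attention weights, sum to 1
    fused = weights @ segment_features     # attention-weighted video feature
    return fused, weights
```

The temporal-attention regularization mentioned in the abstract would then presumably add a penalty on `weights` during training (for example, one encouraging the distribution to concentrate on a few key frames); the exact form is defined in the paper itself.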
id doaj.art-1f771825323b46f28f2d6fcfc8625921
institution Directory Open Access Journal
issn 1932-6203
citation PLoS ONE, Vol 17, Iss 3, p e0265115 (2022), doi:10.1371/journal.pone.0265115