STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video.
Most deep learning-based action recognition models focus only on short-term motion, so they often misjudge actions composed of multiple stages, such as the long jump or the high jump. Temporal Segment Networks (TSN) enable the network to capture long-term...
Main Authors: | Guoan Yang, Yong Yang, Zhengzhi Lu, Junjie Yang, Deyang Liu, Chuanbo Zhou, Zien Fan |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2022-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0265115 |
---|---|
author | Guoan Yang, Yong Yang, Zhengzhi Lu, Junjie Yang, Deyang Liu, Chuanbo Zhou, Zien Fan |
collection | DOAJ |
description | Most deep learning-based action recognition models focus only on short-term motion, so they often misjudge actions composed of multiple stages, such as the long jump or the high jump. Temporal Segment Networks (TSN) enable the network to capture long-term information in a video but ignore the fact that unrelated frames or regions can also strongly interfere with action recognition. To solve this problem, a soft attention mechanism is introduced into TSN, yielding a Spatial-Temporal Attention Temporal Segment Network (STA-TSN) that retains the ability to capture long-term information while adaptively focusing on key features in space and time. First, a multi-scale spatial focus feature enhancement strategy is proposed, which fuses the original convolutional features with multi-scale spatial focus features obtained through a soft attention mechanism with spatial pyramid pooling. Second, a deep learning-based key-frame exploration module is designed, which uses a soft attention mechanism based on Long Short-Term Memory (LSTM) to adaptively learn temporal attention weights. Third, a temporal-attention regularization is developed to guide STA-TSN toward better key-frame exploration. Finally, experiments show that STA-TSN outperforms TSN on four public datasets (UCF101, HMDB51, JHMDB, and THUMOS14) and achieves state-of-the-art results. (A hedged code sketch of this attention design appears after the record fields below.) |
format | Article |
id | doaj.art-1f771825323b46f28f2d6fcfc8625921 |
institution | Directory Open Access Journal |
issn | 1932-6203 |
language | English |
publishDate | 2022-01-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS ONE |
title | STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video. |
url | https://doi.org/10.1371/journal.pone.0265115 |
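The abstract describes three components: multi-scale spatial attention fused via spatial pyramid pooling, LSTM-driven temporal attention over TSN's segment features, and a temporal-attention regularizer. The following PyTorch sketch illustrates one plausible reading of that design; the module names, tensor shapes, pyramid scales, and the entropy-style regularizer are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of the attention design described in the abstract.
# Names, shapes, and the regularizer are assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFocus(nn.Module):
    """Multi-scale spatial attention: attend to the feature map, pool the
    attended features at several pyramid scales, and fuse them back in."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.att = nn.Conv2d(channels, 1, kernel_size=1)  # soft attention logits

    def forward(self, x):                       # x: (N, C, H, W) conv features
        a = torch.sigmoid(self.att(x))          # (N, 1, H, W) soft attention map
        fused = x                               # keep the original features
        for s in self.scales:
            pooled = F.adaptive_avg_pool2d(x * a, s)      # pyramid pooling
            fused = fused + F.interpolate(pooled, size=x.shape[-2:],
                                          mode="bilinear", align_corners=False)
        return fused

class TemporalFocus(nn.Module):
    """LSTM-based temporal attention over the K segment features of TSN."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats):                   # feats: (N, K, D) segment features
        h, _ = self.lstm(feats)                 # (N, K, hidden) hidden states
        w = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (N, K) weights
        video_feat = (w.unsqueeze(-1) * feats).sum(dim=1)    # weighted consensus
        return video_feat, w

def temporal_attention_regularizer(w, eps=1e-8):
    """Entropy penalty: adding this term to the loss pushes the temporal
    weights toward a few key frames (one plausible form; the paper defines
    its own regularization)."""
    return -(w * (w + eps).log()).sum(dim=1).mean()
```

In this reading, `SpatialFocus` is applied to each frame's CNN features, `TemporalFocus` replaces TSN's average consensus with attention-weighted pooling, and the regularizer is added to the classification loss so the temporal weights concentrate on informative segments.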