Exploiting spatio‐temporal knowledge for video action recognition

Abstract: Action recognition has been a popular area of computer vision research in recent years. The goal of this task is to recognise human actions in video frames. Most existing methods depend on visual features and their relationships within the videos. These extracted features represent only the visual information of the current video itself and cannot capture general knowledge of particular actions beyond the video. As a result, the features carry some bias, and recognition performance still requires improvement. In this study, we present a novel spatio‐temporal knowledge module (STKM) that endows current methods with commonsense knowledge. To this end, we first collect hybrid external knowledge from universal fields, containing both visual and semantic information. Graph convolutional networks (GCNs) are then used to represent and aggregate this knowledge. The GCNs involve (i) a spatial graph to capture spatial relations and (ii) a temporal graph to capture serial occurrence relations among actions. By integrating this knowledge with visual features, we obtain better recognition results. Experiments on the AVA, UCF101‐24 and JHMDB datasets show the robustness and generalisation ability of STKM. The results report a new state‐of‐the‐art 32.0 mAP on AVA v2.1. On the UCF101‐24 and JHMDB datasets, our method also improves over the baseline by 1.5 AP and 2.6 AP, respectively.
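A minimal sketch of the two-graph aggregation the abstract describes: one GCN over a spatial-relation graph and one over a temporal (serial-occurrence) graph of action nodes, whose outputs are fused with visual features. All names, dimensions, adjacency matrices and the attention-style fusion step below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an STKM-style spatio-temporal knowledge aggregation.
# Assumptions: 80 action nodes (e.g. AVA classes), identity adjacencies as
# placeholders, and a simple attention fusion with visual features.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) normalised adjacency; feats: (N, in_dim) node features
        return torch.relu(adj @ self.proj(feats))


num_actions, dim = 80, 256                   # assumed node count / feature width
spatial_adj = torch.eye(num_actions)         # placeholder spatial-relation graph
temporal_adj = torch.eye(num_actions)        # placeholder serial-occurrence graph
node_embed = torch.randn(num_actions, dim)   # hybrid visual+semantic node features

spatial_gcn, temporal_gcn = GCNLayer(dim, dim), GCNLayer(dim, dim)
knowledge = spatial_gcn(spatial_adj, node_embed) + temporal_gcn(temporal_adj, node_embed)

# Fuse aggregated knowledge with per-actor visual features (assumed fusion).
visual = torch.randn(4, dim)                 # stand-in for backbone RoI features
attn = torch.softmax(visual @ knowledge.T / dim ** 0.5, dim=-1)  # (4, N)
fused = torch.cat([visual, attn @ knowledge], dim=-1)            # (4, 2*dim)
```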

Bibliographic Details
Main Authors: Huigang Zhang, Liuan Wang, Jun Sun (Fujitsu R&D Center, Beijing, China)
Format: Article
Language: English
Published: Wiley, 2023-03-01
Series: IET Computer Vision, vol. 17, no. 2, pp. 222-230
ISSN: 1751-9632, 1751-9640
Subjects: action recognition, commonsense knowledge, GCN, STKM
Online Access: https://doi.org/10.1049/cvi2.12154