MTSCANet: Multi temporal resolution temporal semantic context aggregation network

Abstract Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing approaches that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a mu...

Bibliographic Details
Main Authors: Haiping Zhang, Conghao Ma, Dongjin Yu, Liming Guan, Dongjing Wang, Zepeng Hu, Xu Liu
Format: Article
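Language: English
Published: Wiley 2023-04-01
Series: IET Computer Vision
Subjects: computer vision; convolutional neural nets; learning (artificial intelligence); neural net architecture
Online Access: https://doi.org/10.1049/cvi2.12163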
_version_ 1797846300670033920
author Haiping Zhang
Conghao Ma
Dongjin Yu
Liming Guan
Dongjing Wang
Zepeng Hu
Xu Liu
collection DOAJ
description Abstract Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing approaches that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a multi‐temporal resolution pyramid structure model is proposed. Firstly, a temporal‐semantic context aggregation module (TSCF) is designed to assign different attention weights to temporal contexts and combine them with multi‐level semantics into video features. Secondly, to handle the large differences in time span between different actions in a video, a local‐global attention module is designed to combine local and global temporal dependencies for each temporal point, yielding a more flexible and robust representation of contextual relations. The redundant representation of the convolution kernel is reduced by modifying the convolution, and computing power is redeployed at a fine granularity. To verify the effectiveness of the model, extensive experiments are performed on three challenging datasets. On THUMOS14, the best performance is obtained at IoU@0.3–0.6, with an average mAP of 47.02%. On ActivityNet‐1.3, an average mAP of 34.94% is obtained. On HACS, an average mAP of 28.46% is achieved.
first_indexed 2024-04-09T17:52:43Z
format Article
id doaj.art-93b5a6eb10ac436bb20302eb8e426603
institution Directory Open Access Journal
issn 1751-9632
1751-9640
language English
last_indexed 2024-04-09T17:52:43Z
publishDate 2023-04-01
publisher Wiley
record_format Article
series IET Computer Vision
spelling doaj.art-93b5a6eb10ac436bb20302eb8e426603
2023-04-15T11:16:52Z
eng
Wiley
IET Computer Vision
1751-9632
1751-9640
2023-04-01
17
3
366
378
10.1049/cvi2.12163
MTSCANet: Multi temporal resolution temporal semantic context aggregation network
Haiping Zhang (0)
Conghao Ma (1)
Dongjin Yu (2)
Liming Guan (3)
Dongjing Wang (4)
Zepeng Hu (5)
Xu Liu (6)
(0) School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
(1) School of Electronics and Information, Hangzhou Dianzi University, Hangzhou, China
(2) School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
(3) School of Information Engineering, Hangzhou Dianzi University, Hangzhou, China
(4) School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
(5) School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
(6) School of Electronics and Information, Hangzhou Dianzi University, Hangzhou, China
title MTSCANet: Multi temporal resolution temporal semantic context aggregation network
topic computer vision
convolutional neural nets
learning (artificial intelligence)
neural net architecture
url https://doi.org/10.1049/cvi2.12163
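
Note on the reported metric: the abstract gives results as mAP at IoU thresholds (IoU@0.3–0.6 on THUMOS14). In temporal action localisation, IoU is normally computed between a predicted time segment and a ground-truth segment. The sketch below illustrates that overlap measure only. It is not the authors' evaluation code, and the `Segment` representation, the function names, and the 0.1 threshold step are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical representation: (start_seconds, end_seconds)
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    # Overlap is clamped at zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def hits_at_thresholds(
    pred: Segment,
    gt: Segment,
    thresholds: Sequence[float] = (0.3, 0.4, 0.5, 0.6),  # assumed 0.1 step
) -> Dict[float, bool]:
    """Report whether a prediction counts as a match at each IoU threshold."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}
```

A prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so its IoU is one third: it counts as a match at 0.3 but not at 0.4 or above. Average mAP is then the mean of the per-threshold mAP values.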