MTSCANet: Multi temporal resolution temporal semantic context aggregation network
Abstract: Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing approaches that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a mu...
Main Authors: | Haiping Zhang, Conghao Ma, Dongjin Yu, Liming Guan, Dongjing Wang, Zepeng Hu, Xu Liu |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2023-04-01 |
Series: | IET Computer Vision |
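Subjects: | computer vision; convolutional neural nets; learning (artificial intelligence); neural net architecture |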
Online Access: | https://doi.org/10.1049/cvi2.12163 |
_version_ | 1797846300670033920 |
---|---|
author | Haiping Zhang, Conghao Ma, Dongjin Yu, Liming Guan, Dongjing Wang, Zepeng Hu, Xu Liu |
author_sort | Haiping Zhang |
collection | DOAJ |
description | Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing approaches that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a multi‐temporal resolution pyramid structure model is proposed. Firstly, a temporal‐semantic context aggregation module (TSCF) is designed to assign different attention weights to temporal contexts and combine them with multi‐level semantics into video features. Secondly, to address the large differences in time span between different actions in a video, a local‐global attention module is designed to combine local and global temporal dependencies for each temporal point, yielding a more flexible and robust representation of contextual relations. The redundant representation of the convolution kernel is reduced by modifying the convolution, and computing power is reallocated at a fine granularity. To verify the effectiveness of the model, extensive experiments are performed on three challenging datasets. On THUMOS14, the best performance is obtained at IoU thresholds 0.3–0.6, with an average mAP of 47.02%. On ActivityNet‐1.3, an average mAP of 34.94% is obtained. On HACS, an average mAP of 28.46% is achieved. |
first_indexed | 2024-04-09T17:52:43Z |
format | Article |
id | doaj.art-93b5a6eb10ac436bb20302eb8e426603 |
institution | Directory Open Access Journal |
issn | 1751-9632 1751-9640 |
language | English |
last_indexed | 2024-04-09T17:52:43Z |
publishDate | 2023-04-01 |
publisher | Wiley |
record_format | Article |
series | IET Computer Vision |
spelling | doaj.art-93b5a6eb10ac436bb20302eb8e426603; 2023-04-15T11:16:52Z; eng; Wiley; IET Computer Vision; ISSN 1751-9632, 1751-9640; 2023-04-01; vol. 17, no. 3, pp. 366–378; 10.1049/cvi2.12163; MTSCANet: Multi temporal resolution temporal semantic context aggregation network; Haiping Zhang (School of Computer Science, Hangzhou Dianzi University, Hangzhou, China); Conghao Ma (School of Electronics and Information, Hangzhou Dianzi University, Hangzhou, China); Dongjin Yu (School of Computer Science, Hangzhou Dianzi University, Hangzhou, China); Liming Guan (School of Information Engineering, Hangzhou Dianzi University, Hangzhou, China); Dongjing Wang (School of Computer Science, Hangzhou Dianzi University, Hangzhou, China); Zepeng Hu (School of Computer Science, Hangzhou Dianzi University, Hangzhou, China); Xu Liu (School of Electronics and Information, Hangzhou Dianzi University, Hangzhou, China); https://doi.org/10.1049/cvi2.12163; computer vision; convolutional neural nets; learning (artificial intelligence); neural net architecture |
title | MTSCANet: Multi temporal resolution temporal semantic context aggregation network |
title_sort | mtscanet multi temporal resolution temporal semantic context aggregation network |
topic | computer vision; convolutional neural nets; learning (artificial intelligence); neural net architecture |
url | https://doi.org/10.1049/cvi2.12163 |
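
The description reports THUMOS14 results at IoU thresholds 0.3–0.6. As a rough illustration of what such a threshold measures, the sketch below computes the temporal IoU between a predicted and a ground-truth action segment and checks it against each threshold. It is not taken from the paper or from any benchmark toolkit: the function names `temporal_iou` and `matches_at`, the `(start, end)` segment format in seconds, and the 0.1 step between thresholds are all assumptions made for this example.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end), with start <= end.

    Hypothetical helper for illustration; segment units (seconds) are assumed.
    """
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_at(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6)):
    """Report, per threshold, whether the prediction overlaps gt enough.

    The threshold grid (step 0.1 over 0.3-0.6) is an assumption.
    """
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


if __name__ == "__main__":
    # A prediction covering 0-10 s against a ground-truth action at 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(matches_at((0.0, 10.0), (5.0, 15.0)))
```

A prediction counts as correct at a given threshold only if its temporal IoU with a ground-truth segment reaches that threshold; mAP is then computed per threshold and averaged, which is why stricter thresholds penalise blurred temporal boundaries more heavily.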