HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking

Siamese network-based trackers treat tracking as feature cross-correlation between the target template and the search region. Feature representation therefore plays an important role in constructing a high-performance tracker. However, all existing Siamese networks extract the deep but low-res...


Bibliographic Details
Main Authors:	Dawei Zhang, Zhonglong Zheng, Tianxiang Wang, Yiran He
Format:	Article
Language:	English
Published:	MDPI AG 2020-08-01
Series:	Sensors
Subjects:	Siamese network, high-resolution representation, multi-scale fusion, visual tracking, attention mechanisms, deformable convolution
Online Access:	https://www.mdpi.com/1424-8220/20/17/4807

_version_	1797555545362661376
author	Dawei Zhang, Zhonglong Zheng, Tianxiang Wang, Yiran He
collection	DOAJ
description	Siamese network-based trackers treat tracking as feature cross-correlation between the target template and the search region. Feature representation therefore plays an important role in constructing a high-performance tracker. However, all existing Siamese networks extract deep but low-resolution features of the entire patch, which are not robust enough to estimate the target bounding box accurately. To address this issue, we propose a novel high-resolution Siamese network that connects high-to-low resolution convolution streams in parallel and repeatedly exchanges information across resolutions to maintain high-resolution representations. A simple yet effective multi-scale feature fusion strategy makes the resulting representation semantically richer and spatially more precise. Moreover, we exploit attention mechanisms to learn object-aware masks for adaptive feature refinement, and use deformable convolution to handle complex geometric transformations. This makes the target more discriminative against distractors and background. Without bells and whistles, extensive experiments on popular tracking benchmarks, including OTB100, UAV123, VOT2018 and LaSOT, demonstrate that the proposed tracker achieves state-of-the-art performance and runs in real time, confirming its efficiency and effectiveness.
first_indexed	2024-03-10T16:49:02Z
format	Article
id	doaj.art-ad116e2aeb4f43e8a16e9ba8aa1ea0c6
institution	Directory Open Access Journal
issn	1424-8220
language	English
last_indexed	2024-03-10T16:49:02Z
publishDate	2020-08-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-ad116e2aeb4f43e8a16e9ba8aa1ea0c6 | 2023-11-20T11:23:15Z | eng | MDPI AG | Sensors | 1424-8220 | 2020-08-01 | vol. 20, no. 17, art. 4807 | 10.3390/s20174807 | HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking | Dawei Zhang, Zhonglong Zheng, Tianxiang Wang, Yiran He (all: Department of Computer Science, College of Mathematics and Computer Science, Zhejiang Normal University, No 688, Yingbin Road, Jinhua 321004, China) | https://www.mdpi.com/1424-8220/20/17/4807 | Siamese network, high-resolution representation, multi-scale fusion, visual tracking, attention mechanisms, deformable convolution
title	HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking
topic	Siamese network, high-resolution representation, multi-scale fusion, visual tracking, attention mechanisms, deformable convolution
url	https://www.mdpi.com/1424-8220/20/17/4807

Similar Items