ConvFormer: Tracking by Fusing Convolution and Transformer Features

Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, due to the Transformer’s limitations in local information acquisition and position encoding, we proposed a new tracking framework called ConvFormer to enhance the model’s...

Bibliographic Details
Main Author:	Chao Zhang
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	ConvFormer single-object tracking transformer mixed net module
Online Access:	https://ieeexplore.ieee.org/document/10176344/

author	Chao Zhang
collection	DOAJ
description	Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, due to the Transformer’s limitations in local information acquisition and position encoding, we propose a new tracking framework called ConvFormer to enhance the model’s performance. Our framework aims to improve feature extraction by combining the local feature extraction ability of CNNs with the global feature extraction ability of the Transformer. To extract and fuse features of the template and search region synchronously, we propose the Mix Net Module (MNM), which performs both global and local feature extraction and fusion for the two regions. We stack MNM modules and add a location head to complete the ConvFormer framework. Moreover, we design a post-processing module to reduce the impact of mistracking and improve the model’s robustness against interference from similar objects. Our framework achieves state-of-the-art performance on six benchmarks: OTB2015, VOT2018, GOT-10k, LaSOT, TrackingNet, and UAV123. Notably, on the TrackingNet dataset, our tracker outperforms OSTrack by 1.4% with 83.2% precision. Additionally, on the LaSOT dataset, our tracker surpasses OSTrack by 2.6% with 77.4% precision. Finally, we conduct numerous ablation experiments to validate the efficacy of our approach.
first_indexed	2024-03-12T21:41:00Z
format	Article
id	doaj.art-03d959c77adc4ebb8e48e8b31b6d61e6
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-03-12T21:41:00Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-03d959c77adc4ebb8e48e8b31b6d61e6 | 2023-07-26T23:00:26Z | eng | IEEE | IEEE Access | 2169-3536 | 2023-01-01 | 11 | 74855 | 74864 | 10.1109/ACCESS.2023.3293592 | 10176344 | ConvFormer: Tracking by Fusing Convolution and Transformer Features | Chao Zhang (https://orcid.org/0000-0001-6076-836X), Department of Computer Science, Beihang University, Beijing, China
title	ConvFormer: Tracking by Fusing Convolution and Transformer Features
topic	ConvFormer single-object tracking transformer mixed net module
url	https://ieeexplore.ieee.org/document/10176344/

Similar Items