ConvFormer: Tracking by Fusing Convolution and Transformer Features

Bibliographic Details
Main Author: Chao Zhang
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10176344/
Summary: Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, because the Transformer is limited in capturing local information and in position encoding, we propose a new tracking framework, ConvFormer, to improve model performance. Our framework strengthens feature extraction by combining the local feature extraction ability of CNNs with the global feature extraction ability of the Transformer. To extract and fuse features of the template and search region synchronously, we propose the Mix Net Module (MNM), which performs both global and local feature extraction and fusion for the two regions. We stack MNM modules and add a location head to complete the ConvFormer framework. Moreover, we design a post-processing module that reduces the impact of mistracking and improves the model's robustness to interference from similar objects. Our framework achieves state-of-the-art performance on six benchmarks: OTB2015, VOT2018, GOT-10k, LaSOT, TrackingNet, and UAV123. Notably, on TrackingNet our tracker outperforms OSTrack by 1.4% with 83.2% precision, and on LaSOT it surpasses OSTrack by 2.6% with 77.4% precision. Finally, extensive ablation experiments validate the efficacy of our approach.
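The core idea the abstract describes, joint local (convolutional) and global (attention-based) feature extraction over the concatenated template and search tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`mnm`, `local_branch`, `global_branch`), the fixed smoothing kernel, and the additive fusion are all hypothetical simplifications chosen only to show how one module can mix a local convolution branch with a global self-attention branch over both regions at once.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_branch(x, kernel):
    # Depthwise 1-D convolution over the token dimension: each token
    # is mixed only with its immediate neighbours (local features).
    n, _ = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(n):
        out[i] = sum(kernel[j] * xp[i + j] for j in range(k))
    return out

def global_branch(x):
    # Single-head self-attention: every token attends to every other
    # token (global features).
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def mnm(template, search, kernel=(0.25, 0.5, 0.25)):
    # Hypothetical Mix-Net-style block: concatenate template and search
    # tokens so both branches extract and fuse features of the two
    # regions jointly, then add the local and global branch outputs.
    tokens = np.concatenate([template, search], axis=0)
    fused = local_branch(tokens, kernel) + global_branch(tokens)
    n_t = len(template)
    return fused[:n_t], fused[n_t:]
```

In this sketch the block is shape-preserving, so stacking several `mnm` calls before a localization head (as the abstract describes for ConvFormer) is straightforward; the real model would use learned convolution weights and multi-head attention rather than a fixed kernel.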
ISSN:2169-3536