ConvFormer: Tracking by Fusing Convolution and Transformer Features
Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, because the Transformer is limited in local information acquisition and position encoding, we propose a new tracking framework called ConvFormer to enhance the model’...
Main Author: | Chao Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | ConvFormer; single-object tracking; transformer; mixed net module |
Online Access: | https://ieeexplore.ieee.org/document/10176344/ |
_version_ | 1797771674080247808 |
---|---|
author | Chao Zhang |
author_facet | Chao Zhang |
author_sort | Chao Zhang |
collection | DOAJ |
description | Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, because the Transformer is limited in local information acquisition and position encoding, we propose a new tracking framework called ConvFormer to enhance the model’s performance. Our framework aims to improve feature extraction by combining the local feature extraction ability of the CNN with the global feature extraction ability of the Transformer. To achieve synchronous feature extraction and fusion of the template and search regions, we propose the Mix Net Module (MNM), which performs both global and local feature extraction and fusion for the template and search regions. We stack MNMs and add a location head to complete the ConvFormer framework. Moreover, we design a post-processing module to reduce the impact of mistracking and improve the model’s robustness against interference from similar objects. Our framework achieves state-of-the-art performance on six benchmarks: OTB2015, VOT2018, GOT-10k, LaSOT, TrackingNet, and UAV123. Notably, on the TrackingNet dataset, our tracker outperforms OSTrack by 1.4% with 83.2% precision. Additionally, on the LaSOT dataset, our tracker surpasses OSTrack by 2.6% with 77.4% precision. Finally, we conduct numerous ablation experiments to validate the efficacy of our approach. |
first_indexed | 2024-03-12T21:41:00Z |
format | Article |
id | doaj.art-03d959c77adc4ebb8e48e8b31b6d61e6 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T21:41:00Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-03d959c77adc4ebb8e48e8b31b6d61e6; 2023-07-26T23:00:26Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; 11; 74855; 74864; 10.1109/ACCESS.2023.3293592; 10176344; ConvFormer: Tracking by Fusing Convolution and Transformer Features; Chao Zhang; 0; https://orcid.org/0000-0001-6076-836X; Department of Computer Science, Beihang University, Beijing, China; Current mainstream single-object trackers adopt the Transformer as the backbone for target tracking. However, because the Transformer is limited in local information acquisition and position encoding, we propose a new tracking framework called ConvFormer to enhance the model’s performance. Our framework aims to improve feature extraction by combining the local feature extraction ability of the CNN with the global feature extraction ability of the Transformer. To achieve synchronous feature extraction and fusion of the template and search regions, we propose the Mix Net Module (MNM), which performs both global and local feature extraction and fusion for the template and search regions. We stack MNMs and add a location head to complete the ConvFormer framework. Moreover, we design a post-processing module to reduce the impact of mistracking and improve the model’s robustness against interference from similar objects. Our framework achieves state-of-the-art performance on six benchmarks: OTB2015, VOT2018, GOT-10k, LaSOT, TrackingNet, and UAV123. Notably, on the TrackingNet dataset, our tracker outperforms OSTrack by 1.4% with 83.2% precision. Additionally, on the LaSOT dataset, our tracker surpasses OSTrack by 2.6% with 77.4% precision. Finally, we conduct numerous ablation experiments to validate the efficacy of our approach.; https://ieeexplore.ieee.org/document/10176344/; ConvFormer; single-object tracking; transformer; mixed net module |
spellingShingle | Chao Zhang; ConvFormer: Tracking by Fusing Convolution and Transformer Features; IEEE Access; ConvFormer; single-object tracking; transformer; mixed net module |
title | ConvFormer: Tracking by Fusing Convolution and Transformer Features |
title_full | ConvFormer: Tracking by Fusing Convolution and Transformer Features |
title_fullStr | ConvFormer: Tracking by Fusing Convolution and Transformer Features |
title_full_unstemmed | ConvFormer: Tracking by Fusing Convolution and Transformer Features |
title_short | ConvFormer: Tracking by Fusing Convolution and Transformer Features |
title_sort | convformer tracking by fusing convolution and transformer features |
topic | ConvFormer; single-object tracking; transformer; mixed net module |
url | https://ieeexplore.ieee.org/document/10176344/ |
work_keys_str_mv | AT chaozhang convformertrackingbyfusingconvolutionandtransformerfeatures |