An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention

Multi-target multi-camera tracking algorithms face increasing demands on re-identification accuracy and tracking reliability. This study proposed an improved end-to-end multi-target tracking algorithm that adapts to multi-view, multi-scale scenes based on the self-attention mechanism o...

Bibliographic Details
Main Authors:	Yong Hong, Deren Li, Shupei Luo, Xin Chen, Yi Yang, Mi Wang
Format:	Article
Language:	English
Published:	MDPI AG 2022-12-01
Series:	Remote Sensing
Subjects:	transformer; self-attention; multi-view multi-scale; end-to-end; multi-target tracking; raster semantic map
Online Access:	https://www.mdpi.com/2072-4292/14/24/6354

_version_	1797455423088885760
author	Yong Hong, Deren Li, Shupei Luo, Xin Chen, Yi Yang, Mi Wang
author_facet	Yong Hong, Deren Li, Shupei Luo, Xin Chen, Yi Yang, Mi Wang
author_sort	Yong Hong
collection	DOAJ
description	Multi-target multi-camera tracking algorithms face increasing demands on re-identification accuracy and tracking reliability. This study proposed an improved end-to-end multi-target tracking algorithm that adapts to multi-view, multi-scale scenes and is based on the self-attention mechanism of the transformer’s encoder–decoder structure. A multi-dimensional feature extraction backbone network was combined with a self-built raster semantic map, which was stored in the encoder for correlation and generated target position encodings and multi-dimensional feature vectors. The decoder incorporated four methods: spatial clustering and semantic filtering of multi-view targets; dynamic matching of multi-dimensional features; space–time logic-based multi-target tracking; and space–time convergence network (STCN)-based parameter passing. By fusing these decoding methods, multi-camera targets were tracked along three dimensions: temporal logic, spatial logic, and feature matching. On the MOT17 dataset, the method outperformed the current state-of-the-art method by 2.2% on the multiple object tracking accuracy (MOTA) metric. Furthermore, this study proposed, for the first time, a retrospective mechanism that adopts reverse-order processing to correct historically mislabeled targets and thereby improve the identification F1-score (IDF1). On the self-built dataset OVIT-MOT01, IDF1 improved from 0.948 to 0.967 and the multi-camera tracking accuracy (MCTA) improved from 0.878 to 0.909, indicating more accurate and reliable continuous tracking.
first_indexed	2024-03-09T15:53:15Z
format	Article
id	doaj.art-15d18f4f74784a25941d2e6e57797fa9
institution	Directory Open Access Journal
issn	2072-4292
language	English
last_indexed	2024-03-09T15:53:15Z
publishDate	2022-12-01
publisher	MDPI AG
record_format	Article
series	Remote Sensing
spelling	doaj.art-15d18f4f74784a25941d2e6e57797fa9; 2023-11-24T17:48:19Z; eng; MDPI AG; Remote Sensing; 2072-4292; 2022-12-01; vol. 14, iss. 24, art. 6354; 10.3390/rs14246354; An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
	Yong Hong: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
	Deren Li: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
	Shupei Luo: Wuhan Optics Valley Information Technology Co., Ltd., Wuhan 430068, China
	Xin Chen: Wuhan Optics Valley Information Technology Co., Ltd., Wuhan 430068, China
	Yi Yang: Wuhan Optics Valley Information Technology Co., Ltd., Wuhan 430068, China
	Mi Wang: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
title	An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
topic	transformer; self-attention; multi-view multi-scale; end-to-end; multi-target tracking; raster semantic map
url	https://www.mdpi.com/2072-4292/14/24/6354

Similar Items