Cross-scale Vision Transformer for crowd localization

Crowd localization provides the positions of individuals and the total number of people, which is valuable for security monitoring and public management, but it faces challenges from lighting, occlusion, and perspective effects. Recently, Transformers have been applied...

Bibliographic Details
Main Authors:	Shuang Liu, Yu Lian, Zhong Zhang, Baihua Xiao, Tariq S. Durrani
Format:	Article
Language:	English
Published:	Elsevier 2024-02-01
Series:	Journal of King Saud University: Computer and Information Sciences
Subjects:	Crowd localization; Multi-scale information fusion; Long-range context dependencies; Adaptive windows
Online Access:	http://www.sciencedirect.com/science/article/pii/S1319157824000612

_version_	1797272461402701824
author	Shuang Liu; Yu Lian; Zhong Zhang; Baihua Xiao; Tariq S. Durrani
author_facet	Shuang Liu; Yu Lian; Zhong Zhang; Baihua Xiao; Tariq S. Durrani
author_sort	Shuang Liu
collection	DOAJ
description	Crowd localization provides the positions of individuals and the total number of people, which is valuable for security monitoring and public management, but it faces challenges from lighting, occlusion, and perspective effects. Recently, Transformers have been applied to crowd localization to overcome these challenges. However, such methods integrate multi-scale information only once, which results in incomplete multi-scale information fusion. In this paper, we propose a novel Transformer network named Cross-scale Vision Transformer (CsViT) for crowd localization, which fuses multi-scale information in both the encoder and decoder stages while building long-range context dependencies on the combined feature maps. To this end, we design a multi-scale encoder that fuses the feature maps of multiple scales at corresponding positions to obtain the combined feature maps, and a multi-scale decoder that integrates the tokens at multiple scales when modeling the long-range context dependencies. Furthermore, we propose a Multi-scale SSIM (MsSSIM) loss to adaptively compute head regions and optimize the similarity at multiple scales. Specifically, we set adaptive windows with different scales for each head and compute the loss values within these windows to improve the accuracy of the predicted distance transform map. We perform comprehensive experiments on five public datasets, and the results validate the effectiveness of our method.
first_indexed	2024-03-07T14:29:41Z
format	Article
id	doaj.art-0af4e46606a24724a2a6eb2f48c13fa0
institution	Directory of Open Access Journals
issn	1319-1578
language	English
last_indexed	2024-03-07T14:29:41Z
publishDate	2024-02-01
publisher	Elsevier
record_format	Article
series	Journal of King Saud University: Computer and Information Sciences
spelling	doaj.art-0af4e46606a24724a2a6eb2f48c13fa0 | 2024-03-06T05:25:48Z | eng | Elsevier | Journal of King Saud University: Computer and Information Sciences | 1319-1578 | 2024-02-01 | 36 | 2 | 101972 | Cross-scale Vision Transformer for crowd localization | Shuang Liu [0]; Yu Lian [1]; Zhong Zhang [2]; Baihua Xiao [3]; Tariq S. Durrani [4] | [0] Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin, 300387, China; [1] Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin, 300387, China; [2] Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin, 300387, China (corresponding author); [3] The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; [4] Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow, Scotland, UK | Crowd localization provides the positions of individuals and the total number of people, which is valuable for security monitoring and public management, but it faces challenges from lighting, occlusion, and perspective effects. Recently, Transformers have been applied to crowd localization to overcome these challenges. However, such methods integrate multi-scale information only once, which results in incomplete multi-scale information fusion. In this paper, we propose a novel Transformer network named Cross-scale Vision Transformer (CsViT) for crowd localization, which fuses multi-scale information in both the encoder and decoder stages while building long-range context dependencies on the combined feature maps. To this end, we design a multi-scale encoder that fuses the feature maps of multiple scales at corresponding positions to obtain the combined feature maps, and a multi-scale decoder that integrates the tokens at multiple scales when modeling the long-range context dependencies. Furthermore, we propose a Multi-scale SSIM (MsSSIM) loss to adaptively compute head regions and optimize the similarity at multiple scales. Specifically, we set adaptive windows with different scales for each head and compute the loss values within these windows to improve the accuracy of the predicted distance transform map. We perform comprehensive experiments on five public datasets, and the results validate the effectiveness of our method. | http://www.sciencedirect.com/science/article/pii/S1319157824000612 | Crowd localization; Multi-scale information fusion; Long-range context dependencies; Adaptive windows
spellingShingle	Shuang Liu Yu Lian Zhong Zhang Baihua Xiao Tariq S. Durrani Cross-scale Vision Transformer for crowd localization Journal of King Saud University: Computer and Information Sciences Crowd localization Multi-scale information fusion Long-range context dependencies Adaptive windows
title	Cross-scale Vision Transformer for crowd localization
title_full	Cross-scale Vision Transformer for crowd localization
title_fullStr	Cross-scale Vision Transformer for crowd localization
title_full_unstemmed	Cross-scale Vision Transformer for crowd localization
title_short	Cross-scale Vision Transformer for crowd localization
title_sort	cross scale vision transformer for crowd localization
topic	Crowd localization; Multi-scale information fusion; Long-range context dependencies; Adaptive windows
url	http://www.sciencedirect.com/science/article/pii/S1319157824000612
work_keys_str_mv	AT shuangliu crossscalevisiontransformerforcrowdlocalization AT yulian crossscalevisiontransformerforcrowdlocalization AT zhongzhang crossscalevisiontransformerforcrowdlocalization AT baihuaxiao crossscalevisiontransformerforcrowdlocalization AT tariqsdurrani crossscalevisiontransformerforcrowdlocalization
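
Illustrative note on the description field: the abstract says the MsSSIM loss sets windows of different scales around each head and computes the loss inside those windows. The record gives no formula, so the following is only a minimal sketch of that general idea, not the authors' implementation. It uses the standard SSIM expression computed over a whole window; the function names, the per-head `sizes` argument, the `scales` multipliers, and the constants `c1` and `c2` are all assumptions made for illustration.

```python
def _mean(values):
    return sum(values) / len(values)


def ssim(a, b, c1=1e-4, c2=9e-4):
    """Standard SSIM between two equal-length flat patches.

    c1 and c2 are the usual stabilizing constants for data in [0, 1];
    they are an assumption here, not values taken from the paper.
    """
    mu_a, mu_b = _mean(a), _mean(b)
    var_a = _mean([(x - mu_a) ** 2 for x in a])
    var_b = _mean([(y - mu_b) ** 2 for y in b])
    cov = _mean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def crop(img, r, c, half):
    """Flatten the square window centred at (r, c), clipped to the image."""
    rows = img[max(0, r - half): r + half + 1]
    return [v for row in rows for v in row[max(0, c - half): c + half + 1]]


def ms_ssim_loss(pred, target, heads, sizes, scales=(1.0, 2.0)):
    """Average (1 - SSIM) over per-head windows at several scales.

    heads:  list of (row, col) head positions (hypothetical input format)
    sizes:  per-head base half-width, standing in for an adaptive window size
    scales: multipliers applied to each base size (hypothetical choice)
    """
    losses = []
    for (r, c), size in zip(heads, sizes):
        for k in scales:
            half = max(1, int(round(size * k)))
            losses.append(1.0 - ssim(crop(pred, r, c, half),
                                     crop(target, r, c, half)))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch a prediction identical to the target gives a loss of approximately zero, and the loss grows as the predicted map diverges from the target inside the head windows. How the paper actually derives window sizes from head regions is not stated in this record.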
