Multi‐dimensional weighted cross‐attention network in crowded scenes

Bibliographic Details
Main Authors: Yefan Xie, Jiangbin Zheng, Xuan Hou, Irfan Raza Naqvi, Yue Xi, Nailiang Kuang
Format: Article
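Language: English
Published: Wiley 2021-12-01
Series: IET Image Processing
Subjects: Optical, image and video signal processing; Computer vision and image processing techniques; Machine learning (artificial intelligence)
Online Access: https://doi.org/10.1049/ipr2.12298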
author Yefan Xie
Jiangbin Zheng
Xuan Hou
Irfan Raza Naqvi
Yue Xi
Nailiang Kuang
collection DOAJ
description Abstract: Human detection in crowded scenes is one component of crowd‐safety analysis, supporting applications such as emergency warning and security monitoring platforms. Although existing anchor‐free methods offer fast inference, they are poorly suited to object detection in crowded scenes because they cannot predict well‐refined bounding boxes. This work proposes an end‐to‐end anchor‐free network, the Multi‐dimensional Weighted Cross‐Attention Network (MANet), which performs real‐time human detection in crowded scenes. Specifically, the Double‐flow Weighted Feature Cascade Module (DW‐FCM) is used in the extractor to highlight the contribution of features at different levels. The Triplet Cross Attention Module (TCAM) is used in the detector head to strengthen the dependencies among multi‐dimensional features, which sharpens the discrimination of human boundary features at a fine‐grained level. In addition, the Adaptively Opposite Thrust Mapping (AOTM) strategy for ground‐truth annotation is proposed to correct the bias of erroneous mappings and reduce the number of wasted training iterations. Together, these strategies alleviate the inability of existing anchor‐free networks to correctly distinguish and locate individual humans in crowded scenes. Unlike anchor‐based detection methods, the network requires no manual setting of anchor parameters, and its detection speed meets real‐time requirements. Extensive comparative experiments on the CrowdHuman and WIDER FACE datasets show that the proposed strategies achieve state‐of‐the‐art results among anchor‐free methods.
first_indexed 2024-04-11T18:54:11Z
format Article
id doaj.art-7dad8689e7c14b8c9b305656b35d5263
institution Directory Open Access Journal
issn 1751-9659
1751-9667
language English
last_indexed 2024-04-11T18:54:11Z
publishDate 2021-12-01
publisher Wiley
record_format Article
series IET Image Processing
spelling doaj.art-7dad8689e7c14b8c9b305656b35d5263
2022-12-22T04:08:14Z
eng
Wiley
IET Image Processing, ISSN 1751-9659, 1751-9667
2021-12-01, vol. 15, no. 14, pp. 3585-3598
https://doi.org/10.1049/ipr2.12298
Multi‐dimensional weighted cross‐attention network in crowded scenes
Yefan Xie (School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, PR China)
Jiangbin Zheng (School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, PR China)
Xuan Hou (School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, PR China)
Irfan Raza Naqvi (School of Software, Northwestern Polytechnical University, Xi'an, PR China)
Yue Xi (Aeronautics Engineering College, Air Force Engineering University of PLA, Xi'an, China)
Nailiang Kuang (Xi'an Microelectronics Technology Institute, Xi'an, PR China)
Optical, image and video signal processing
Computer vision and image processing techniques
Machine learning (artificial intelligence)
title Multi‐dimensional weighted cross‐attention network in crowded scenes
topic Optical, image and video signal processing
Computer vision and image processing techniques
Machine learning (artificial intelligence)
url https://doi.org/10.1049/ipr2.12298