Non-Local Spatial and Temporal Attention Network for Video-Based Person Re-Identification
Given a video containing a person, the video-based person re-identification (Re-ID) task aims to identify the same person in videos captured by different cameras. How to embed the spatial-temporal information of a video into its feature representation is a crucial challenge. Most existing methods fail to make full use of the relationships between frames during feature extraction. In this work, we propose a plug-and-play non-local attention module (NLAM) for frame-level feature extraction. NLAM, based on global spatial attention and channel attention, helps the network locate the person in each frame. In addition, we propose a non-local temporal pooling (NLTP) method for temporal feature aggregation, which effectively captures long-range, global dependencies among the frames of a video. Compared with state-of-the-art methods, our model obtained impressive results on several datasets. In particular, it achieved a rank-1 accuracy of 86.3% on the MARS (Motion Analysis and Re-identification Set) dataset without re-ranking, 1.4% higher than the previous state of the art. On the DukeMTMC-VideoReID (Duke Multi-Target Multi-Camera Video Re-identification) dataset, our method also performed well, with 95% rank-1 accuracy and 94.5% mAP (mean Average Precision).
Main Authors: Zheng Liu, Feixiang Du, Wang Li, Xu Liu, Qiang Zou (School of Microelectronics, Tianjin University, Tianjin 300350, China)
Format: Article
Language: English
Published: MDPI AG, 2020-08-01
Series: Applied Sciences, Vol. 10, No. 15, Article 5385
ISSN: 2076-3417
DOI: 10.3390/app10155385
Collection: DOAJ (Directory of Open Access Journals)
Subjects: person Re-ID; video; non-local; spatial-temporal attention
Online Access: https://www.mdpi.com/2076-3417/10/15/5385
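
The abstract describes NLAM as a plug-and-play module combining global spatial attention and channel attention on frame-level feature maps. The record does not give the paper's exact architecture, so the following is only a minimal PyTorch sketch assuming the standard non-local block formulation (Wang et al., 2018) for the global spatial part and squeeze-and-excitation-style gating for the channel part; all class and parameter names here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class NLAM(nn.Module):
    """Hypothetical sketch of a non-local attention module (NLAM):
    a standard non-local block (global spatial attention) followed by
    SE-style channel attention. The paper's exact design may differ."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        inter = channels // 2
        # 1x1 conv embeddings for the non-local (self-attention) block.
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)
        # SE-style channel attention: global pool -> bottleneck -> gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C/2)
        k = self.phi(x).flatten(2)                        # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)          # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)               # (B, HW, HW) pairwise spatial affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        x = x + self.out(y)                               # residual global spatial attention
        return x * self.se(x)                             # channel re-weighting
```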
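
NLTP is likewise described only at a high level: it aggregates the frame-level features of a clip into one video-level feature while capturing long-range dependencies among frames. A sketch under the same assumptions (self-attention across the temporal axis followed by average pooling; names again hypothetical):

```python
import torch
import torch.nn as nn

class NLTP(nn.Module):
    """Hypothetical sketch of non-local temporal pooling (NLTP):
    self-attention across the T frame-level features of a clip,
    then temporal averaging into one clip-level feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.theta = nn.Linear(dim, dim // 2)
        self.phi = nn.Linear(dim, dim // 2)
        self.g = nn.Linear(dim, dim // 2)
        self.out = nn.Linear(dim // 2, dim)

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (B, T, D) frame features
        # (B, T, T) affinities: every frame attends to every other frame.
        attn = torch.softmax(self.theta(f) @ self.phi(f).transpose(1, 2), dim=-1)
        f = f + self.out(attn @ self.g(f))                # residual temporal non-local block
        return f.mean(dim=1)                              # (B, D) clip-level representation

# Usage sketch: frame features from a CNN backbone (with NLAM inserted),
# aggregated by NLTP into a single video-level descriptor.
feats = torch.randn(4, 8, 2048)       # 4 clips, 8 frames each, 2048-d features
video_feat = NLTP(2048)(feats)        # -> (4, 2048)
```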