Non-Local Spatial and Temporal Attention Network for Video-Based Person Re-Identification

Given a video containing a person, video-based person re-identification (Re-ID) aims to identify the same person in videos captured by different cameras. A crucial challenge is how to embed the spatial-temporal information of a video into its feature representation; most existing methods fail to make full use of the relationships between frames during feature extraction. In this work, we propose a plug-and-play non-local attention module (NLAM) for frame-level feature extraction. NLAM, based on global spatial attention and channel attention, helps the network locate the person in each frame. In addition, we propose a non-local temporal pooling (NLTP) method for aggregating temporal features, which effectively captures long-range, global dependencies among the frames of a video. Our model obtained impressive results on several datasets compared with state-of-the-art methods. In particular, it achieved 86.3% rank-1 accuracy on the MARS (Motion Analysis and Re-identification Set) dataset without re-ranking, 1.4% higher than the previous state of the art. On the DukeMTMC-VideoReID (Duke Multi-Target Multi-Camera Video Re-identification) dataset, our method also performed well, reaching 95% rank-1 accuracy and 94.5% mAP (mean Average Precision).
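
For orientation, the sketch below illustrates the generic non-local (self-attention) operation that modules like NLAM and NLTP build on, following the standard embedded-Gaussian formulation of Wang et al. (2018, "Non-local Neural Networks"). The class names, layer sizes, residual connections, and final mean pooling are illustrative assumptions, not the paper's exact implementation.

# Minimal PyTorch sketch of the non-local operation underlying NLAM/NLTP.
# Names, layer sizes, and pooling choices are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalBlock2D(nn.Module):
    """Spatial non-local block: every position in a frame's feature map
    attends to every other position (NLAM-style frame-level attention)."""

    def __init__(self, in_channels: int):
        super().__init__()
        inter = in_channels // 2                                   # reduced embedding dim (assumption)
        self.theta = nn.Conv2d(in_channels, inter, kernel_size=1)  # query projection
        self.phi = nn.Conv2d(in_channels, inter, kernel_size=1)    # key projection
        self.g = nn.Conv2d(in_channels, inter, kernel_size=1)      # value projection
        self.out = nn.Conv2d(inter, in_channels, kernel_size=1)    # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                                 # x: (B, C, H, W) frame feature map
        q = self.theta(x).flatten(2).transpose(1, 2)         # (B, HW, C')
        k = self.phi(x).flatten(2)                           # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)             # (B, HW, C')
        attn = F.softmax(q @ k, dim=-1)                      # (B, HW, HW) pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)  # back to (B, C', H, W)
        return x + self.out(y)                               # residual keeps the block plug-and-play


class NonLocalTemporalPool(nn.Module):
    """Temporal aggregation: frame descriptors attend to each other across
    time before pooling, so the clip feature reflects long-range global
    inter-frame dependencies (NLTP-style pooling)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, T, D) per-frame descriptors for a clip of T frames
        scale = f.size(-1) ** 0.5
        attn = F.softmax(self.q(f) @ self.k(f).transpose(1, 2) / scale, dim=-1)  # (B, T, T)
        refined = f + attn @ self.v(f)                       # frames refined with temporal context
        return refined.mean(dim=1)                           # (B, D) clip-level representation


# Usage with dummy shapes:
frames = torch.randn(2, 256, 16, 8)            # one frame's CNN feature map per batch item
print(NonLocalBlock2D(256)(frames).shape)      # torch.Size([2, 256, 16, 8])
clip = torch.randn(2, 8, 2048)                 # 8 frame descriptors of dimension 2048
print(NonLocalTemporalPool(2048)(clip).shape)  # torch.Size([2, 2048])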

Bibliographic Details
Main Authors: Zheng Liu, Feixiang Du, Wang Li, Xu Liu, Qiang Zou (School of Microelectronics, Tianjin University, Tianjin 300350, China)
Format: Article
Language: English
Published: MDPI AG, 2020-08-01
Series: Applied Sciences, Vol. 10, Issue 15, Article 5385
DOI: 10.3390/app10155385
ISSN: 2076-3417
Subjects: person Re-ID; video; non-local; spatial-temporal attention
Online Access: https://www.mdpi.com/2076-3417/10/15/5385