Non-Local Spatial and Temporal Attention Network for Video-Based Person Re-Identification
Given a video containing a person, the video-based person re-identification (Re-ID) task aims to identify the same person in videos captured by different cameras. How to embed the spatial-temporal information of a video into its feature representation is a crucial challenge. Most existing methods fail to make full use of the relationships between frames during feature extraction. In this work, we propose a plug-and-play non-local attention module (NLAM) for frame-level feature extraction. NLAM, based on global spatial attention and channel attention, helps the network locate the person in each frame. In addition, we propose a non-local temporal pooling (NLTP) method for temporal feature aggregation, which effectively captures long-range, global dependencies among the frames of a video. Compared with state-of-the-art methods, our model obtained impressive results on several datasets. In particular, it achieved a rank-1 accuracy of 86.3% on the MARS (Motion Analysis and Re-identification Set) dataset without re-ranking, 1.4% higher than the previous state of the art. On the DukeMTMC-VideoReID (Duke Multi-Target Multi-Camera Video Re-identification) dataset, our method also performed well, with 95% rank-1 accuracy and 94.5% mAP (mean Average Precision).
Main Authors: Zheng Liu, Feixiang Du, Wang Li, Xu Liu, Qiang Zou (School of Microelectronics, Tianjin University, Tianjin 300350, China)
Format: Article
Language: English
Published: MDPI AG, 2020-08-01
Series: Applied Sciences, Vol. 10, No. 15, Article 5385
ISSN: 2076-3417
DOI: 10.3390/app10155385
Collection: DOAJ (Directory of Open Access Journals)
Subjects: person Re-ID; video; non-local; spatial-temporal attention
Online Access: https://www.mdpi.com/2076-3417/10/15/5385
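
The abstract describes NLAM as a plug-and-play module combining global spatial attention and channel attention on frame-level feature maps. The record does not give the paper's exact architecture, so the following is only a minimal PyTorch sketch assuming the standard non-local block formulation (Wang et al., 2018) for the global spatial part and squeeze-and-excitation-style gating for the channel part; all class and parameter names here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class NLAM(nn.Module):
    """Hypothetical sketch of a non-local attention module (NLAM):
    a standard non-local block (global spatial attention) followed by
    SE-style channel attention. The paper's exact design may differ."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        inter = channels // 2
        # 1x1 conv embeddings for the non-local (self-attention) block.
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)
        # SE-style channel attention: global pool -> bottleneck -> gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C/2)
        k = self.phi(x).flatten(2)                        # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)          # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)               # (B, HW, HW) pairwise spatial affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        x = x + self.out(y)                               # residual global spatial attention
        return x * self.se(x)                             # channel re-weighting
```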
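
NLTP is likewise described only at a high level: it aggregates the frame-level features of a clip into one video-level feature while capturing long-range dependencies among frames. A sketch under the same assumptions (self-attention across the temporal axis followed by average pooling; names again hypothetical):

```python
import torch
import torch.nn as nn

class NLTP(nn.Module):
    """Hypothetical sketch of non-local temporal pooling (NLTP):
    self-attention across the T frame-level features of a clip,
    then temporal averaging into one clip-level feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.theta = nn.Linear(dim, dim // 2)
        self.phi = nn.Linear(dim, dim // 2)
        self.g = nn.Linear(dim, dim // 2)
        self.out = nn.Linear(dim // 2, dim)

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (B, T, D) frame features
        # (B, T, T) affinities: every frame attends to every other frame.
        attn = torch.softmax(self.theta(f) @ self.phi(f).transpose(1, 2), dim=-1)
        f = f + self.out(attn @ self.g(f))                # residual temporal non-local block
        return f.mean(dim=1)                              # (B, D) clip-level representation

# Usage sketch: frame features from a CNN backbone (with NLAM inserted),
# aggregated by NLTP into a single video-level descriptor.
feats = torch.randn(4, 8, 2048)       # 4 clips, 8 frames each, 2048-d features
video_feat = NLTP(2048)(feats)        # -> (4, 2048)
```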