Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes
Recent Transformer-based object detectors have achieved remarkable performance on benchmark datasets, but few have addressed the real-world challenge of object detection in crowded scenes using transformers. This limitation stems from the fixed query set size of the transformer decoder, which restricts the model’s inference capacity.
Main Authors: | Hyeong Kyu Choi, Chong Keun Paik, Hyun Woo Ko, Min-Chul Park, Hyunwoo J. Kim |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Computer vision; object detection; detection transformers; dynamic computation |
Online Access: | https://ieeexplore.ieee.org/document/10177153/ |
_version_ | 1797746150153912320 |
---|---|
author | Hyeong Kyu Choi; Chong Keun Paik; Hyun Woo Ko; Min-Chul Park; Hyunwoo J. Kim
author_facet | Hyeong Kyu Choi; Chong Keun Paik; Hyun Woo Ko; Min-Chul Park; Hyunwoo J. Kim
author_sort | Hyeong Kyu Choi |
collection | DOAJ |
description | Recent Transformer-based object detectors have achieved remarkable performance on benchmark datasets, but few have addressed the real-world challenge of object detection in crowded scenes using transformers. This limitation stems from the fixed query set size of the transformer decoder, which restricts the model’s inference capacity. To overcome this challenge, we propose Recurrent Detection Transformer (Recurrent DETR), an object detector that iterates the decoder block to render more predictions with a finite number of query tokens. Recurrent DETR can adaptively control the number of decoder block iterations based on the image’s crowdedness or complexity, resulting in a variable-size prediction set. This is enabled by our novel Pondering Hungarian Loss, which helps the model to learn when additional computation is required to identify all the objects in a crowded scene. We demonstrate the effectiveness of Recurrent DETR on two datasets: COCO 2017, which represents a standard setting, and CrowdHuman, which features a crowded setting. Our experiments on both datasets show that Recurrent DETR achieves significant performance gains of 0.8 AP and 0.4 AP, respectively, over its base architectures. Moreover, we conduct comprehensive analyses under different query set size constraints to provide a thorough evaluation of our proposed method. |
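The abstract describes the mechanism only at a high level. The sketch below is a minimal, hedged illustration of one plausible reading of it, assuming a PyTorch-style implementation: a single shared decoder block applied recurrently to a fixed query set, with a learned halting head that decides per image whether another iteration is warranted. All names here (RecurrentDecoderSketch, halt_head, max_iters, halt_threshold) and the pooling/halting details are assumptions introduced for illustration; they are not the paper's actual code.

```python
# Illustrative sketch only: a shared decoder block iterated with adaptive halting,
# loosely mirroring the idea in the abstract. Module names and hyperparameters are
# hypothetical, not taken from the Recurrent DETR implementation.
import torch
import torch.nn as nn


class RecurrentDecoderSketch(nn.Module):
    """Shared transformer decoder block applied recurrently, with a per-image halting head."""

    def __init__(self, d_model=256, nhead=8, num_queries=100,
                 num_classes=91, max_iters=4, halt_threshold=0.5):
        super().__init__()
        self.decoder_block = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward=2048, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h), normalized
        self.halt_head = nn.Linear(d_model, 1)                 # per-image halting score
        self.max_iters = max_iters
        self.halt_threshold = halt_threshold

    def forward(self, memory):
        # memory: encoder output features, shape (batch, num_tokens, d_model)
        bsz = memory.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(bsz, -1, -1)
        all_logits, all_boxes, halt_probs = [], [], []
        for _ in range(self.max_iters):
            # Reuse the same decoder block; its output becomes the next iteration's input.
            queries = self.decoder_block(queries, memory)
            # Each iteration emits its own prediction set, so the total number of
            # predictions grows with the number of iterations actually executed.
            all_logits.append(self.class_head(queries))
            all_boxes.append(self.box_head(queries).sigmoid())
            # Pool the query features into a single halting probability per image.
            halt_probs.append(torch.sigmoid(self.halt_head(queries.mean(dim=1))))
            # At inference time, stop once every image in the batch has voted to halt.
            if not self.training and (halt_probs[-1] > self.halt_threshold).all():
                break
        return all_logits, all_boxes, halt_probs


# Hypothetical usage with random tensors standing in for encoder features.
if __name__ == "__main__":
    model = RecurrentDecoderSketch().eval()
    with torch.no_grad():
        logits, boxes, halts = model(torch.randn(2, 600, 256))
    print(len(logits), logits[0].shape, boxes[0].shape)  # iterations run, (2, 100, 92), (2, 100, 4)
```

In the actual model, per the abstract, training uses the proposed Pondering Hungarian Loss, which teaches the model when additional decoder iterations are required to cover all objects in a crowded scene; those training details go beyond this sketch and are not reproduced here.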
first_indexed | 2024-03-12T15:32:55Z |
format | Article |
id | doaj.art-e07a3bf8ea284797839bbe62ed40a841 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T15:32:55Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-e07a3bf8ea284797839bbe62ed40a841; 2023-08-09T23:00:39Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; Vol. 11, pp. 78623-78643; doi:10.1109/ACCESS.2023.3293532; IEEE document 10177153; Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes; Hyeong Kyu Choi (https://orcid.org/0000-0003-2090-9273), Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea; Chong Keun Paik, Samsung Electro-Mechanics, Suwon, Republic of Korea; Hyun Woo Ko, Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea; Min-Chul Park (https://orcid.org/0000-0002-8575-085X), Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea; Hyunwoo J. Kim (https://orcid.org/0000-0002-2181-9264), Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea; Recent Transformer-based object detectors have achieved remarkable performance on benchmark datasets, but few have addressed the real-world challenge of object detection in crowded scenes using transformers. This limitation stems from the fixed query set size of the transformer decoder, which restricts the model’s inference capacity. To overcome this challenge, we propose Recurrent Detection Transformer (Recurrent DETR), an object detector that iterates the decoder block to render more predictions with a finite number of query tokens. Recurrent DETR can adaptively control the number of decoder block iterations based on the image’s crowdedness or complexity, resulting in a variable-size prediction set. This is enabled by our novel Pondering Hungarian Loss, which helps the model to learn when additional computation is required to identify all the objects in a crowded scene. We demonstrate the effectiveness of Recurrent DETR on two datasets: COCO 2017, which represents a standard setting, and CrowdHuman, which features a crowded setting. Our experiments on both datasets show that Recurrent DETR achieves significant performance gains of 0.8 AP and 0.4 AP, respectively, over its base architectures. Moreover, we conduct comprehensive analyses under different query set size constraints to provide a thorough evaluation of our proposed method. https://ieeexplore.ieee.org/document/10177153/; Computer vision; object detection; detection transformers; dynamic computation
spellingShingle | Hyeong Kyu Choi; Chong Keun Paik; Hyun Woo Ko; Min-Chul Park; Hyunwoo J. Kim; Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes; IEEE Access; Computer vision; object detection; detection transformers; dynamic computation
title | Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes |
title_full | Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes |
title_fullStr | Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes |
title_full_unstemmed | Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes |
title_short | Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes |
title_sort | recurrent detr transformer based object detection for crowded scenes |
topic | Computer vision; object detection; detection transformers; dynamic computation
url | https://ieeexplore.ieee.org/document/10177153/ |
work_keys_str_mv | AT hyeongkyuchoi recurrentdetrtransformerbasedobjectdetectionforcrowdedscenes AT chongkeunpaik recurrentdetrtransformerbasedobjectdetectionforcrowdedscenes AT hyunwooko recurrentdetrtransformerbasedobjectdetectionforcrowdedscenes AT minchulpark recurrentdetrtransformerbasedobjectdetectionforcrowdedscenes AT hyunwoojkim recurrentdetrtransformerbasedobjectdetectionforcrowdedscenes |