Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes

Recent Transformer-based object detectors have achieved remarkable performance on benchmark datasets, but few have addressed the real-world challenge of object detection in crowded scenes using transformers. This limitation stems from the fixed query set size of the transformer decoder, which restricts the model’s inference capacity. To overcome this challenge, we propose Recurrent Detection Transformer (Recurrent DETR), an object detector that iterates the decoder block to render more predictions with a finite number of query tokens. Recurrent DETR can adaptively control the number of decoder block iterations based on the image’s crowdedness or complexity, resulting in a variable-size prediction set. This is enabled by our novel Pondering Hungarian Loss, which helps the model to learn when additional computation is required to identify all the objects in a crowded scene. We demonstrate the effectiveness of Recurrent DETR on two datasets: COCO 2017, which represents a standard setting, and CrowdHuman, which features a crowded setting. Our experiments on both datasets show that Recurrent DETR achieves significant performance gains of 0.8 AP and 0.4 AP, respectively, over its base architectures. Moreover, we conduct comprehensive analyses under different query set size constraints to provide a thorough evaluation of our proposed method.

Bibliographic Details
Main Authors: Hyeong Kyu Choi, Chong Keun Paik, Hyun Woo Ko, Min-Chul Park, Hyunwoo J. Kim
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Volume: 11
Pages: 78623-78643
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3293532
Collection: DOAJ (Directory of Open Access Journals), record doaj.art-e07a3bf8ea284797839bbe62ed40a841
Subjects: Computer vision; object detection; detection transformers; dynamic computation
Online Access: https://ieeexplore.ieee.org/document/10177153/

Author Affiliations
Hyeong Kyu Choi: Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea (ORCID: 0000-0003-2090-9273)
Chong Keun Paik: Samsung Electro-Mechanics, Suwon, Republic of Korea
Hyun Woo Ko: Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
Min-Chul Park: Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea (ORCID: 0000-0002-8575-085X)
Hyunwoo J. Kim: Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea (ORCID: 0000-0002-2181-9264)
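The adaptive-iteration idea described in the abstract (iterating one decoder block over a fixed query set, with a learned signal deciding when to stop) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `decoder_block`, `halt_score`, and the threshold are toy stand-ins, and the Pondering Hungarian Loss that trains the halting behavior is not modeled here.

```python
# Toy sketch of adaptive decoder-block iteration with a fixed query set.
# Each pass over the same block yields one more prediction set, so the
# total number of predictions varies with the image's "complexity".

from typing import List, Tuple

def decoder_block(queries: List[float]) -> List[float]:
    # Stand-in for one transformer decoder block: refines each query token.
    return [q * 0.5 + 1.0 for q in queries]

def halt_score(queries: List[float]) -> float:
    # Stand-in for the learned pondering signal: here, how close the
    # queries are to the block's fixed point (refinement has converged).
    return 1.0 - sum(abs(q - 2.0) for q in queries) / len(queries)

def recurrent_decode(queries: List[float],
                     threshold: float = 0.9,
                     max_iters: int = 10) -> Tuple[List[List[float]], int]:
    """Iterate the decoder block, collecting one prediction set per pass,
    until the halt score clears the threshold or the iteration cap is hit."""
    prediction_sets = []
    for step in range(1, max_iters + 1):
        queries = decoder_block(queries)
        prediction_sets.append(list(queries))
        if halt_score(queries) >= threshold:
            break
    return prediction_sets, step

sets, iters = recurrent_decode([0.0, 4.0])
# Total predictions grow with the number of passes over the finite query set.
print(iters, len(sets) * len(sets[0]))  # prints: 5 10
```

With these toy values the loop halts after five passes; in the paper's setting the halting decision is learned, so a simple image would terminate early while a crowded one would spend more decoder iterations.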