DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection
Several existing still image object detectors suffer from image deterioration in videos, such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by the diffusion model, DiffusionVID refines random noise boxes to obtain the original object boxes in a video sequence. To effectively refine the object boxes from the degraded images in the videos, we used three novel approaches: cascade refinement, dynamic coreset conditioning, and local batch refinement. The cascade refinement architecture progressively extracts information and refines boxes, whereas the dynamic coreset conditioning further improves the denoising quality using adaptive conditions based on the spatio-temporal coreset. Local batch refinement significantly improves the inference speed by exploiting GPU parallelism. On the standard and widely used ImageNet-VID benchmark, our DiffusionVID with the ResNet-101 and Swin-Base backbones achieves 86.9 mAP @ 46.6 FPS and 92.4 mAP @ 27.0 FPS, respectively, which is state-of-the-art performance. To the best of the authors' knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID.
Main Authors: | Si-Dong Roh, Ki-Seok Chung |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Conditioning; coreset; diffusion model; spatio–temporal; video object detection |
Online Access: | https://ieeexplore.ieee.org/document/10299639/ |
_version_ | 1797635393553694720 |
---|---|
author | Si-Dong Roh; Ki-Seok Chung |
author_facet | Si-Dong Roh; Ki-Seok Chung |
author_sort | Si-Dong Roh |
collection | DOAJ |
description | Several existing still image object detectors suffer from image deterioration in videos, such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by the diffusion model, DiffusionVID refines random noise boxes to obtain the original object boxes in a video sequence. To effectively refine the object boxes from the degraded images in the videos, we used three novel approaches: cascade refinement, dynamic coreset conditioning, and local batch refinement. The cascade refinement architecture progressively extracts information and refines boxes, whereas the dynamic coreset conditioning further improves the denoising quality using adaptive conditions based on the spatio-temporal coreset. Local batch refinement significantly improves the inference speed by exploiting GPU parallelism. On the standard and widely used ImageNet-VID benchmark, our DiffusionVID with the ResNet-101 and Swin-Base backbones achieves 86.9 mAP @ 46.6 FPS and 92.4 mAP @ 27.0 FPS, respectively, which is state-of-the-art performance. To the best of the authors' knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID. |
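The description above outlines how DiffusionVID operates at inference time: random noise boxes are progressively denoised into object boxes, conditioned on spatio-temporal features of the video. For readers unfamiliar with diffusion-based detection, the sketch below illustrates the general DDIM-style box-denoising loop such detectors follow. It is a minimal illustration only; `decoder_head`, `ddim_step`, and the box/step counts are hypothetical placeholders, not the authors' released implementation (see the GitHub repository linked above for the actual code).

```python
# Minimal sketch of a diffusion-based detection inference loop, assuming a
# detection head conditioned on (spatio-temporal) image features and a
# standard DDIM update rule. Not taken from the DiffusionVID repository.
import torch

@torch.no_grad()
def denoise_boxes(decoder_head, ddim_step, image_feats, num_boxes=300, steps=4):
    """Refine random noise boxes into object boxes for one frame.

    decoder_head(feats, boxes, t) -> (pred_boxes, pred_scores)   # hypothetical
    ddim_step(x_t, x0_pred, t, t_next) -> x_t_next               # hypothetical
    """
    device = image_feats.device
    # Start from pure Gaussian noise in normalized (cx, cy, w, h) box space.
    boxes_t = torch.randn(num_boxes, 4, device=device)

    # Short reverse-time schedule, e.g. t = 999 -> 0 covered in a few DDIM steps.
    times = torch.linspace(999, 0, steps + 1, device=device).long()

    for t, t_next in zip(times[:-1], times[1:]):
        # Predict clean boxes (x0) and class scores from the current noisy boxes.
        pred_boxes, pred_scores = decoder_head(image_feats, boxes_t, t)
        # Move the noisy boxes one DDIM step toward the predicted clean boxes.
        boxes_t = ddim_step(boxes_t, pred_boxes, t, t_next)

    return pred_boxes, pred_scores
```

In the paper's terms, the per-step prediction inside this loop is where cascade refinement and dynamic coreset conditioning would act to improve denoising quality, while local batch refinement concerns how such refinement steps are batched on the GPU to raise inference speed.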
first_indexed | 2024-03-11T12:21:26Z |
format | Article |
id | doaj.art-1cba3655577f4a63be6b627e1cecf123 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-11T12:21:26Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-1cba3655577f4a63be6b627e1cecf123 (2023-11-07T00:01:10Z); English; IEEE; IEEE Access; ISSN 2169-3536; published 2023-01-01; vol. 11, pp. 121434–121444; DOI 10.1109/ACCESS.2023.3328341; IEEE document 10299639; DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection; Si-Dong Roh (https://orcid.org/0000-0001-5961-948X) and Ki-Seok Chung (https://orcid.org/0000-0002-2908-8443), Department of Electronic Engineering, Hanyang University, Seoul, South Korea; https://ieeexplore.ieee.org/document/10299639/; Conditioning; coreset; diffusion model; spatio–temporal; video object detection |
title | DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection |
title_full | DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection |
title_fullStr | DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection |
title_full_unstemmed | DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection |
title_short | DiffusionVID: Denoising Object Boxes With Spatio–Temporal Conditioning for Video Object Detection |
title_sort | diffusionvid denoising object boxes with spatio temporal conditioning for video object detection |
topic | Conditioning; coreset; diffusion model; spatio–temporal; video object detection |
url | https://ieeexplore.ieee.org/document/10299639/ |
work_keys_str_mv | AT sidongroh diffusionviddenoisingobjectboxeswithspatiotemporalconditioningforvideoobjectdetection AT kiseokchung diffusionviddenoisingobjectboxeswithspatiotemporalconditioningforvideoobjectdetection |