Teacher–Student Model Using Grounding DINO and You Only Look Once for Multi-Sensor-Based Object Detection

Object detection is a crucial research topic in the fields of computer vision and artificial intelligence, involving the identification and classification of objects within images. Recent advancements in deep learning technologies, such as YOLO (You Only Look Once), Faster-R-CNN, and SSDs (Single Shot Detectors), have demonstrated high performance in object detection.

Bibliographic Details
Main Authors:	Jinhwan Son, Heechul Jung
Format:	Article
Language:	English
Published:	MDPI AG 2024-03-01
Series:	Applied Sciences
Subjects:	deep learning computer vision object detection auto-labeling
Online Access:	https://www.mdpi.com/2076-3417/14/6/2232

Description:	Object detection is a crucial research topic in the fields of computer vision and artificial intelligence, involving the identification and classification of objects within images. Recent advancements in deep learning technologies, such as YOLO (You Only Look Once), Faster-R-CNN, and SSDs (Single Shot Detectors), have demonstrated high performance in object detection. This study utilizes the YOLOv8 model for real-time object detection in environments requiring fast inference speeds, specifically in CCTV and automotive dashcam scenarios. Experiments were conducted using the ‘Multi-Image Identical Situation and Object Identification Data’ provided by AI Hub, consisting of multi-image datasets captured in identical situations using CCTV, dashcams, and smartphones. Object detection experiments were performed on three types of multi-image datasets captured in identical situations. Despite the utility of YOLO, there is a need for performance improvement in the AI Hub dataset. Grounding DINO, a zero-shot object detector with a high mAP performance, is employed. While efficient auto-labeling is possible with Grounding DINO, its processing speed is slower than YOLO, making it unsuitable for real-time object detection scenarios. This study conducts object detection experiments using publicly available labels and utilizes Grounding DINO as a teacher model for auto-labeling. The generated labels are then used to train YOLO as a student model, and performance is compared and analyzed. Experimental results demonstrate that using auto-generated labels for object detection does not lead to degradation in performance. The combination of auto-labeling and manual labeling significantly enhances performance. Additionally, an analysis of datasets containing data from various devices, including CCTV, dashcams, and smartphones, reveals the impact of different device types on the recognition accuracy for distinct devices. Through Grounding DINO, this study proves the efficacy of auto-labeling technology in contributing to efficiency and performance enhancement in the field of object detection, presenting practical applicability.
Affiliation:	Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea (both authors)
Volume/Issue:	14(6), article 2232
ISSN:	2076-3417
DOI:	10.3390/app14062232
Collection:	DOAJ
Record ID:	doaj.art-767d917f080b4807bee66e5d6f6fa3c5
