ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection


Bibliographic Details
Main Authors: Hye-Jin Park, Ji-Woo Kang, Byung-Gyu Kim
Format: Article
Language: English
Published: MDPI AG 2023-04-01
Series: Sensors
Subjects:
Online Access: https://www.mdpi.com/1424-8220/23/9/4432
_version_ 1797601686462660608
author Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
author_facet Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
author_sort Hye-Jin Park
collection DOAJ
description Object detection is a fundamental task in computer vision. Over the past several years, convolutional neural network (CNN)-based object detection models have significantly improved detection accuracy in terms of average precision (AP). Furthermore, feature pyramid networks (FPNs) are essential modules that allow object detection models to handle objects at various scales. However, the AP for small objects is lower than that for medium and large objects. Small objects are difficult to recognize because they carry little information, and that information is lost in deeper CNN layers. This paper proposes a new FPN model named ssFPN (scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature-based feature pyramid network) to detect multi-scale objects, especially small objects. We propose a new scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature that is extracted by 3D convolution along the level axis of the FPN. It is designed to strengthen the information on small objects based on scale-space theory. Motivated by this theory, we regard the FPN as a scale space and extract the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature by three-dimensional convolution along its level axis. The defined feature is basically scale-invariant and is built on a high-resolution pyramid feature map for small objects.
Additionally, the designed <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature can be extended to most FPN-based object detection models. We also designed a feature-level super-resolution approach to show the efficiency of the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature. We verified that the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature could improve the classification accuracy for low-resolution (LR) images by training a feature-level super-resolution model. To demonstrate the effect of the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature, experiments with both one-stage and two-stage object detection models incorporating this feature were conducted on the MS COCO dataset. For the two-stage object detection models Faster R-CNN and Mask R-CNN with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature, AP improvements of up to 1.6% and 1.4%, respectively, were achieved.
Additionally, the AP<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mrow></mrow><mi>S</mi></msub></semantics></math></inline-formula> of each model improved by 1.2% and 1.1%, respectively. Furthermore, the one-stage object detection models in the YOLO series were improved. For YOLOv4-P5, YOLOv4-P6, YOLOR-P6, YOLOR-W6, and YOLOR-D6 with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature, AP improvements of 0.9%, 0.5%, 0.5%, 0.1%, and 0.1% were observed. For small object detection, the AP<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mrow></mrow><mi>S</mi></msub></semantics></math></inline-formula> increased by 1.1%, 1.1%, 0.9%, 0.4%, and 0.1%, respectively. Experiments using the feature-level super-resolution approach with the proposed scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature were conducted on the CIFAR-100 dataset. By training the feature-level super-resolution model, we verified that ResNet-101 with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature trained on LR images achieved a 55.2% classification accuracy, which was 1.6% higher than ResNet-101 trained on high-resolution (HR) images.
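The core idea in the abstract is to treat the FPN levels as a sequence (a scale space), resize them to the finest level's resolution, stack them along a new "level" axis, and convolve across that axis. The following NumPy sketch is an illustration only, not the authors' implementation: the function names, shapes, and the fixed averaging kernel are all invented for the example, and in the actual model the 3D kernel would be learned.

```python
import numpy as np

def upsample_nearest(x, target_hw):
    """Nearest-neighbour upsample of a (C, h, w) map to (C, H, W)."""
    C, h, w = x.shape
    H, W = target_hw
    rows = np.arange(H) * h // H  # source row index for each target row
    cols = np.arange(W) * w // W
    return x[:, rows][:, :, cols]

def scale_sequence_feature(pyramid, kernel):
    """Toy S^2-style feature: stack FPN levels and convolve across them.

    pyramid : list of (C, h_i, w_i) maps, finest resolution first
    kernel  : (L, kh, kw) weights; L must equal len(pyramid), so the
              convolution mixes information across the level axis
    """
    H, W = pyramid[0].shape[1:]
    # Resize every level to the finest resolution, stack -> (C, L, H, W)
    stack = np.stack([upsample_nearest(p, (H, W)) for p in pyramid], axis=1)
    L, kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Zero-pad spatially so the output keeps the (H, W) resolution
    padded = np.pad(stack, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((stack.shape[0], H, W))
    for l in range(L):          # sum over the level axis ...
        for i in range(kh):     # ... and the spatial kernel window
            for j in range(kw):
                out += kernel[l, i, j] * padded[:, l, i:i + H, j:j + W]
    return out

# Tiny demo: a 3-level pyramid with 2 channels, averaged by a uniform kernel
pyramid = [np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2))]
feature = scale_sequence_feature(pyramid, np.ones((3, 3, 3)) / 27.0)
# feature has shape (2, 8, 8): one fused map at the finest resolution
```

The output lives at the highest-resolution level, which matches the paper's motivation of reinforcing the pyramid map that small objects depend on; a real implementation would use a learned `Conv3d` and fuse the result back into the detector's pyramid.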
first_indexed 2024-03-11T04:07:06Z
format Article
id doaj.art-adcdb748228a40189e94c022a91ffbf7
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-11T04:07:06Z
publishDate 2023-04-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-adcdb748228a40189e94c022a91ffbf7
2023-11-17T23:44:19Z
eng
MDPI AG
Sensors 1424-8220
2023-04-01, Vol. 23, Iss. 9, Art. 4432
10.3390/s23094432
ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
Hye-Jin Park; Ji-Woo Kang; Byung-Gyu Kim
Department of Artificial Intelligence Engineering, Sookmyung Women’s University, 100 Chungpa-ro 47 gil, Yongsan-gu, Seoul 04310, Republic of Korea
https://www.mdpi.com/1424-8220/23/9/4432
object detection; feature pyramid network; scale sequence (<i>S</i><sup>2</sup>) feature; convolutional neural network (CNN); deep learning
spellingShingle Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
Sensors
object detection
feature pyramid network
scale sequence (<i>S</i><sup>2</sup>) feature
convolutional neural network (CNN)
deep learning
title ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_full ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_fullStr ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_full_unstemmed ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_short ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_sort ssfpn scale sequence i s i sup 2 sup feature based feature pyramid network for object detection
topic object detection
feature pyramid network
scale sequence (<i>S</i><sup>2</sup>) feature
convolutional neural network (CNN)
deep learning
url https://www.mdpi.com/1424-8220/23/9/4432
work_keys_str_mv AT hyejinpark ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection
AT jiwookang ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection
AT byunggyukim ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection