ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection


Bibliographic Details
Main Authors: Hye-Jin Park, Ji-Woo Kang, Byung-Gyu Kim
Format: Article
Language: English
Published: MDPI AG 2023-04-01
Series: Sensors
Subjects:
Online Access: https://www.mdpi.com/1424-8220/23/9/4432
_version_ 1797601686462660608
author Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
author_facet Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
author_sort Hye-Jin Park
collection DOAJ
description Object detection is a fundamental task in computer vision. Over the past several years, convolutional neural network (CNN)-based object detection models have significantly improved detection accuracy in terms of average precision (AP). Furthermore, feature pyramid networks (FPNs) are essential modules that allow object detection models to handle objects at various scales. However, the AP for small objects is lower than that for medium and large objects. Small objects are difficult to recognize because they carry little information, and that information is lost in deeper CNN layers. This paper proposes a new FPN model named ssFPN (scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature-based feature pyramid network) to detect multi-scale objects, especially small objects. We propose a new scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature that is extracted by 3D convolution along the level axis of the FPN. It is designed to strengthen the information on small objects based on scale-space theory. Motivated by this theory, we regard the FPN as a scale space and extract the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature by three-dimensional convolution along its level axis. The defined feature is basically scale-invariant and is built on a high-resolution pyramid feature map for small objects.
Additionally, the designed <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature can be extended to most FPN-based object detection models. We also designed a feature-level super-resolution approach to show the efficiency of the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature. We verified that the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature could improve the classification accuracy for low-resolution (LR) images by training a feature-level super-resolution model. To demonstrate the effect of the scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature, experiments with both one-stage and two-stage object detection models incorporating this feature were conducted on the MS COCO dataset. For the two-stage object detection models Faster R-CNN and Mask R-CNN with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature, AP improvements of up to 1.6% and 1.4%, respectively, were achieved.
Additionally, the AP<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mrow></mrow><mi>S</mi></msub></semantics></math></inline-formula> of each model improved by 1.2% and 1.1%, respectively. Furthermore, the one-stage object detection models in the YOLO series were improved. For YOLOv4-P5, YOLOv4-P6, YOLOR-P6, YOLOR-W6, and YOLOR-D6 with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature, AP improvements of 0.9%, 0.5%, 0.5%, 0.1%, and 0.1% were observed. For small object detection, the AP<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mrow></mrow><mi>S</mi></msub></semantics></math></inline-formula> increased by 1.1%, 1.1%, 0.9%, 0.4%, and 0.1%, respectively. Experiments using the feature-level super-resolution approach with the proposed scale sequence (<i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>) feature were conducted on the CIFAR-100 dataset. By training the feature-level super-resolution model, we verified that ResNet-101 with the <i>S</i><inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula> feature trained on LR images achieved a 55.2% classification accuracy, which was 1.6% higher than ResNet-101 trained on high-resolution (HR) images.
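The core idea in the abstract is to treat the FPN levels as a sequence (a scale space), resize them to the finest level's resolution, stack them along a new "level" axis, and convolve across that axis. The following NumPy sketch is an illustration only, not the authors' implementation: the function names, shapes, and the fixed averaging kernel are all invented for the example, and in the actual model the 3D kernel would be learned.

```python
import numpy as np

def upsample_nearest(x, target_hw):
    """Nearest-neighbour upsample of a (C, h, w) map to (C, H, W)."""
    C, h, w = x.shape
    H, W = target_hw
    rows = np.arange(H) * h // H  # source row index for each target row
    cols = np.arange(W) * w // W
    return x[:, rows][:, :, cols]

def scale_sequence_feature(pyramid, kernel):
    """Toy S^2-style feature: stack FPN levels and convolve across them.

    pyramid : list of (C, h_i, w_i) maps, finest resolution first
    kernel  : (L, kh, kw) weights; L must equal len(pyramid), so the
              convolution mixes information across the level axis
    """
    H, W = pyramid[0].shape[1:]
    # Resize every level to the finest resolution, stack -> (C, L, H, W)
    stack = np.stack([upsample_nearest(p, (H, W)) for p in pyramid], axis=1)
    L, kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Zero-pad spatially so the output keeps the (H, W) resolution
    padded = np.pad(stack, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((stack.shape[0], H, W))
    for l in range(L):          # sum over the level axis ...
        for i in range(kh):     # ... and the spatial kernel window
            for j in range(kw):
                out += kernel[l, i, j] * padded[:, l, i:i + H, j:j + W]
    return out

# Tiny demo: a 3-level pyramid with 2 channels, averaged by a uniform kernel
pyramid = [np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2))]
feature = scale_sequence_feature(pyramid, np.ones((3, 3, 3)) / 27.0)
# feature has shape (2, 8, 8): one fused map at the finest resolution
```

The output lives at the highest-resolution level, which matches the paper's motivation of reinforcing the pyramid map that small objects depend on; a real implementation would use a learned `Conv3d` and fuse the result back into the detector's pyramid.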
first_indexed 2024-03-11T04:07:06Z
format Article
id doaj.art-adcdb748228a40189e94c022a91ffbf7
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-11T04:07:06Z
publishDate 2023-04-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-adcdb748228a40189e94c022a91ffbf7
2023-11-17T23:44:19Z
eng
MDPI AG
Sensors 1424-8220
2023-04-01, Vol. 23, Iss. 9, Art. 4432
10.3390/s23094432
ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
Hye-Jin Park; Ji-Woo Kang; Byung-Gyu Kim
Department of Artificial Intelligence Engineering, Sookmyung Women’s University, 100 Chungpa-ro 47 gil, Yongsan-gu, Seoul 04310, Republic of Korea
https://www.mdpi.com/1424-8220/23/9/4432
object detection; feature pyramid network; scale sequence (<i>S</i><sup>2</sup>) feature; convolutional neural network (CNN); deep learning
spellingShingle Hye-Jin Park
Ji-Woo Kang
Byung-Gyu Kim
ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
Sensors
object detection
feature pyramid network
scale sequence (<i>S</i><sup>2</sup>) feature
convolutional neural network (CNN)
deep learning
title ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_full ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_fullStr ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_full_unstemmed ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_short ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
title_sort ssfpn scale sequence i s i sup 2 sup feature based feature pyramid network for object detection
topic object detection
feature pyramid network
scale sequence (<i>S</i><sup>2</sup>) feature
convolutional neural network (CNN)
deep learning
url https://www.mdpi.com/1424-8220/23/9/4432
work_keys_str_mv AT hyejinpark ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection
AT jiwookang ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection
AT byunggyukim ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection