ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection
Object detection is a fundamental task in computer vision. Over the past several years, convolutional neural network (CNN)-based object detection models have significantly improved detection accuracy in terms of average precision (AP). Furthermore, feature pyramid networks (FPNs) are essential module...
Main Authors: | Hye-Jin Park, Ji-Woo Kang, Byung-Gyu Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2023-04-01 |
Series: | Sensors |
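Subjects: | object detection; feature pyramid network; scale sequence (<i>S</i><sup>2</sup>) feature; convolutional neural network (CNN); deep learning |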
Online Access: | https://www.mdpi.com/1424-8220/23/9/4432 |
_version_ | 1797601686462660608 |
---|---|
author | Hye-Jin Park Ji-Woo Kang Byung-Gyu Kim |
author_facet | Hye-Jin Park Ji-Woo Kang Byung-Gyu Kim |
author_sort | Hye-Jin Park |
collection | DOAJ |
description | Object detection is a fundamental task in computer vision. Over the past several years, convolutional neural network (CNN)-based object detection models have significantly improved detection accuracy in terms of average precision (AP). Furthermore, feature pyramid networks (FPNs) are essential modules for object detection models to consider various object scales. However, the AP for small objects is lower than the AP for medium and large objects. It is difficult to recognize small objects because they do not have sufficient information, and information is lost in deeper CNN layers. This paper proposes a new FPN model named ssFPN (scale sequence (<i>S</i><sup>2</sup>) feature-based feature pyramid network) to detect multi-scale objects, especially small objects. We propose a new scale sequence (<i>S</i><sup>2</sup>) feature that is extracted by 3D convolution on the level axis of the FPN. It is defined and extracted from the FPN to strengthen the information on small objects based on scale-space theory. Motivated by this theory, the FPN is regarded as a scale space, and a scale sequence (<i>S</i><sup>2</sup>) feature is extracted from it by three-dimensional convolution on the level axis of the FPN. The defined feature is basically scale-invariant and is built on a high-resolution pyramid feature map for small objects. Additionally, the designed <i>S</i><sup>2</sup> feature can be extended to most object detection models based on FPNs. We also designed a feature-level super-resolution approach to show the efficiency of the scale sequence (<i>S</i><sup>2</sup>) feature. We verified that the scale sequence (<i>S</i><sup>2</sup>) feature could improve the classification accuracy for low-resolution images by training a feature-level super-resolution model. To demonstrate the effect of the scale sequence (<i>S</i><sup>2</sup>) feature, experiments on object detection approaches with the built-in scale sequence (<i>S</i><sup>2</sup>) feature, including both one-stage and two-stage models, were conducted on the MS COCO dataset. For the two-stage object detection models Faster R-CNN and Mask R-CNN with the <i>S</i><sup>2</sup> feature, AP improvements of up to 1.6% and 1.4%, respectively, were achieved. Additionally, the AP<sub>S</sub> of each model was improved by 1.2% and 1.1%, respectively. Furthermore, the one-stage object detection models in the YOLO series were improved. For YOLOv4-P5, YOLOv4-P6, YOLOR-P6, YOLOR-W6, and YOLOR-D6 with the <i>S</i><sup>2</sup> feature, 0.9%, 0.5%, 0.5%, 0.1%, and 0.1% AP improvements were observed. For small object detection, the AP<sub>S</sub> increased by 1.1%, 1.1%, 0.9%, 0.4%, and 0.1%, respectively. Experiments using the feature-level super-resolution approach with the proposed scale sequence (<i>S</i><sup>2</sup>) feature were conducted on the CIFAR-100 dataset. By training the feature-level super-resolution model, we verified that ResNet-101 with the <i>S</i><sup>2</sup> feature trained on LR images achieved a 55.2% classification accuracy, which was 1.6% higher than for ResNet-101 trained on HR images. |
first_indexed | 2024-03-11T04:07:06Z |
format | Article |
id | doaj.art-adcdb748228a40189e94c022a91ffbf7 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
last_indexed | 2024-03-11T04:07:06Z |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-adcdb748228a40189e94c022a91ffbf7; 2023-11-17T23:44:19Z; eng; MDPI AG; Sensors; 1424-8220; 2023-04-01; 23; 9; 4432; 10.3390/s23094432; ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection; Hye-Jin Park, Ji-Woo Kang, Byung-Gyu Kim (all: Department of Artificial Intelligence Engineering, Sookmyung Women’s University, 100 Chungpa-ro 47 gil, Yongsan-gu, Seoul 04310, Republic of Korea); https://www.mdpi.com/1424-8220/23/9/4432; object detection; feature pyramid network; scale sequence (<i>S</i><sup>2</sup>) feature; convolutional neural network (CNN); deep learning |
spellingShingle | Hye-Jin Park Ji-Woo Kang Byung-Gyu Kim ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection Sensors object detection feature pyramid network scale sequence (<i>S</i><sup>2</sup>) feature convolutional neural network (CNN) deep learning |
title | ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection |
title_full | ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection |
title_fullStr | ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection |
title_full_unstemmed | ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection |
title_short | ssFPN: Scale Sequence (<i>S</i><sup>2</sup>) Feature-Based Feature Pyramid Network for Object Detection |
title_sort | ssfpn scale sequence s2 feature based feature pyramid network for object detection |
topic | object detection; feature pyramid network; scale sequence (<i>S</i><sup>2</sup>) feature; convolutional neural network (CNN); deep learning |
url | https://www.mdpi.com/1424-8220/23/9/4432 |
work_keys_str_mv | AT hyejinpark ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection AT jiwookang ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection AT byunggyukim ssfpnscalesequenceisisup2supfeaturebasedfeaturepyramidnetworkforobjectdetection |
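
The abstract in this record describes the scale sequence (<i>S</i><sup>2</sup>) feature as the result of a 3D convolution along the level axis of the FPN, built at the resolution of the highest-resolution pyramid map. The following is a minimal illustrative sketch of that idea only, not the authors' implementation. It assumes a single channel, nearest-neighbour upsampling, a hand-supplied kernel, and a max over the level axis to collapse the result; none of these choices is stated in the record, and all function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D map (list of rows) by an integer factor."""
    return [[fmap[y // factor][x // factor]
             for x in range(len(fmap[0]) * factor)]
            for y in range(len(fmap) * factor)]


def stack_levels(pyramid):
    """Resize every level to the resolution of pyramid[0] and stack them.

    Assumes pyramid[0] is the highest-resolution level and that each level's
    size divides it exactly. The result is indexed as volume[level][y][x].
    """
    target_h = len(pyramid[0])
    return [upsample_nearest(fmap, target_h // len(fmap)) for fmap in pyramid]


def conv3d_same(volume, kernel):
    """Zero-padded 3D cross-correlation; the output has the input's shape."""
    n_l, n_h, n_w = len(volume), len(volume[0]), len(volume[0][0])
    k_l, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    p_l, p_h, p_w = k_l // 2, k_h // 2, k_w // 2
    out = [[[0.0] * n_w for _ in range(n_h)] for _ in range(n_l)]
    for l in range(n_l):
        for y in range(n_h):
            for x in range(n_w):
                acc = 0.0
                for dl in range(k_l):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            ll, yy, xx = l + dl - p_l, y + dy - p_h, x + dx - p_w
                            # Positions outside the volume count as zero padding.
                            if 0 <= ll < n_l and 0 <= yy < n_h and 0 <= xx < n_w:
                                acc += kernel[dl][dy][dx] * volume[ll][yy][xx]
                out[l][y][x] = acc
    return out


def scale_sequence_feature(pyramid, kernel):
    """Convolve across the level axis, then collapse it with a max (an assumption)."""
    response = conv3d_same(stack_levels(pyramid), kernel)
    n_l, n_h, n_w = len(response), len(response[0]), len(response[0][0])
    return [[max(response[l][y][x] for l in range(n_l)) for x in range(n_w)]
            for y in range(n_h)]


if __name__ == "__main__":
    # Toy three-level pyramid: 4x4, 2x2 and 1x1 single-channel maps.
    pyramid = [[[1.0] * 4 for _ in range(4)], [[2.0] * 2 for _ in range(2)], [[4.0]]]
    kernel = [[[1.0]], [[1.0]], [[1.0]]]  # 3x1x1: mixes neighbouring levels only
    for row in scale_sequence_feature(pyramid, kernel):
        print(row)
```

In a real detector the pyramid maps carry many channels and the kernel weights are learned, so a framework-provided 3D convolution layer would replace the explicit loops used here.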