Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection

Full description

Salient object detection (SOD), which identifies the most distinctive object in a given scene, plays an important role in computer vision tasks. Most existing RGB-D SOD methods employ a CNN-based network as the backbone to extract features from RGB and depth images; however, the inherent locality of CNNs limits their performance. To tackle this issue, we propose a novel Swin Transformer-based edge guidance network (SwinEGNet) for RGB-D SOD, in which the Swin Transformer serves as a powerful feature extractor that captures the global context, and an edge-guided cross-modal interaction module effectively enhances and fuses features. Specifically, we employ the Swin Transformer as the backbone to extract features from RGB images and depth maps. We then introduce an edge extraction module (EEM) to extract edge features and a depth enhancement module (DEM) to enhance depth features. Additionally, a cross-modal interaction module (CIM) integrates cross-modal features from global and local contexts. Finally, a cascaded decoder refines the prediction map in a coarse-to-fine manner. Extensive experiments demonstrate that SwinEGNet achieves the best performance on the LFSD, NLPR, DES, and NJU2K datasets and comparable performance on the STEREO dataset against 14 state-of-the-art methods. Compared to SwinNet, our model achieves better performance with only 88.4% of the parameters and 77.2% of the FLOPs. Our code will be publicly available.
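
The pipeline described above (dual Swin Transformer streams, EEM, DEM, CIM, cascaded decoder) can be sketched in code. The following PyTorch skeleton is a minimal illustration only, not the authors' implementation: the class names, channel widths, module internals, and the strided-conv stand-in for the Swin stages are all assumptions made for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StageBlock(nn.Module):
    """Placeholder for one Swin Transformer stage (stand-in only).
    A strided conv merely mimics the 2x downsampling between stages."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.GELU())
    def forward(self, x):
        return self.body(x)

class Backbone(nn.Module):
    """Four-stage feature pyramid; one stream each for RGB and depth."""
    def __init__(self, c_in, widths=(96, 192, 384, 768)):
        super().__init__()
        chans = (c_in,) + widths
        self.stages = nn.ModuleList(
            StageBlock(chans[i], chans[i + 1]) for i in range(4))
    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # shallow -> deep

class EEM(nn.Module):
    """Edge extraction module (hypothetical internals): predicts an
    edge map from the shallowest RGB feature to guide fusion."""
    def __init__(self, c):
        super().__init__()
        self.head = nn.Conv2d(c, 1, 3, padding=1)
    def forward(self, f_rgb):
        return torch.sigmoid(self.head(f_rgb))

class DEM(nn.Module):
    """Depth enhancement module (hypothetical internals): channel
    attention that re-weights a depth feature before fusion."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c), nn.Sigmoid())
    def forward(self, f_d):
        w = self.fc(f_d.mean(dim=(2, 3)))  # (B, C) channel weights
        return f_d * w[:, :, None, None]

class CIM(nn.Module):
    """Cross-modal interaction module (hypothetical internals): fuses
    an RGB feature with the enhanced depth feature under edge guidance."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)
    def forward(self, f_rgb, f_d, edge):
        e = F.interpolate(edge, size=f_rgb.shape[2:],
                          mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([f_rgb * (1 + e), f_d], dim=1))

class SwinEGNetSketch(nn.Module):
    """Dual-stream encoder, edge-guided fusion, coarse-to-fine decoder."""
    def __init__(self, widths=(96, 192, 384, 768)):
        super().__init__()
        self.rgb = Backbone(3, widths)
        self.depth = Backbone(1, widths)
        self.eem = EEM(widths[0])
        self.dems = nn.ModuleList(DEM(c) for c in widths)
        self.cims = nn.ModuleList(CIM(c) for c in widths)
        self.heads = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in widths)
    def forward(self, rgb, depth):
        fr, fd = self.rgb(rgb), self.depth(depth)
        edge = self.eem(fr[0])
        fused = [cim(r, dem(d), edge)
                 for cim, dem, r, d in zip(self.cims, self.dems, fr, fd)]
        # Cascaded decoding: refine the deepest prediction with
        # progressively shallower fused features.
        pred = self.heads[-1](fused[-1])
        for f, head in zip(reversed(fused[:-1]), reversed(self.heads[:-1])):
            pred = F.interpolate(pred, size=f.shape[2:],
                                 mode="bilinear", align_corners=False)
            pred = pred + head(f)
        return torch.sigmoid(pred), edge

if __name__ == "__main__":
    net = SwinEGNetSketch()
    sal, edge = net(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
    print(sal.shape, edge.shape)

A real implementation would substitute actual Swin stages (e.g., Swin-B) for StageBlock and likely add deep supervision on the intermediate predictions; the parameter and FLOP figures quoted against SwinNet refer to the authors' full model, not to this toy sketch.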

Bibliographic Details
Main Authors: Shuaihui Wang, Fengyi Jiang, Boqian Xu
Affiliation: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China (all authors)
Format: Article
Language: English
Published: MDPI AG, 2023-10-01
Series: Sensors, Vol. 23, Iss. 21, Article 8802
ISSN: 1424-8220
DOI: 10.3390/s23218802
Collection: Directory of Open Access Journals (DOAJ)
Subjects: RGB-D salient object detection; edge guidance; transformer; cross-modal interaction
Online Access: https://www.mdpi.com/1424-8220/23/21/8802