Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection
Salient object detection (SOD), which is used to identify the most distinctive object in a given scene, plays an important role in computer vision tasks. Most existing RGB-D SOD methods employ a CNN-based network as the backbone to extract features from RGB and depth images; however, the inherent lo...
Main Authors: | Shuaihui Wang, Fengyi Jiang, Boqian Xu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-10-01 |
Series: | Sensors |
Subjects: | RGB-D salient object detection; edge guidance; transformer; cross-modal interaction |
Online Access: | https://www.mdpi.com/1424-8220/23/21/8802 |
_version_ | 1797631251615580160 |
---|---|
author | Shuaihui Wang; Fengyi Jiang; Boqian Xu |
author_sort | Shuaihui Wang |
collection | DOAJ |
description | Salient object detection (SOD), which identifies the most distinctive object in a given scene, plays an important role in computer vision tasks. Most existing RGB-D SOD methods employ a CNN-based network as the backbone to extract features from RGB and depth images; however, the inherent locality of CNNs limits the performance of these methods. To tackle this issue, we propose a novel Swin Transformer-based edge guidance network (SwinEGNet) for RGB-D SOD, in which the Swin Transformer serves as a powerful feature extractor that captures the global context. An edge-guided cross-modal interaction module is proposed to enhance and fuse features effectively. Specifically, we employ the Swin Transformer as the backbone to extract features from RGB images and depth maps. We then introduce the edge extraction module (EEM) to extract edge features and the depth enhancement module (DEM) to enhance depth features. Additionally, a cross-modal interaction module (CIM) integrates cross-modal features from global and local contexts. Finally, we employ a cascaded decoder to refine the prediction map in a coarse-to-fine manner. Extensive experiments demonstrate that SwinEGNet achieves the best performance on the LFSD, NLPR, DES, and NJU2K datasets and comparable performance on the STEREO dataset when compared with 14 state-of-the-art methods. Our model outperforms SwinNet while using 88.4% of its parameters and 77.2% of its FLOPs. Our code will be publicly available. |
first_indexed | 2024-03-11T11:21:11Z |
format | Article |
id | doaj.art-3448d47dabe348ddbffec7dabad739fc |
institution | Directory of Open Access Journals |
issn | 1424-8220 |
language | English |
last_indexed | 2024-03-11T11:21:11Z |
publishDate | 2023-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-3448d47dabe348ddbffec7dabad739fc; 2023-11-10T15:12:08Z; eng; MDPI AG; Sensors; 1424-8220; 2023-10-01; vol. 23, no. 21, article 8802; DOI 10.3390/s23218802; Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection; Shuaihui Wang, Fengyi Jiang, Boqian Xu (all: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China); https://www.mdpi.com/1424-8220/23/21/8802; RGB-D salient object detection; edge guidance; transformer; cross-modal interaction |
title | Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection |
topic | RGB-D salient object detection; edge guidance; transformer; cross-modal interaction |
url | https://www.mdpi.com/1424-8220/23/21/8802 |
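
The efficiency figures quoted in the description (88.4% parameters, 77.2% FLOPs relative to SwinNet) can be restated as savings. The sketch below is a minimal illustration that assumes the two figures are percentages of SwinNet's cost; the function name `relative_saving` and the constant names are hypothetical and do not come from the record or the paper.

```python
def relative_saving(percent_of_baseline: float) -> float:
    """Return the percentage saved relative to a baseline, given a cost
    expressed as a percentage of that baseline (rounded to one decimal)."""
    if percent_of_baseline < 0:
        raise ValueError("percent_of_baseline must be non-negative")
    return round(100.0 - percent_of_baseline, 1)


# Figures quoted in the description, read as percentages of the
# SwinNet baseline (an assumption about how the abstract is phrased).
PARAMS_PERCENT_OF_SWINNET = 88.4
FLOPS_PERCENT_OF_SWINNET = 77.2

param_saving = relative_saving(PARAMS_PERCENT_OF_SWINNET)
flop_saving = relative_saving(FLOPS_PERCENT_OF_SWINNET)

print(f"parameters: {param_saving}% fewer than the baseline")
print(f"FLOPs: {flop_saving}% fewer than the baseline")
```

Under that reading, the model uses roughly 11.6% fewer parameters and 22.8% fewer FLOPs than the baseline. The record gives no absolute counts, so none are shown here.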