TANet: Transformer‐based asymmetric network for RGB‐D salient object detection

Abstract: Existing RGB‐D salient object detection methods mainly rely on a symmetric two‐stream Convolutional Neural Network (CNN)‐based architecture to extract RGB and depth features separately. This symmetric design has two problems: first, the ability of CNNs to learn global context is limited; second, it ignores the inherent differences between the two modalities. In this study, a Transformer‐based asymmetric network (TANet) is proposed to tackle these issues. The authors exploit the strong feature‐extraction capability of a Transformer to capture global semantic information from the RGB data, and design a lightweight CNN backbone, used without pre‐training, to extract spatial structure information from the depth data. This asymmetric hybrid encoder reduces the number of parameters and increases speed without sacrificing performance. A cross‐modal feature fusion module is then designed to mutually enhance and fuse the RGB and depth features. Finally, edge prediction is added as an auxiliary task and an edge enhancement module is proposed to generate sharper contours. Extensive experiments demonstrate that the method outperforms 14 state‐of‐the‐art RGB‐D methods on six public datasets. The authors' code will be released at https://github.com/lc012463/TANet.
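The abstract outlines the core architectural idea: a heavyweight Transformer encoder for the RGB stream, a lightweight non‐pretrained CNN for the depth stream, a cross‐modal fusion step, and an auxiliary edge prediction. The short PyTorch sketch below illustrates only that asymmetric layout; it is not the authors' TANet implementation (their code is to be released at https://github.com/lc012463/TANet), and the toy patch‐embedding Transformer, three‐layer depth CNN, concatenation‐based fusion, and single‐convolution saliency/edge heads are all illustrative assumptions.

# Minimal sketch of the asymmetric two-stream idea described in the abstract.
# NOT the authors' TANet code; all module sizes and fusion choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTransformerRGBEncoder(nn.Module):
    """Patch-embed the RGB image and run a small Transformer encoder to obtain
    globally contextualised features (stand-in for the paper's Transformer backbone)."""
    def __init__(self, dim=96, depth=4, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, rgb):                      # rgb: (B, 3, H, W)
        x = self.patch_embed(rgb)                # (B, C, H/8, W/8)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence
        tokens = self.encoder(tokens)            # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class LightweightDepthEncoder(nn.Module):
    """A small CNN, trained from scratch, for the depth map: spatial structure
    only, far fewer parameters than the RGB stream."""
    def __init__(self, dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, depth):                    # depth: (B, 1, H, W)
        return self.net(depth)                   # (B, C, H/8, W/8)


class AsymmetricSODNet(nn.Module):
    """Fuse the two streams and predict a saliency map plus an auxiliary edge map
    (the paper's fusion and edge-enhancement modules are approximated by plain convs)."""
    def __init__(self, dim=96):
        super().__init__()
        self.rgb_enc = ToyTransformerRGBEncoder(dim)
        self.depth_enc = LightweightDepthEncoder(dim)
        self.fuse = nn.Sequential(nn.Conv2d(2 * dim, dim, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.sal_head = nn.Conv2d(dim, 1, 1)
        self.edge_head = nn.Conv2d(dim, 1, 1)

    def forward(self, rgb, depth):
        fr = self.rgb_enc(rgb)
        fd = self.depth_enc(depth)
        f = self.fuse(torch.cat([fr, fd], dim=1))          # cross-modal fusion
        size = rgb.shape[-2:]
        sal = F.interpolate(self.sal_head(f), size, mode='bilinear', align_corners=False)
        edge = F.interpolate(self.edge_head(f), size, mode='bilinear', align_corners=False)
        return torch.sigmoid(sal), torch.sigmoid(edge)


if __name__ == "__main__":
    net = AsymmetricSODNet()
    s, e = net(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 256, 256))
    print(s.shape, e.shape)                      # both (2, 1, 256, 256)

Even in this toy form the parameter asymmetry is visible: the Transformer stream carries most of the capacity for global context, while the depth stream stays small enough to train from scratch, which is the trade-off the abstract motivates.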

Bibliographic Details
Main Authors: Chang Liu, Gang Yang, Shuo Wang, Hangxu Wang, Yunhua Zhang, Yutao Wang (all Northeastern University, Shenyang, Liaoning, China)
Format: Article
Language: English
Published: Wiley, 2023-06-01 (vol. 17, no. 4, pp. 415–430)
Series: IET Computer Vision
ISSN: 1751-9632, 1751-9640
Subjects: computer vision; image segmentation; object detection
Online Access: https://doi.org/10.1049/cvi2.12177