Semantics-aware dynamic localization and refinement for referring image segmentation

Referring image segmentation segments an image from a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal features. The core idea of our approach is to leverage a continuously updated query as the representation of the target object and, at each iteration, strengthen multi-modal features strongly correlated to the query while weakening less related ones. As the query is initialized by language features and successively updated by object features, our algorithm gradually shifts from being localization-centric to segmentation-centric. This strategy enables the incremental recovery of missing object parts and/or removal of extraneous parts through iteration. Compared to its counterparts, our method is more versatile: it can be plugged into prior arts straightforwardly and consistently brings improvements. Experimental results on the challenging datasets of RefCOCO, RefCOCO+, and G-Ref demonstrate its advantage with respect to state-of-the-art methods.

Bibliographic Details

Main Authors: Yang, Z; Wang, J; Tang, Y; Chen, K; Zhao, H; Torr, PHS
Format: Conference item
Language: English
Published: AAAI Conference on Artificial Intelligence, 2023
Institution: University of Oxford
Record ID: oxford-uuid:240c4acb-30ea-4fb2-a605-b0d973fbec3d
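The iterative scheme described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the sigmoid gating over query-feature similarity and the weighted-average query update are simplifying assumptions standing in for the paper's learned modules.

```python
import numpy as np

def refine(features, query, num_iters=3):
    """Sketch of query-based iterative refinement (illustrative only).

    features: (N, D) multi-modal features for N spatial locations.
    query:    (D,) query, initialized from pooled language features.

    At each iteration, features strongly correlated with the current
    query are strengthened and weakly correlated ones are suppressed;
    the query is then re-estimated from the reweighted features.
    """
    feats = features.copy()
    for _ in range(num_iters):
        # cosine similarity of each location's feature with the query
        scores = feats @ query / (
            np.linalg.norm(feats, axis=1) * np.linalg.norm(query) + 1e-8
        )
        weights = 1.0 / (1.0 + np.exp(-scores))  # gate in (0, 1)
        feats = feats * weights[:, None]         # strengthen / weaken
        # update the query as a weight-normalized average of the
        # reweighted (object) features
        query = (weights[:, None] * feats).sum(0) / (weights.sum() + 1e-8)
    return feats, query
```

In the abstract's terms, early iterations are dominated by the language-initialized query (localization), while later iterations are driven by the query re-estimated from object features (segmentation).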