Language-aware vision transformer for referring segmentation

Full description

Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results can be harvested with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT as a unified framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
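
To make the core idea concrete, the sketch below illustrates early fusion via dense pixel-word attention inside an encoder stage: each flattened visual token attends over the word features of the expression and absorbs the result through a learned gate. This is a minimal, hedged illustration rather than the authors' released code; the module name PixelWordAttention, the tensor shapes, the gating design, and the key_dim default are assumptions made for exposition. The official implementation is in the LAVT-RS repository mentioned in the abstract.

```python
# Minimal sketch (not the authors' implementation): fusing word features into an
# intermediate visual feature map via pixel-word cross-attention with a gate.
# Shapes, names, and the gating design are illustrative assumptions.
import torch
import torch.nn as nn


class PixelWordAttention(nn.Module):
    """Each pixel attends over the word features to gather linguistic cues."""

    def __init__(self, vis_dim: int, lang_dim: int, key_dim: int = 256):
        super().__init__()
        self.query = nn.Linear(vis_dim, key_dim)   # project pixels to queries
        self.key = nn.Linear(lang_dim, key_dim)    # project words to keys
        self.value = nn.Linear(lang_dim, vis_dim)  # project words to values
        self.scale = key_dim ** -0.5
        # A learned gate controls how much linguistic context each pixel absorbs,
        # keeping the fused features compatible with the visual encoder path.
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis, words, word_mask):
        # vis:       (B, HW, Cv) flattened visual tokens from an encoder stage
        # words:     (B, T, Cl)  word embeddings from a language encoder
        # word_mask: (B, T)      1 for real words, 0 for padding
        attn = self.query(vis) @ self.key(words).transpose(1, 2) * self.scale
        attn = attn.masked_fill(word_mask[:, None, :] == 0, float("-inf"))
        attn = attn.softmax(dim=-1)                 # (B, HW, T)
        lang_per_pixel = attn @ self.value(words)   # pixel-specific linguistic cues
        return vis + self.gate(lang_per_pixel) * lang_per_pixel


if __name__ == "__main__":
    fuse = PixelWordAttention(vis_dim=96, lang_dim=768)
    vis = torch.randn(2, 56 * 56, 96)       # e.g. stage-1 features of a 224x224 image
    words = torch.randn(2, 20, 768)          # e.g. BERT outputs for a 20-token expression
    mask = torch.ones(2, 20)
    print(fuse(vis, words, mask).shape)      # torch.Size([2, 3136, 96])
```

In the paper's framing this fusion happens repeatedly in intermediate encoder layers, so later visual features are already language-aware and only a lightweight mask predictor is needed on top.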

Bibliographic information
Main Authors: Yang, Z.; Wang, J.; Ye, X.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P. H. S.
Format: Journal article
Language: English
Published: IEEE, 2024