Language-aware vision transformer for referring segmentation
Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image...
Main Authors: | Yang, Z, Wang, J, Ye, X, Tang, Y, Chen, K, Zhao, H, Torr, PHS |
---|---|
Format: | Journal article
Language: | English
Published: | IEEE 2024 |
_version_ | 1826315113601171456 |
---|---|
author | Yang, Z Wang, J Ye, X Tang, Y Chen, K Zhao, H Torr, PHS |
author_facet | Yang, Z Wang, J Ye, X Tang, Y Chen, K Zhao, H Torr, PHS |
author_sort | Yang, Z |
collection | OXFORD |
description | Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results can be harvested with a light-weight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT as a unified framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS. |
first_indexed | 2024-12-09T03:20:06Z |
format | Journal article |
id | oxford-uuid:3b9da0a1-9ba9-41f3-98ee-d1e6fa3ce3dc |
institution | University of Oxford |
language | English |
last_indexed | 2024-12-09T03:20:06Z |
publishDate | 2024 |
publisher | IEEE |
record_format | dspace |
spelling | oxford-uuid:3b9da0a1-9ba9-41f3-98ee-d1e6fa3ce3dc 2024-11-07T11:41:36Z. Language-aware vision transformer for referring segmentation. Journal article. http://purl.org/coar/resource_type/c_dcae04bc. uuid:3b9da0a1-9ba9-41f3-98ee-d1e6fa3ce3dc. English. Symplectic Elements. IEEE. 2024. Yang, Z; Wang, J; Ye, X; Tang, Y; Chen, K; Zhao, H; Torr, PHS. [Abstract identical to the description field above.] |
spellingShingle | Yang, Z Wang, J Ye, X Tang, Y Chen, K Zhao, H Torr, PHS Language-aware vision transformer for referring segmentation |
title | Language-aware vision transformer for referring segmentation |
title_full | Language-aware vision transformer for referring segmentation |
title_fullStr | Language-aware vision transformer for referring segmentation |
title_full_unstemmed | Language-aware vision transformer for referring segmentation |
title_short | Language-aware vision transformer for referring segmentation |
title_sort | language aware vision transformer for referring segmentation |
work_keys_str_mv | AT yangz languageawarevisiontransformerforreferringsegmentation AT wangj languageawarevisiontransformerforreferringsegmentation AT yex languageawarevisiontransformerforreferringsegmentation AT tangy languageawarevisiontransformerforreferringsegmentation AT chenk languageawarevisiontransformerforreferringsegmentation AT zhaoh languageawarevisiontransformerforreferringsegmentation AT torrphs languageawarevisiontransformerforreferringsegmentation |
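The abstract in the description field above centres on fusing word-level linguistic features into intermediate stages of a vision Transformer encoder through a language-aware (pixel-word) attention module, so that pixel-specific linguistic cues are injected before decoding. The Python sketch below illustrates that general idea only; the module name `PixelWordAttention`, the 1x1 projections, the scaled dot-product attention form, and the tanh-gated residual fusion are illustrative assumptions, not the authors' LAVT implementation.

```python
# Minimal, hypothetical sketch of pixel-word attention for early
# vision-language fusion, in the spirit of the abstract above.
# Shapes, projection sizes, and the tanh gate are assumptions.
import torch
import torch.nn as nn


class PixelWordAttention(nn.Module):
    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.q = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)    # per-pixel queries
        self.k = nn.Linear(lang_dim, vis_dim)                   # per-word keys
        self.v = nn.Linear(lang_dim, vis_dim)                   # per-word values
        self.gate = nn.Sequential(nn.Conv2d(vis_dim, vis_dim, 1), nn.Tanh())

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis:  (B, C, H, W) visual feature map from an encoder stage
        # lang: (B, T, Cl)   word-level features from a language encoder
        b, c, h, w = vis.shape
        q = self.q(vis).flatten(2).transpose(1, 2)              # (B, HW, C)
        k = self.k(lang)                                        # (B, T, C)
        v = self.v(lang)                                        # (B, T, C)
        # Each pixel attends over the words of the referring expression.
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, HW, T)
        ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)    # pixel-specific linguistic context
        # Gated residual fusion: feed the linguistic context back into the
        # visual stream before the next encoder stage.
        return vis + self.gate(ctx) * ctx


if __name__ == "__main__":
    fuse = PixelWordAttention(vis_dim=256, lang_dim=768)
    vis = torch.randn(2, 256, 30, 30)    # dummy visual features
    lang = torch.randn(2, 20, 768)       # dummy word features (20 tokens)
    print(fuse(vis, lang).shape)         # torch.Size([2, 256, 30, 30])
```

In this reading, each pixel queries the word features, so the attended linguistic context is spatially specific; fusing it inside the encoder is what the abstract credits with making a light-weight mask predictor sufficient downstream.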