Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Referring video object segmentation (R-VOS) is a fundamental vision-language task that aims to segment the target referred to by a language expression in all video frames. Existing query-based R-VOS methods have conducted in-depth exploration of the interaction and alignment between visual and linguisti...

Bibliographic Details
Main Authors: Ao Wu, Rong Wang, Quange Tan, Zhenfeng Song
Format: Article
Language: English
Published: MDPI AG 2024-08-01
Series: Sensors
Online Access: https://www.mdpi.com/1424-8220/24/16/5375