Decoupled Cross-Modal Transformer for Referring Video Object Segmentation
Referring video object segmentation (R-VOS) is a fundamental vision-language task that aims to segment the target referred to by a language expression in all video frames. Existing query-based R-VOS methods have explored in depth the interaction and alignment between visual and linguisti...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-08-01 |
Series: | Sensors |
Subjects: | |
Online Access: | https://www.mdpi.com/1424-8220/24/16/5375 |