Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Referring video object segmentation (R-VOS) is a fundamental vision-language task that aims to segment the target referred to by a language expression in all video frames. Existing query-based R-VOS methods have conducted in-depth exploration of the interaction and alignment between visual and linguisti...

Bibliographic Details
Main Authors:	Ao Wu, Rong Wang, Quange Tan, Zhenfeng Song
Format:	Article
Language:	English
Published:	MDPI AG 2024-08-01
Series:	Sensors
Subjects:	referring video object segmentation; cross-modal transformer; decoupled queries; feature pyramid network
Online Access:	https://www.mdpi.com/1424-8220/24/16/5375
