ClearCLIP: decomposing CLIP representations for dense vision-language inference

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate th...

Bibliographic Details
Main Authors:	Lan, Mengcheng; Chen, Chaofeng; Ke, Yiping; Wang, Xinjiang; Feng, Litong; Zhang, Wayne
Other Authors:	College of Computing and Data Science
Format:	Conference Paper
Language:	English
Published:	2024
Subjects:	Computer and Information Science; Semantic segmentation; Vision language model
Online Access:	https://hdl.handle.net/10356/180251 http://arxiv.org/abs/2407.12442v1

Similar Items