ClearCLIP: decomposing CLIP representations for dense vision-language inference
Main Authors:
Other Authors:
Format: Conference Paper
Language: English
Published: 2024
Subjects:
Online Access: https://hdl.handle.net/10356/180251 https://arxiv.org/abs/2407.12442v1
Summary: Despite the success of large-scale pretrained Vision-Language Models (VLMs),
especially CLIP, in various open-vocabulary tasks, their application to semantic
segmentation remains challenging, producing noisy segmentation maps with
mis-segmented regions. In this paper, we carefully re-investigate the
architecture of CLIP, and identify residual connections as the primary source
of noise that degrades segmentation quality. With a comparative analysis of
statistical properties in the residual connection and the attention output
across different pretrained models, we discover that CLIP's image-text
contrastive training paradigm emphasizes global features at the expense of
local discriminability, leading to noisy segmentation results. In response, we
propose ClearCLIP, a novel approach that decomposes CLIP's representations to
enhance open-vocabulary semantic segmentation. We introduce three simple
modifications to the final layer: removing the residual connection,
implementing the self-self attention, and discarding the feed-forward network.
ClearCLIP consistently generates clearer and more accurate segmentation maps
and outperforms existing approaches across multiple benchmarks, affirming the
significance of our discoveries.
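
The three final-layer modifications described in the summary can be pictured with a short PyTorch-style sketch. This is not the authors' implementation: the module name `ClearCLIPFinalBlock`, the fused `qkv`/`proj` projections, and the query-query form of "self-self" attention are illustrative assumptions about how the last CLIP ViT layer could be rewired.

```python
# Minimal sketch (assumptions, not the released ClearCLIP code) of a CLIP ViT
# final layer with: (1) self-self attention, (2) no residual connection,
# (3) no feed-forward network.
import torch
import torch.nn as nn


class ClearCLIPFinalBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.ln = nn.LayerNorm(dim)
        # CLIP-style fused qkv and output projections; in practice their
        # weights would be copied from the pretrained final layer.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) hidden states entering the final layer
        b, n, d = x.shape
        q, k, v = self.qkv(self.ln(x)).chunk(3, dim=-1)  # k is unused below
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # (1) "self-self" attention: queries attend to queries (q-q) rather
        #     than the usual query-key pairing.
        attn = (q @ q.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)

        # (2) no residual connection: return the attention output alone,
        #     without adding x back in.
        # (3) no feed-forward network: the MLP sub-block is skipped entirely.
        return self.proj(out)


if __name__ == "__main__":
    # Toy usage: 1 image, 197 tokens (CLS + 14x14 patches), 768-dim features.
    tokens = torch.randn(1, 197, 768)
    block = ClearCLIPFinalBlock(dim=768, num_heads=12)
    dense_features = block(tokens)
    print(dense_features.shape)  # torch.Size([1, 197, 768])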