CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation

Bibliographic Details
Main Authors: Shi-Cheng Guo, Shang-Kun Liu, Jing-Yu Wang, Wei-Min Zheng, Cheng-Yu Jiang
Format: Article
Language: English
Published: MDPI AG, 2023-09-01
Series: Entropy, Vol. 25, No. 9, Article 1353
ISSN: 1099-4300
DOI: 10.3390/e25091353
Author Affiliation (all authors): College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Collection: Directory of Open Access Journals (DOAJ)
Subjects: few-shot semantic segmentation; few-shot learning; semantic segmentation; multi-modal; CLIP
Online Access: https://www.mdpi.com/1099-4300/25/9/1353

Description
Recent research has shown that vision–language pretrained models perform well on traditional vision tasks. CLIP, as the most influential such work, has garnered significant attention from researchers. Thanks to its excellent visual representation capabilities, many recent studies have applied CLIP to pixel-level tasks. We explore the potential of CLIP in the field of few-shot segmentation. The current mainstream approach is to use support and query features to generate class prototypes and then match the prototype features against image features. We propose a new method that uses CLIP to extract text features for a specific class; these text features are then used as training samples that participate in the model's training process. The addition of text features enables the model to extract features containing richer semantic information, making it easier to capture potential class information. To better match the query image features, we also propose a new prototype generation method that incorporates multi-modal fusion features of text and images into the prototype generation process. Adaptive query prototypes are generated by combining foreground and background information from the images with the multi-modal support prototype, allowing for better matching of image features and improved segmentation accuracy. We provide a new perspective on the task of few-shot segmentation in multi-modal scenarios. Experiments demonstrate that our proposed method achieves excellent results on two common datasets, PASCAL-5^i and COCO-20^i.
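
To make the prototype-matching idea in the abstract concrete, the short Python sketch below shows one plausible way to combine a CLIP text embedding for a class with a masked-average-pooled support prototype and match the fused prototype against query features by cosine similarity. This is not the authors' implementation: the prompt template, the simple additive fusion, and the assumption that the support/query feature maps already live in CLIP's joint embedding space (e.g., via a learned projection head) are illustrative choices only.

# Minimal sketch of a CLIP-text-augmented prototype matcher for 1-shot
# segmentation. NOT the paper's method: prompt template, additive fusion,
# and the shared embedding space for image feature maps are assumptions.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)  # RN50 text embeddings are 1024-d

def text_prototype(class_name: str) -> torch.Tensor:
    """Encode the class name with CLIP's text encoder (L2-normalized, (1, D))."""
    tokens = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    with torch.no_grad():
        t = model.encode_text(tokens).float()
    return F.normalize(t, dim=-1)

def masked_average_pool(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Image-side support prototype: average the support feature map over the
    foreground mask. feat: (1, D, H, W); mask: (1, 1, h, w) binary."""
    mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
    proto = (feat * mask).sum(dim=(-2, -1)) / (mask.sum(dim=(-2, -1)) + 1e-6)
    return F.normalize(proto, dim=-1)  # (1, D)

def foreground_score_map(query_feat: torch.Tensor, support_feat: torch.Tensor,
                         support_mask: torch.Tensor, class_name: str) -> torch.Tensor:
    """Fuse the image and text prototypes (naive sum-and-normalize), then score
    every query location against the fused prototype by cosine similarity."""
    proto = F.normalize(masked_average_pool(support_feat, support_mask)
                        + text_prototype(class_name), dim=-1)      # (1, D)
    q = F.normalize(query_feat, dim=1)                             # (1, D, H, W)
    return torch.einsum("bdhw,bd->bhw", q, proto)                  # (1, H, W)

In a k-shot setting one would average the masked-pooled prototypes over the k support images before fusion; the adaptive query prototype described in the abstract (built from estimated query foreground and background) would further refine this matching, but that refinement is beyond this sketch.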