CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation

Bibliographic Details
Main Authors: Shi-Cheng Guo, Shang-Kun Liu, Jing-Yu Wang, Wei-Min Zheng, Cheng-Yu Jiang
Format: Article
Language: English
Published: MDPI AG, 2023-09-01
Series: Entropy, Vol. 25, No. 9, Article 1353
ISSN: 1099-4300
DOI: 10.3390/e25091353
Author Affiliation (all authors): College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Collection: Directory of Open Access Journals (DOAJ)
Subjects: few-shot semantic segmentation; few-shot learning; semantic segmentation; multi-modal; CLIP
Online Access: https://www.mdpi.com/1099-4300/25/9/1353

Description
Recent research has shown that vision–language pretrained models perform well on traditional vision tasks. CLIP, as the most influential such work, has garnered significant attention from researchers. Thanks to its excellent visual representation capabilities, many recent studies have applied CLIP to pixel-level tasks. We explore the potential of CLIP in the field of few-shot segmentation. The current mainstream approach is to use support and query features to generate class prototypes and then match the prototype features against image features. We propose a new method that uses CLIP to extract text features for a specific class; these text features are then used as training samples that participate in the model's training process. The addition of text features enables the model to extract features containing richer semantic information, making it easier to capture potential class information. To better match the query image features, we also propose a new prototype generation method that incorporates multi-modal fusion features of text and images into the prototype generation process. Adaptive query prototypes are generated by combining foreground and background information from the images with the multi-modal support prototype, allowing for better matching of image features and improved segmentation accuracy. We provide a new perspective on the task of few-shot segmentation in multi-modal scenarios. Experiments demonstrate that our proposed method achieves excellent results on two common datasets, PASCAL-5^i and COCO-20^i.
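
To make the prototype-matching idea in the abstract concrete, the short Python sketch below shows one plausible way to combine a CLIP text embedding for a class with a masked-average-pooled support prototype and match the fused prototype against query features by cosine similarity. This is not the authors' implementation: the prompt template, the simple additive fusion, and the assumption that the support/query feature maps already live in CLIP's joint embedding space (e.g., via a learned projection head) are illustrative choices only.

# Minimal sketch of a CLIP-text-augmented prototype matcher for 1-shot
# segmentation. NOT the paper's method: prompt template, additive fusion,
# and the shared embedding space for image feature maps are assumptions.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)  # RN50 text embeddings are 1024-d

def text_prototype(class_name: str) -> torch.Tensor:
    """Encode the class name with CLIP's text encoder (L2-normalized, (1, D))."""
    tokens = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    with torch.no_grad():
        t = model.encode_text(tokens).float()
    return F.normalize(t, dim=-1)

def masked_average_pool(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Image-side support prototype: average the support feature map over the
    foreground mask. feat: (1, D, H, W); mask: (1, 1, h, w) binary."""
    mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
    proto = (feat * mask).sum(dim=(-2, -1)) / (mask.sum(dim=(-2, -1)) + 1e-6)
    return F.normalize(proto, dim=-1)  # (1, D)

def foreground_score_map(query_feat: torch.Tensor, support_feat: torch.Tensor,
                         support_mask: torch.Tensor, class_name: str) -> torch.Tensor:
    """Fuse the image and text prototypes (naive sum-and-normalize), then score
    every query location against the fused prototype by cosine similarity."""
    proto = F.normalize(masked_average_pool(support_feat, support_mask)
                        + text_prototype(class_name), dim=-1)      # (1, D)
    q = F.normalize(query_feat, dim=1)                             # (1, D, H, W)
    return torch.einsum("bdhw,bd->bhw", q, proto)                  # (1, H, W)

In a k-shot setting one would average the masked-pooled prototypes over the k support images before fusion; the adaptive query prototype described in the abstract (built from estimated query foreground and background) would further refine this matching, but that refinement is beyond this sketch.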