Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively

The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes.
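To illustrate the kind of knowledge transfer the abstract describes, the sketch below shows a toy version of a SAM2CLIP-style distillation objective: an adapter maps a SAM feature into CLIP's embedding space, and the loss penalizes misalignment (1 minus cosine similarity) against the frozen CLIP feature. All names here are hypothetical, and the linear `toy_adapter` merely stands in for the paper's learnable transformer adapters; this is a conceptual sketch, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sam2clip_distill_loss(sam_feat, clip_feat, adapter):
    """Distillation objective: push the adapted SAM feature toward
    the (frozen) CLIP feature. Zero when perfectly aligned."""
    return 1.0 - cosine(adapter(sam_feat), clip_feat)

def toy_adapter(feat):
    # Hypothetical stand-in adapter; uniform scaling preserves direction,
    # so it does not change cosine similarity.
    return [2.0 * x for x in feat]

sam_feat = [0.5, 1.0, -0.5]
clip_feat = [1.0, 2.0, -1.0]   # parallel to the adapted SAM feature
loss = sam2clip_distill_loss(sam_feat, clip_feat, toy_adapter)
# loss is ~0.0 here because the adapted feature is parallel to the CLIP feature;
# orthogonal features would give a loss of 1.0.
```

In a real training loop, minimizing such a loss over many image features is what lets the adapted backbone inherit CLIP's recognition space while retaining SAM's segmentation behavior.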


Bibliographic Details
Main Authors: Yuan, Haobo, Li, Xiangtai, Zhou, Chong, Li, Yining, Chen, Kai, Loy, Chen Change
Other Authors: College of Computing and Data Science
Format: Conference Paper
Language: English
Published: 2024
Subjects: Computer and Information Science; Scene understanding; Promptable segmentation
Online Access: https://hdl.handle.net/10356/180250
http://arxiv.org/abs/2401.02955v2
Institution: Nanyang Technological University (NTU)
Research group: S-Lab
Conference: 2024 European Conference on Computer Vision (ECCV)
Version: Submitted/Accepted version
Citation: Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K. & Loy, C. C. (2024). Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively. 2024 European Conference on Computer Vision (ECCV). https://dx.doi.org/10.48550/arXiv.2401.02955
DOI: 10.48550/arXiv.2401.02955
Research data DOI: 10.21979/N9/L05ULT
Funding: Ministry of Education (MOE). This study is supported under the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). The project is also supported by Singapore MOE AcRF Tier 1 (RG16/21) and the National Key R&D Program of China (No. 2022ZD0161600).
Rights: © 2024 ECCV. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder.