Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively

The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes.
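To illustrate the kind of knowledge transfer the abstract describes, the sketch below shows a toy version of a SAM2CLIP-style distillation objective: an adapter maps a SAM feature into CLIP's embedding space, and the loss penalizes misalignment (1 minus cosine similarity) against the frozen CLIP feature. All names here are hypothetical, and the linear `toy_adapter` merely stands in for the paper's learnable transformer adapters; this is a conceptual sketch, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sam2clip_distill_loss(sam_feat, clip_feat, adapter):
    """Distillation objective: push the adapted SAM feature toward
    the (frozen) CLIP feature. Zero when perfectly aligned."""
    return 1.0 - cosine(adapter(sam_feat), clip_feat)

def toy_adapter(feat):
    # Hypothetical stand-in adapter; uniform scaling preserves direction,
    # so it does not change cosine similarity.
    return [2.0 * x for x in feat]

sam_feat = [0.5, 1.0, -0.5]
clip_feat = [1.0, 2.0, -1.0]   # parallel to the adapted SAM feature
loss = sam2clip_distill_loss(sam_feat, clip_feat, toy_adapter)
# loss is ~0.0 here because the adapted feature is parallel to the CLIP feature;
# orthogonal features would give a loss of 1.0.
```

In a real training loop, minimizing such a loss over many image features is what lets the adapted backbone inherit CLIP's recognition space while retaining SAM's segmentation behavior.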


Bibliographic Details
Main Authors: Yuan, Haobo, Li, Xiangtai, Zhou, Chong, Li, Yining, Chen, Kai, Loy, Chen Change
Other Authors: College of Computing and Data Science
Format: Conference Paper
Language: English
Published: 2024
Subjects: Computer and Information Science; Scene understanding; Promptable segmentation
Online Access: https://hdl.handle.net/10356/180250
http://arxiv.org/abs/2401.02955v2
Institution: Nanyang Technological University (NTU)
Research group: S-Lab
Conference: 2024 European Conference on Computer Vision (ECCV)
Version: Submitted/Accepted version
Citation: Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K. & Loy, C. C. (2024). Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively. 2024 European Conference on Computer Vision (ECCV). https://dx.doi.org/10.48550/arXiv.2401.02955
DOI: 10.48550/arXiv.2401.02955
Research data DOI: 10.21979/N9/L05ULT
Funding: Ministry of Education (MOE). This study is supported under the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). The project is also supported by Singapore MOE AcRF Tier 1 (RG16/21) and the National Key R&D Program of China (No. 2022ZD0161600).
Rights: © 2024 ECCV. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder.