Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively
The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework.
Main Authors: | Yuan, Haobo; Li, Xiangtai; Zhou, Chong; Li, Yining; Chen, Kai; Loy, Chen Change |
---|---|
Other Authors: | College of Computing and Data Science |
Format: | Conference Paper |
Language: | English |
Published: | 2024 |
Subjects: | Computer and Information Science; Scene understanding; Promptable segmentation |
Online Access: | https://hdl.handle.net/10356/180250 http://arxiv.org/abs/2401.02955v2 |
author | Yuan, Haobo; Li, Xiangtai; Zhou, Chong; Li, Yining; Chen, Kai; Loy, Chen Change |
author2 | College of Computing and Data Science |
collection | NTU |
description | The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided by training on image classification data, our method can segment and recognize approximately 22,000 classes. |
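The SAM2CLIP module described in the abstract (distilling the frozen SAM encoder's knowledge into CLIP through learnable transformer adapters) can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, assuming hypothetical module names, feature dimensions, and an MSE distillation objective; it is not the authors' released implementation.

```python
# Minimal, hypothetical sketch of SAM2CLIP-style distillation: a small
# learnable transformer adapter maps (frozen) CLIP patch tokens toward the
# (frozen) SAM encoder's feature space. All names and dimensions here are
# illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerAdapter(nn.Module):
    """Learnable adapter: CLIP token space -> SAM token space."""
    def __init__(self, clip_dim=1024, sam_dim=256, depth=2, heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, sam_dim)
        layer = nn.TransformerEncoderLayer(d_model=sam_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, clip_tokens):
        # clip_tokens: (B, N, clip_dim) patch tokens from a frozen CLIP encoder
        return self.encoder(self.proj(clip_tokens))  # (B, N, sam_dim)

def sam2clip_loss(adapter, clip_tokens, sam_tokens):
    """Distillation loss aligning adapted CLIP tokens with SAM encoder tokens."""
    return F.mse_loss(adapter(clip_tokens), sam_tokens)

# Toy usage with random stand-in features; real training would extract
# both token sets from the same images using the frozen backbones.
adapter = TransformerAdapter()
clip_tokens = torch.randn(2, 196, 1024)  # placeholder CLIP patch tokens
sam_tokens = torch.randn(2, 196, 256)    # placeholder SAM encoder tokens
sam2clip_loss(adapter, clip_tokens, sam_tokens).backward()
```

The CLIP2SAM direction described in the abstract would run the analogous transfer the other way, injecting CLIP's recognition features into SAM's mask decoder; the same adapter pattern applies.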
format | Conference Paper |
id | ntu-10356/180250 |
institution | Nanyang Technological University |
language | English |
publishDate | 2024 |
record_format | dspace |
conference | 2024 European Conference on Computer Vision (ECCV) |
affiliations | College of Computing and Data Science; S-Lab |
funding | Ministry of Education (MOE). This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). The project is also supported by Singapore MOE AcRF Tier 1 (RG16/21) and the National Key R&D Program of China (No. 2022ZD0161600). |
grants | IAF-ICP; RG16/21 |
version | Submitted/Accepted version |
citation | Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K. & Loy, C. C. (2024). Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively. 2024 European Conference on Computer Vision (ECCV). https://dx.doi.org/10.48550/arXiv.2401.02955 |
doi | 10.48550/arXiv.2401.02955 |
dataset doi | 10.21979/N9/L05ULT |
rights | © 2024 ECCV. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. |
file format | application/pdf |
title | Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively |
topic | Computer and Information Science; Scene understanding; Promptable segmentation |
url | https://hdl.handle.net/10356/180250 http://arxiv.org/abs/2401.02955v2 |