EdgeSAM: prompt-in-the-loop distillation for on-device deployment of SAM
Language: English
Published: 2024
Online Access: https://hdl.handle.net/10356/180234 http://arxiv.org/abs/2312.06660v2
Summary: This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, being over 7 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available at https://www.mmlab-ntu.com/project/edgesam.
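The "prompt-in-the-loop" idea in the summary can be sketched as follows. This is a deliberately toy illustration, not EdgeSAM's actual training code: the teacher and student are stand-in scalar-weight functions, the prompt-sampling rule is hypothetical, and the loss is plain mean squared error between teacher and student mask logits, with each round's prompt derived from the teacher's current mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the models in the abstract: a frozen SAM "teacher"
# (ViT encoder + mask decoder) and a lightweight CNN "student". Every
# function and parameter here is hypothetical.
def teacher_mask(image, prompt):
    # Frozen teacher: mask logits as a fixed function of image and prompt.
    return np.tanh(image * 0.5 + 0.1 * prompt.sum())

def student_mask(image, prompt, w):
    # Student with a single trainable scalar `w` standing in for its weights.
    return np.tanh(image * w + 0.1 * prompt.sum())

def sample_point_prompt(mask):
    # "Prompt in the loop": derive the next point prompt from the current
    # mask (here, the normalized location of its peak response).
    idx = np.unravel_index(np.argmax(mask), mask.shape)
    return np.asarray(idx, dtype=float) / mask.shape[0]

def distill_step(image, w, lr=0.1, rounds=3):
    prompt = rng.uniform(size=2)  # initial random point prompt (toy)
    for _ in range(rounds):
        t = teacher_mask(image, prompt)
        s = student_mask(image, prompt, w)
        # Gradient of the MSE distillation loss with respect to `w`.
        grad = np.mean(2.0 * (s - t) * (1.0 - s**2) * image)
        w -= lr * grad
        prompt = sample_point_prompt(t)  # refine the prompt each round
    return w

image = rng.standard_normal((8, 8))
w = 0.0
for _ in range(200):
    w = distill_step(image, w)
# After training, the student's weight approaches the teacher's (0.5).
```

The point of distilling with the decoder and prompts in the loop, rather than matching encoder features alone, is that the loss is measured where it matters: on the masks produced in response to user prompts.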