In this work, we introduce Semantic-SAM, an augmented image segmentation foundation for segmenting and recognizing anything at desired granularities. Compared to the foundational segmentation model SAM, our model has two unique advantages: (i) granularity-controllability, in that the model can produce segmentation masks at any desired granularity, from objects to parts to both; (ii) semantic-awareness, in that the model simultaneously predicts semantic labels for masks at different granularities. To enable multi-granularity capabilities, we propose a multi-choice learning scheme, where each click point generates a set of masks at multiple levels of granularity, corresponding to a set of ground-truth masks. To achieve semantic awareness, we consolidate multiple datasets of different levels of granularity and train our model using decoupled object- and part-based tasks to facilitate knowledge sharing and transfer among different tasks. To the best of our knowledge, this work is the first attempt to jointly train a model on SA-1B, instance-level, and part-level segmentation datasets. Experimental results and visualizations demonstrate that our model achieves both granularity control and semantic awareness. Furthermore, we show that multi-task training using the segmentation task defined on SA-1B and other segmentation tasks (e.g., panoptic and part segmentation) leads to performance gains on all segmentation tasks. In particular, we achieve a new state of the art on COCO panoptic segmentation, 60.2 PQ, by adding SAM data.
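The multi-choice scheme pairs the set of masks predicted for one click with a set of ground-truth masks. As a minimal sketch of how such a pairing could be computed, the Python below assigns each ground-truth mask to a distinct predicted mask by maximizing total IoU. The IoU criterion, the brute-force search, and the names `iou`, `match_masks`, and `box` are illustrative assumptions; the abstract does not specify the matcher or the loss actually used.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(predicted, ground_truth):
    """Assign each ground-truth mask to a distinct predicted mask.

    Returns (assignment, total_iou), where assignment[j] is the index of the
    predicted mask matched to ground_truth[j]. Brute force over permutations,
    which is adequate for the handful of masks produced per click.
    (Hypothetical helper: IoU is assumed as the matching score.)
    """
    if len(ground_truth) > len(predicted):
        raise ValueError("need at least as many predictions as ground-truth masks")
    best_assignment, best_score = (), -1.0
    for perm in permutations(range(len(predicted)), len(ground_truth)):
        score = sum(iou(predicted[i], ground_truth[j]) for j, i in enumerate(perm))
        if score > best_score:
            best_assignment, best_score = perm, score
    return list(best_assignment), best_score


def box(r0, r1, c0, c1):
    """Rectangular toy mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy example: one click, three predicted masks, two ground-truth masks.
predicted = [box(0, 8, 0, 8), box(2, 4, 2, 4), box(0, 3, 0, 3)]
ground_truth = [box(2, 4, 2, 4), box(0, 8, 0, 7)]  # a part, then the whole object
assignment, total = match_masks(predicted, ground_truth)
print(assignment, total)
```

In this toy case the part-level ground truth is paired with the small predicted mask and the object-level ground truth with the large one, while the third prediction stays unmatched. A polynomial-time assignment solver would replace the brute-force loop for larger mask sets.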