Vision-and-language foundation models have shown impressive performance on zero-shot image classification, where the target classes are represented by text descriptions with no labeled image examples. Recent work extends this powerful image-text correspondence to open-vocabulary segmentation, i.e., predicting pixel-text correspondence without pixel-level supervision on the unseen target classes. Much of the prior work casts this task as pixel-to-text classification, without the goal of comprehending objects within an image. We believe segmentation is a visual understanding task and advocate decoupling segmentation from visual grounding. To this end, we introduce Lazy Visual Grounding for zero-shot open-vocabulary segmentation. Lazy visual grounding first discovers distinguishable visual units as object masks with iterative graph cuts, and then assigns text to the discovered visual objects in a late-interaction manner. Our model is training-free yet shows strong performance on four public datasets: Pascal VOC, COCO-object, COCO-stuff, and ADE20K. In particular, it produces visually appealing segmentation results, indicating the model's capability to comprehend visual objectness. Code and data will be released upon acceptance.
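To make the late-interaction step concrete, the following is a minimal sketch of how text could be assigned to masks that have already been discovered. It is an illustration under stated assumptions, not the released implementation: the function name `assign_text_to_masks`, the use of mask-average pooling, and cosine similarity as the matching score are all hypothetical choices, and the mask discovery step (iterative graph cuts) is assumed to have run beforehand.

```python
import numpy as np


def assign_text_to_masks(pixel_feats, masks, text_embeds):
    """Assign one text class to each pre-computed object mask.

    Hypothetical helper for illustration only.
    pixel_feats: (H, W, D) float array of per-pixel image features.
    masks:       (K, H, W) boolean array, one mask per discovered object.
    text_embeds: (C, D) float array, one embedding per class description.
    Returns a (K,) integer array of class indices.
    """
    weights = masks.astype(np.float64)

    # Mask-average pooling (an assumption): one feature vector per mask.
    area = np.maximum(weights.sum(axis=(1, 2)), 1.0)  # avoid divide-by-zero
    mask_feats = np.einsum("khw,hwd->kd", weights, pixel_feats) / area[:, None]

    # Cosine similarity between mask features and text embeddings.
    eps = 1e-8
    mask_feats = mask_feats / (np.linalg.norm(mask_feats, axis=1, keepdims=True) + eps)
    text = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + eps)
    similarity = mask_feats @ text.T  # shape (K, C)

    # Text is attached only here, after the masks are fixed ("late").
    return similarity.argmax(axis=1)
```

The point of the sketch is the ordering: the masks are computed without any text input, and the class names enter only in the final similarity step, which is what decouples segmentation from grounding.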