Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual and textual features at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to attain favorable pixel-level segmentation results. In this work, we investigate CLIP's spatial reasoning mechanism and identify that its failure in dense prediction is caused by a location misalignment issue in the self-attention process. Based on this observation, we propose a training-free adaptation approach that equips CLIP for semantic segmentation: it introduces only a very simple modification to CLIP, yet effectively addresses the location misalignment issue. Specifically, we reform the self-attention mechanism by leveraging query-to-query and key-to-key similarity to determine the attention scores. Remarkably, this minimal modification significantly enhances CLIP's capability in dense prediction, improving the original CLIP's average zero-shot mIoU over eight semantic segmentation benchmarks from 14.1% to 38.2%, and outperforming the existing state of the art (33.9%) by a large margin.
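To make the described modification concrete, below is a minimal, hedged sketch of a single-head self-attention layer in which the attention scores are derived from query-to-query and key-to-key similarity, as stated in the abstract. The function names and the way the two attention maps are combined (a simple sum here) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch, assuming a single-head formulation for clarity.
# How the query-query and key-key maps are combined is an assumption;
# the abstract only states that both similarities determine the scores.
import torch
import torch.nn.functional as F


def standard_attention(q, k, v, scale):
    # Original CLIP attention: scores from query-to-key similarity.
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


def qq_kk_attention(q, k, v, scale):
    # Training-free modification: attention scores come from
    # query-to-query and key-to-key similarity instead.
    attn_qq = F.softmax(q @ q.transpose(-2, -1) * scale, dim=-1)
    attn_kk = F.softmax(k @ k.transpose(-2, -1) * scale, dim=-1)
    attn = attn_qq + attn_kk  # assumed combination of the two maps
    return attn @ v


if __name__ == "__main__":
    # Toy example: 196 patch tokens with 64-dimensional q/k/v projections.
    n_tokens, dim = 196, 64
    q, k, v = (torch.randn(n_tokens, dim) for _ in range(3))
    scale = dim ** -0.5
    out = qq_kk_attention(q, k, v, scale)
    print(out.shape)  # torch.Size([196, 64])
```

Because q @ q.T and k @ k.T are symmetric and peak on the diagonal, each token attends most strongly to itself and to tokens with similar projections, which is one plausible reading of how this change mitigates the location misalignment described above.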