Weakly Supervised Semantic Segmentation (WSSS) with image-level supervision typically acquires object localization information from Class Activation Maps (CAMs). While Vision Transformers (ViTs) have been increasingly explored in WSSS for their superior ability to capture global context, CAMs from ViTs still show imprecise localization near object boundaries and false-positive activations. This paper proposes a novel WSSS framework that targets these issues using information from the frequency domain. In our framework, we introduce the Magnitude-mixing Aided Phase Accentuation (MAPA) module, which guides the classifier to prioritize phase information, which carries high-level semantic details. By perturbing and mixing the magnitude, MAPA guides the classifier to accentuate and concentrate on the shape information in the phase, thereby leading to finer distinctions in the boundary regions of CAMs. Additionally, inspired by empirical observations that classification "shortcuts" in the frequency domain can induce false positives in CAMs, we introduce a Frequency Shortcut Deterrent (FSD) module. This module aims to discourage the formation of such shortcuts, thereby mitigating false positives. Our approach achieves new state-of-the-art performance on both the PASCAL VOC 2012 and MS COCO 2014 datasets. The code will be released.
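The general idea of perturbing magnitude while preserving phase can be illustrated with a minimal NumPy sketch. This is not the MAPA module itself, whose details the abstract does not give. The function name `mix_magnitude_keep_phase`, the mixing weight `lam`, and the choice of a simple linear blend of magnitudes are all assumptions made for illustration.

```python
import numpy as np

def mix_magnitude_keep_phase(img_a, img_b, lam=0.5):
    """Hypothetical helper: blend the FFT magnitudes of two images
    while keeping the phase of img_a (illustrative only)."""
    # 2D FFT over the spatial axes (height, width).
    fa = np.fft.fft2(img_a, axes=(0, 1))
    fb = np.fft.fft2(img_b, axes=(0, 1))

    # Split img_a into magnitude and phase; only the magnitude of img_b is used.
    mag_a, phase_a = np.abs(fa), np.angle(fa)
    mag_b = np.abs(fb)

    # Assumed mixing rule: linear interpolation of the two magnitudes.
    mag_mix = lam * mag_a + (1.0 - lam) * mag_b

    # Recombine the mixed magnitude with the original phase and invert.
    mixed = mag_mix * np.exp(1j * phase_a)
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))

# Usage with random stand-in "images" of shape (H, W).
rng = np.random.default_rng(0)
img_a = rng.random((8, 8))
img_b = rng.random((8, 8))
out = mix_magnitude_keep_phase(img_a, img_b, lam=0.5)
```

With `lam=1.0` the function returns `img_a` unchanged (up to floating-point error). Smaller values alter the magnitude spectrum while leaving the phase of `img_a` intact, so a classifier trained on such inputs cannot rely on magnitude statistics alone.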