This work addresses cross-domain semantic segmentation. While recent "encoder-heavy" CNNs and transformers have led to significant advances, we introduce ADFormer, a new transformer with a lighter encoder and a more complex decoder that uses query tokens to predict segmentation masks. The domain gap between the source and target domains is reduced with two mechanisms. First, we decompose cross-attention in the decoder into domain-independent and domain-specific parts, so that the query tokens are forced to interact with the domain-independent aspects of the image tokens, which are shared by the source and target domains, rather than with their domain-specific counterparts, which induce the domain gap. Second, we use a gradient reverse block to control back-propagation of the gradient, and hence introduce adversarial learning in the decoder of ADFormer. Our results on two benchmark domain shifts -- GTA to Cityscapes and SYNTHIA to Cityscapes -- show that ADFormer outperforms SOTA "encoder-heavy" methods at significantly lower complexity.
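
The second mechanism can be illustrated with a minimal sketch, assuming the gradient reverse block behaves like the commonly used gradient reversal layer: identity in the forward pass, with the gradient negated (and optionally scaled) in the backward pass. The abstract does not specify the block's internals, so the names `GradReverse`, `lam`, and `feature_grad`, as well as the scalar toy model, are hypothetical and are not taken from ADFormer.

```python
class GradReverse:
    """Hypothetical gradient reversal block (an assumption, not ADFormer's code).

    Forward: identity. Backward: multiplies incoming gradients by -lam.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged.
        return list(x)

    def backward(self, grad_out):
        # Flip (and scale) the gradient flowing back toward the features.
        return [-self.lam * g for g in grad_out]


def feature_grad(x, w, v, y, lam):
    """Gradient of a toy domain loss w.r.t. the feature weight w.

    Toy chain: f = w * x  ->  reversal  ->  d = v * h  ->  L = 0.5 * (d - y) ** 2
    """
    grl = GradReverse(lam)
    f = w * x                         # toy "feature extractor"
    (h,) = grl.forward([f])           # identity on the way forward
    d = v * h                         # toy "domain classifier"
    dL_dd = d - y                     # derivative of the squared-error loss
    dL_dh = dL_dd * v                 # chain rule through the classifier
    (dL_df,) = grl.backward([dL_dh])  # sign is flipped here
    return dL_df * x                  # chain rule through the feature extractor
```

In this sketch the domain classifier would still receive its ordinary gradient, while the features upstream of the reversal receive the negated one, so a gradient step that helps the classifier tell domains apart pushes the features toward being harder to tell apart. That opposition is what makes the training adversarial.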