

Poster

Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay · Matthew A Gwilliam · Yosuke Yamaguchi · Vatsal Agarwal · Namitha Padmanabhan · Archana Swaminathan · Tianyi Zhou · Jun Ohya · Abhinav Shrivastava

Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Diffusion models have proven to be state-of-the-art methods for generative tasks. These models train a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. However, text-free diffusion models have typically not been explored for discriminative tasks. In this work, we take a pre-trained unconditional diffusion model and analyze its features post hoc. We find that the intermediate feature maps of the pre-trained U-Net are diverse and have hidden discriminative representation properties. To unleash the potential of these latent properties of diffusion models, we present novel aggregation schemes. First, we propose a novel attention mechanism for pooling feature maps and further leverage this mechanism in DifFormer, a transformer that fuses features from different diffusion U-Net blocks and noise steps. Second, we develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art representation learning methods on discriminative tasks -- image classification with full and semi-supervision, transfer for fine-grained classification, object detection, and semantic segmentation.
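To make the aggregation idea concrete, below is a minimal, illustrative PyTorch sketch of the general recipe the abstract describes: pool each intermediate U-Net feature map with learned attention, then fuse the pooled tokens from several blocks (and, in practice, noise steps) with a small transformer before a classification head. All module names, dimensions, and the classifier head here are hypothetical stand-ins, not the authors' DifFormer implementation.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Pool a spatial feature map into one token via learned attention
    (illustrative stand-in for the paper's attention pooling)."""
    def __init__(self, channels, dim):
        super().__init__()
        self.proj = nn.Linear(channels, dim)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feat):                       # feat: (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.proj(tokens)                 # (B, H*W, dim)
        q = self.query.expand(feat.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, dim)
        return pooled


class FeatureFusion(nn.Module):
    """Fuse pooled tokens from several U-Net blocks / noise steps with a
    small transformer encoder, then classify (hypothetical head)."""
    def __init__(self, channel_list, dim=256, num_classes=1000):
        super().__init__()
        self.pools = nn.ModuleList([AttentionPool(c, dim) for c in channel_list])
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                      # list of (B, C_i, H_i, W_i)
        tokens = torch.cat([p(f) for p, f in zip(self.pools, feats)], dim=1)
        fused = self.encoder(tokens).mean(dim=1)   # (B, dim)
        return self.head(fused)


# Example: three intermediate U-Net feature maps at one noising step
# (shapes chosen arbitrarily for illustration).
feats = [torch.randn(2, 320, 32, 32),
         torch.randn(2, 640, 16, 16),
         torch.randn(2, 1280, 8, 8)]
logits = FeatureFusion([320, 640, 1280])(feats)
print(logits.shape)  # torch.Size([2, 1000])
```

In the paper's setting, the frozen pre-trained diffusion U-Net would supply these intermediate feature maps, and only the lightweight pooling/fusion head would be trained for the downstream discriminative task.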
