

Poster

Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay · Matthew A Gwilliam · Yosuke Yamaguchi · Vatsal Agarwal · Namitha Padmanabhan · Archana Swaminathan · Tianyi Zhou · Jun Ohya · Abhinav Shrivastava

Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Diffusion models have proven to be state-of-the-art methods for generative tasks. These models train a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. However, text-free diffusion models have typically not been explored for discriminative tasks. In this work, we take a pre-trained unconditional diffusion model and analyze its features post hoc. We find that the intermediate feature maps of the pre-trained U-Net are diverse and have hidden discriminative representation properties. To unleash the potential of these latent properties of diffusion models, we present novel aggregation schemes. First, we propose a novel attention mechanism for pooling feature maps and further leverage this mechanism in DifFormer, a transformer that fuses features from different diffusion U-Net blocks and noise steps. Second, we develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art representation learning methods on discriminative tasks -- image classification with full and semi-supervision, transfer for fine-grained classification, object detection, and semantic segmentation.
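To make the aggregation idea concrete, below is a minimal, illustrative PyTorch sketch of the general recipe the abstract describes: pool each intermediate U-Net feature map with learned attention, then fuse the pooled tokens from several blocks (and, in practice, noise steps) with a small transformer before a classification head. All module names, dimensions, and the classifier head here are hypothetical stand-ins, not the authors' DifFormer implementation.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Pool a spatial feature map into one token via learned attention
    (illustrative stand-in for the paper's attention pooling)."""
    def __init__(self, channels, dim):
        super().__init__()
        self.proj = nn.Linear(channels, dim)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feat):                       # feat: (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.proj(tokens)                 # (B, H*W, dim)
        q = self.query.expand(feat.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, dim)
        return pooled


class FeatureFusion(nn.Module):
    """Fuse pooled tokens from several U-Net blocks / noise steps with a
    small transformer encoder, then classify (hypothetical head)."""
    def __init__(self, channel_list, dim=256, num_classes=1000):
        super().__init__()
        self.pools = nn.ModuleList([AttentionPool(c, dim) for c in channel_list])
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                      # list of (B, C_i, H_i, W_i)
        tokens = torch.cat([p(f) for p, f in zip(self.pools, feats)], dim=1)
        fused = self.encoder(tokens).mean(dim=1)   # (B, dim)
        return self.head(fused)


# Example: three intermediate U-Net feature maps at one noising step
# (shapes chosen arbitrarily for illustration).
feats = [torch.randn(2, 320, 32, 32),
         torch.randn(2, 640, 16, 16),
         torch.randn(2, 1280, 8, 8)]
logits = FeatureFusion([320, 640, 1280])(feats)
print(logits.shape)  # torch.Size([2, 1000])
```

In the paper's setting, the frozen pre-trained diffusion U-Net would supply these intermediate feature maps, and only the lightweight pooling/fusion head would be trained for the downstream discriminative task.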
