Poster

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

Junhao Zhang · Mutian Xu · Jay Zhangjie Wu · Chuhui Xue · Wenqing Zhang · Xiaoguang Han · Song Bai · Mike Zheng Shou

2024 Poster


Abstract

This paper studies visual representation learning with diffusion-generated synthetic images. We start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent representation learning methods (i.e., contrastive learning, masked modeling, and vision-language pretraining) on diffusion-generated synthetic data and introduce customized solutions by fully exploiting the aforementioned free attention masks, namely Free-ATM. Comprehensive experiments demonstrate Free-ATM's ability to enhance the performance of various representation learning frameworks when utilizing synthetic data. This improvement is consistent across diverse downstream tasks, including image classification, detection, segmentation, and image-text retrieval. Meanwhile, by utilizing Free-ATM, we can significantly accelerate pretraining on synthetic images and close the performance gap between representation learning on synthetic data and real-world scenarios.
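
To make the idea of a "free" attention mask concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes cross-attention weights are already available as a nested list indexed by head, pixel (row-major), and text token. The function name `attention_to_mask`, the head averaging, the min-max normalization, and the 0.5 threshold are all assumptions chosen for illustration; the abstract does not specify how Free-ATM derives its masks.

```python
from typing import List


def attention_to_mask(
    attn: List[List[List[float]]],
    token_index: int,
    height: int,
    width: int,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Turn cross-attention weights for one text token into a binary mask.

    attn[h][p][t] is the (hypothetical) attention weight of head h at
    pixel p (row-major order) for text token t.
    """
    num_heads = len(attn)
    num_pixels = height * width

    # Average the chosen token's attention over all heads (an assumption).
    avg = [
        sum(attn[h][p][token_index] for h in range(num_heads)) / num_heads
        for p in range(num_pixels)
    ]

    # Min-max normalize to [0, 1]; a completely flat map becomes all zeros.
    lo, hi = min(avg), max(avg)
    span = hi - lo
    norm = [(a - lo) / span if span > 0 else 0.0 for a in avg]

    # Threshold and reshape into a height x width grid.
    return [
        [1 if norm[r * width + c] >= threshold else 0 for c in range(width)]
        for r in range(height)
    ]


# Toy example: 1 head, a 2x2 image, 2 text tokens.
toy_attn = [[[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]]]
mask_token0 = attention_to_mask(toy_attn, token_index=0, height=2, width=2)
mask_token1 = attention_to_mask(toy_attn, token_index=1, height=2, width=2)
```

In this toy input, token 0 attends mostly to the left column and token 1 to the right column, so each token yields a mask covering a different region without any manual annotation.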
