Abstract:
In this paper, we investigate how to adapt Stable Diffusion to downstream visual dense prediction tasks, such as object detection and semantic segmentation, via transfer learning. We focus on fine-tuning the Stable Diffusion model, which has demonstrated impressive abilities in modeling both image details and high-level semantics. Our experiments yield three key insights. First, we demonstrate that for dense prediction tasks, the denoiser of Stable Diffusion can serve as a stronger feature encoder than vision-language models pre-trained with contrastive objectives (e.g., CLIP). Second, we show that the quality of the extracted features is influenced by the diffusion sampling step $t$, the sampling layer, the cross-attention map, the model's generative capacity, and the textual input; features extracted from the upsampling layers of the Stable Diffusion UNet and from earlier denoising steps are more discriminative for transfer to downstream tasks. Third, we find that adapting Stable Diffusion to downstream tasks in a parameter-efficient way is feasible. We search for an effective tuning protocol via reinforcement learning and ultimately match the performance of full fine-tuning while tuning only 0.81\% of Stable Diffusion's parameters.
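To make the feature-extraction setup concrete, the sketch below shows one way to cache the outputs of the Stable Diffusion UNet's upsampling blocks at a chosen denoising step using the Hugging Face `diffusers` library. The checkpoint name, the particular step `t`, and the `extract_features` helper are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion pipeline (checkpoint name is an assumption).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae = pipe.unet, pipe.vae
tokenizer, text_encoder, scheduler = pipe.tokenizer, pipe.text_encoder, pipe.scheduler

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Cache the feature map produced by this upsampling block.
        features[name] = output.detach()
    return hook

# Register hooks on the UNet's up-blocks, whose features the paper finds most discriminative.
for i, block in enumerate(unet.up_blocks):
    block.register_forward_hook(make_hook(f"up_block_{i}"))

@torch.no_grad()
def extract_features(image, prompt="", t=100):
    # image: float tensor of shape (B, 3, H, W), scaled to [-1, 1].
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Perturb the latents to an early denoising step t (smaller t = less noise).
    timestep = torch.tensor([t], device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, timestep)
    # Encode the (possibly empty) text prompt for cross-attention conditioning.
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids.to(latents.device))[0]
    # A single denoising forward pass; the hooks populate `features`.
    unet(noisy_latents, timestep, encoder_hidden_states=text_emb)
    return dict(features)
```

In a transfer-learning setting, the cached multi-scale up-block features would then be fed to a task head (e.g., a segmentation decoder) rather than used to complete the denoising trajectory.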