

Poster

Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

Manyuan Zhang · Guanglu Song · Xiaoyu Shi · Yu Liu · Hongsheng Li

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as object detection and semantic segmentation. We focus on fine-tuning the Stable Diffusion model, which has demonstrated impressive abilities in modeling image details and high-level semantics. Our experiments yield three key insights. Firstly, we demonstrate that for dense prediction tasks, the denoiser of Stable Diffusion can serve as a stronger feature encoder than vision-language models pre-trained with contrastive objectives (e.g., CLIP). Secondly, we show that the quality of the extracted features depends on the diffusion sampling step $t$, the sampling layer, the cross-attention map, the model's generation capacity, and the textual input. Features taken from the upsampling layers of Stable Diffusion's UNet at earlier denoising steps are more discriminative and transfer better to downstream tasks. Thirdly, we find that tuning Stable Diffusion for downstream tasks in a parameter-efficient way is feasible. We search for an effective tuning protocol via reinforcement learning and ultimately match the performance of full fine-tuning while tuning only 0.81\% of Stable Diffusion's parameters.
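
The abstract describes extracting features from the UNet's upsampling layers at a chosen sampling step $t$, with textual conditioning. The sketch below illustrates one way such feature extraction could be set up with the Hugging Face diffusers library; it is not the authors' implementation, and the model id, step value, prompt, and hook placement are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): capture Stable Diffusion
# UNet up-block features for a given image, prompt, and diffusion step t.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Up-block outputs are spatial feature maps that can feed a dense-prediction head.
        features[name] = output
    return hook

for i, block in enumerate(unet.up_blocks):
    block.register_forward_hook(make_hook(f"up_block_{i}"))

@torch.no_grad()
def extract_features(image, prompt="", t=100):
    """image: (B, 3, H, W) tensor normalized to [-1, 1]; t: diffusion step (illustrative)."""
    image = image.half().to("cuda")
    # Encode the image into the VAE latent space.
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Add noise corresponding to step t.
    timestep = torch.tensor([t], device="cuda")
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, timestep)
    # Text conditioning (an empty prompt is allowed).
    text_emb = pipe.encode_prompt(
        prompt, device="cuda", num_images_per_prompt=1,
        do_classifier_free_guidance=False,
    )[0]
    # One denoiser forward pass; the hooks record the up-block feature maps.
    unet(noisy_latents, timestep, encoder_hidden_states=text_emb)
    return dict(features)
```

In this sketch the captured up-block maps would then be resized and fused by a task-specific head (e.g., a detection or segmentation decoder); that part is omitted here.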
