The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and high latency. In this paper, we present \textbf{MobileDiffusion}, an ultra-efficient text-to-image diffusion model obtained through extensive optimizations of both its architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to minimize model size and FLOPs while preserving image generation quality. Additionally, we revisit the advanced diffusion-GAN sampling technique and make one-step sampling compatible with downstream applications trained on the base model. Empirical studies, both quantitative and qualitative, demonstrate the effectiveness of our proposed techniques. Together, they enable MobileDiffusion to achieve instant text-to-image generation on mobile devices, establishing a new state of the art.
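For illustration, the following is a minimal sketch of what one-step sampling looks like in practice, assuming a PyTorch-style UNet that has been diffusion-GAN-finetuned to map noise directly to a clean sample. All names here (\texttt{one\_step\_sample}, the \texttt{unet} call signature, the terminal timestep index) are hypothetical placeholders, not the paper's actual API:
\begin{verbatim}
import torch

@torch.no_grad()
def one_step_sample(unet, text_emb, latent_shape, device="cpu"):
    """Draw a sample in a single UNet forward pass.

    Assumes `unet(x, t, text_emb)` is a diffusion-GAN-finetuned
    denoiser trained to predict the clean sample directly.
    """
    # Start from pure Gaussian noise, as in standard diffusion sampling.
    x_T = torch.randn(latent_shape, device=device)
    # Condition on the terminal timestep; 999 is an illustrative index
    # for a 1000-step noise schedule.
    t = torch.full((latent_shape[0],), 999, device=device)
    # One forward pass replaces the usual multi-step denoising loop.
    return unet(x_T, t, text_emb)
\end{verbatim}
By contrast, a standard multi-step sampler would iterate a call like this tens of times, which is precisely the on-device latency bottleneck the paper targets.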