

Poster

MultiGen: Zero-shot Image Generation from Multi-modal Prompts

Zhi-Fan Wu · Lianghua Huang · Wei Wang · Yanheng Wei · Yu Liu

Strong Double Blind: this paper was not made available on public preprint services during the review process.
Wed 2 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

The field of text-to-image generation has witnessed substantial advancements in recent years, enabling the generation of high-quality images from text prompts alone. However, accurately describing objects through text alone is challenging, which motivates integrating additional modalities such as coordinates and images for more precise image generation. Existing methods often require fine-tuning or support only a single object as the constraint, leaving zero-shot image generation from multi-object, multi-modal prompts an open challenge. In this paper, we propose MultiGen, a novel method designed to address this problem. Given an image-text pair, we obtain object-level text, coordinates, and images, and integrate this information into an "augmented token" for each object. The augmented tokens serve as additional conditions and are trained alongside text prompts in the diffusion model, enabling our model to handle multi-object, multi-modal prompts. To manage the absence of modalities during inference, we leverage a coordinate model and a feature model to generate object-level coordinates and features from text prompts. Consequently, our method can generate images from text prompts alone or from various combinations of multi-modal prompts. Through extensive qualitative and quantitative experiments, we demonstrate that our method not only outperforms existing methods but also enables a wide range of tasks.
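To make the "augmented token" idea concrete, below is a minimal sketch of how per-object text, coordinate, and image features might be fused into one token per object and appended to the text-prompt context for a diffusion model's cross-attention. This is not the paper's implementation; all module names, feature dimensions (768-d features, 4-value normalized boxes), and the simple MLP fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AugmentedTokenBuilder(nn.Module):
    """Hypothetical fusion of object-level text, coordinate, and image
    features into one 'augmented token' per object."""

    def __init__(self, text_dim=768, img_dim=768, token_dim=768):
        super().__init__()
        # Coordinates assumed to be a normalized bounding box (x1, y1, x2, y2).
        self.coord_proj = nn.Linear(4, token_dim)
        self.text_proj = nn.Linear(text_dim, token_dim)
        self.img_proj = nn.Linear(img_dim, token_dim)
        self.fuse = nn.Sequential(
            nn.Linear(3 * token_dim, token_dim),
            nn.SiLU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, obj_text_feat, obj_coords, obj_img_feat):
        # Each input: (num_objects, dim). At inference, missing modalities
        # could be filled by features predicted from the text prompt,
        # analogous to the coordinate and feature models described above.
        parts = [
            self.text_proj(obj_text_feat),
            self.coord_proj(obj_coords),
            self.img_proj(obj_img_feat),
        ]
        return self.fuse(torch.cat(parts, dim=-1))  # (num_objects, token_dim)


# Usage sketch: augmented tokens are concatenated with the prompt tokens
# and passed as extra conditioning context to the diffusion model.
builder = AugmentedTokenBuilder()
text_tokens = torch.randn(1, 77, 768)   # prompt embedding (e.g., CLIP-like)
obj_text = torch.randn(3, 768)          # per-object text features
obj_boxes = torch.rand(3, 4)            # per-object normalized boxes
obj_imgs = torch.randn(3, 768)          # per-object image features
aug_tokens = builder(obj_text, obj_boxes, obj_imgs).unsqueeze(0)
context = torch.cat([text_tokens, aug_tokens], dim=1)  # (1, 77 + 3, 768)
```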
