Recent advances in personalizing text-to-image (T2I) diffusion models have shown that these models can generate images grounded in personalized visual concepts from just a few user-provided examples. However, they often struggle to preserve high visual fidelity, particularly when the scene is modified according to textual descriptions. To address this issue, we present ComFusion, an innovative strategy that leverages pretrained models to compose user-supplied subject images with predefined text scenes. ComFusion incorporates a class-scene prior preservation regularization, which uses composites of subject-class and scene-specific knowledge from pretrained models to boost generation fidelity. Moreover, ComFusion employs coarse generated images to ensure that the outputs remain consistent with both the instance images and the scene texts. ComFusion thus maintains a delicate balance between capturing the subject's essence and preserving scene fidelity. Extensive evaluations of ComFusion against various T2I personalization baselines demonstrate its qualitative and quantitative superiority.
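To make the described training recipe more concrete, the following is a minimal sketch of how a three-term objective of this kind (instance denoising loss, class-scene prior-preservation loss on images pre-generated from composite "class in scene" prompts, and a fusion term on coarse generations) might be assembled on top of a diffusers-style pipeline. The function name, batch layout, and weights `lambda_prior` and `lambda_fusion` are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a ComFusion-style training step (names, batch structure,
# and loss weights are assumptions for illustration only).
import torch
import torch.nn.functional as F

def training_step(unet, noise_scheduler, text_encoder, batch,
                  lambda_prior=1.0, lambda_fusion=0.5):
    """Combine three standard denoising losses:
    (1) "instance": user-provided subject images,
    (2) "class_scene_prior": images pre-generated by the frozen model from
        composite prompts such as "a <class> in <scene>",
    (3) "fusion": coarse generations expected to match both the subject
        identity and the scene text.
    Each batch[key] is assumed to hold VAE latents and tokenized prompts.
    """
    total = 0.0
    for key, weight in [("instance", 1.0),
                        ("class_scene_prior", lambda_prior),
                        ("fusion", lambda_fusion)]:
        latents = batch[key]["latents"]
        prompt_ids = batch[key]["prompt_ids"]

        # Standard diffusion training: add noise at a random timestep
        # and ask the UNet to predict that noise given the text condition.
        noise = torch.randn_like(latents)
        timesteps = torch.randint(
            0, noise_scheduler.config.num_train_timesteps,
            (latents.shape[0],), device=latents.device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        text_emb = text_encoder(prompt_ids)[0]
        pred = unet(noisy_latents, timesteps,
                    encoder_hidden_states=text_emb).sample

        total = total + weight * F.mse_loss(pred, noise)
    return total
```

In this sketch the prior and fusion terms are weighted relative to the instance term, which is one simple way to balance subject fidelity against scene fidelity; the actual weighting and loss formulation used by ComFusion may differ.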