Personalized text-to-image models allow users to generate images of an object (specified by a set of reference images) in varied styles (specified by a text prompt). While remarkable results have been achieved with diffusion-based methods, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based methods usually adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated image and the reference images. To this end, in this paper, we design a novel reinforcement learning framework based on the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models and improve the quality of generated images. Experimental results on personalized text-to-image generation benchmark datasets show that our proposed approach surpasses existing state-of-the-art methods by a large margin in visual fidelity while preserving text alignment.
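To illustrate the general idea (not the paper's actual implementation), the following is a minimal sketch of how a deterministic policy gradient update can let a non-differentiable objective supervise a denoiser: a critic is fit to the black-box score, and the denoiser (treated as a deterministic policy) is updated by backpropagating through the critic. All network shapes, the toy scoring function, and the data are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_dim, hidden = 16, 64

# "Policy": a toy denoiser mapping a noisy latent (state) to a predicted
# clean latent (deterministic action).
policy = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, latent_dim))

# Critic: predicts the scalar objective from (state, action), providing a
# differentiable surrogate for a possibly non-differentiable score.
critic = nn.Sequential(nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)

def black_box_score(action, reference):
    # Stand-in for a non-differentiable structural-consistency metric,
    # evaluated without gradients as an external scorer would be.
    with torch.no_grad():
        return -(action - reference).abs().mean(dim=1, keepdim=True)

for step in range(1000):
    reference = torch.randn(32, latent_dim)                  # toy reference latents
    state = reference + 0.5 * torch.randn_like(reference)    # noisy latents

    # 1) Critic regression: fit the critic to the observed scores.
    with torch.no_grad():
        action = policy(state)
    reward = black_box_score(action, reference)
    q = critic(torch.cat([state, action], dim=1))
    critic_loss = nn.functional.mse_loss(q, reward)
    opt_q.zero_grad(); critic_loss.backward(); opt_q.step()

    # 2) Deterministic policy gradient: push the policy's output toward
    #    higher predicted score by backpropagating through the critic.
    action = policy(state)
    policy_loss = -critic(torch.cat([state, action], dim=1)).mean()
    opt_pi.zero_grad(); policy_loss.backward(); opt_pi.step()
```

In practice, such a surrogate-critic update would typically be combined with the standard diffusion reconstruction loss rather than replacing it; the sketch only shows how a gradient signal can be obtained from an objective that cannot itself be differentiated.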