Subject-driven generation for text-to-image diffusion models aims to encode and invert specific concepts into textual prompts in order to generate personalized images with particular content. Previous studies achieved this goal by using either an optimization-based textual-inversion strategy or a direct-regression-based concept-encoding strategy. However, challenges remain in realizing fast and effective prompt inversion while preserving the generalization ability of the original diffusion models. Motivated by the advantages of both optimization-based and direct-regression-based methods, in this study we propose a novel hybrid prompt inversion framework called ~\name~ to achieve efficient subject-driven generation with text-to-image diffusion models. Specifically, we address the limitations of current optimization-based and direct-regression-based methods by designing a hybrid prompt inversion framework and combining it with a mask-guided multi-word text encoding module to enable fast and robust prompt inversion. In addition, we introduce a hybrid textual feature fusion module to enhance the representation of the textual features during learning. As a result, our framework can invert arbitrary visual concepts into a pre-trained diffusion model effectively and quickly, even when learning from a single image, while maintaining the general generation ability of the original model. Extensive experiments demonstrate the effectiveness of our method.
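To make the hybrid idea concrete, below is a minimal, self-contained sketch (plain PyTorch) of the general two-stage scheme the abstract alludes to: a direct-regression encoder predicts an initial concept token embedding from the subject image, and a short optimization loop then refines that embedding against a denoising loss, textual-inversion style. All names here (SubjectEncoder, toy_denoiser), the toy shapes, and the objective are illustrative assumptions for exposition, not the paper's actual modules.

```python
import torch
import torch.nn as nn

EMB_DIM = 768  # typical CLIP text-embedding width; an assumption here


class SubjectEncoder(nn.Module):
    """Hypothetical regression encoder: subject image -> initial concept token embedding."""

    def __init__(self, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, emb_dim),
        )

    def forward(self, image):
        return self.net(image)


def toy_denoiser(noisy_latent, concept_embedding):
    """Stand-in for a frozen diffusion U-Net conditioned on the concept token."""
    cond = concept_embedding.mean(dim=-1, keepdim=True)
    return noisy_latent * torch.tanh(cond).view(-1, 1, 1, 1)


def hybrid_inversion(image, latent, noise, steps=20, lr=1e-2):
    # Stage 1: direct regression gives a fast initial embedding.
    encoder = SubjectEncoder()
    emb = encoder(image).detach().clone().requires_grad_(True)

    # Stage 2: a few optimization steps refine the embedding against a
    # simple denoising objective, as in optimization-based textual inversion.
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        pred = toy_denoiser(latent + noise, emb)
        loss = nn.functional.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()


if __name__ == "__main__":
    image = torch.randn(1, 3, 64, 64)   # dummy subject image
    latent = torch.randn(1, 4, 8, 8)    # dummy diffusion latent
    noise = torch.randn_like(latent)
    emb = hybrid_inversion(image, latent, noise)
    print(emb.shape)  # torch.Size([1, 768])
```

The regression stage avoids starting the optimization from scratch, which is what makes the refinement loop short; this is the generic motivation for hybrid schemes, and the paper's specific encoding and fusion modules are not reproduced here.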