We present a method to create diffusion-based Video models from pretrained Text-to-Image (T2I) models, overcoming limitations of existing methods. We propose a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. We demonstrate superior performance through extensive experiments and comparisons.
Live content is unavailable. Log in and register to view live content