Poster

HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon · Seoung Wug Oh · Yang Zhou · Joon-Young Lee · Difan Liu · Haoran Cai · Baqiao Liu · Feng Liu · Youngjung Uh

Strong blind review: This paper was not made available on public preprint services during the review process

2024 Poster

Abstract

We present a method for building diffusion-based video models from pretrained Text-to-Image (T2I) models, overcoming limitations of existing methods. We propose an architecture that incorporates a mapping network and frame-wise tokens, tailored for video generation while preserving the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together yield realistic and temporally consistent videos. Because our method is built on a frozen Stable Diffusion model, it simplifies training and integrates directly with off-the-shelf models such as ControlNet and DreamBooth. We demonstrate superior performance through extensive experiments and comparisons.
