Recent advances in diffusion models have shed light on text-to-video (T2V) and image-to-video (I2V) generation. Within this line of work, pose-driven video generation conditioned on a reference image has also gained attention, demonstrating the capability to synthesize realistic human dance videos. However, previous methods leave several challenges unresolved. First, the network that encodes the pose information is fine-tuned on pose videos from the target domain and therefore generalizes poorly to diverse poses. Second, because the models are driven by the provided pose videos, the results inevitably depend on the performance of the off-the-shelf pose detector. In this paper, we present a pose-driven video generation method with a reference image that mitigates these issues. Unlike previous methods, we use a pretrained ControlNet without fine-tuning, leveraging the knowledge it has already acquired from a vast number of pose-image-caption pairs. To keep the ControlNet frozen, we introduce a correspondence layer that enables the network to learn the correspondence between pose and appearance features. Additionally, by adding a temporal layer to the ControlNet, we enhance robustness to pose detector outliers. Extensive experiments demonstrate that the proposed method achieves promising results on video synthesis tasks encompassing diverse poses.
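To make the correspondence-layer idea concrete, below is a minimal sketch (not the authors' code) of a cross-attention block that lets pose features from a frozen ControlNet attend to appearance features from a reference image, while only the new layer is trained. All module names, feature dimensions, and the zero-initialized output projection are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CorrespondenceLayer(nn.Module):
    """Cross-attention from pose features (queries) to reference appearance
    features (keys/values); the surrounding ControlNet stays frozen."""

    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized projection so the frozen ControlNet's behavior is
        # unchanged at the start of training (a ControlNet-style choice,
        # assumed here for illustration).
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, pose_feat: torch.Tensor, appearance_feat: torch.Tensor) -> torch.Tensor:
        # pose_feat:       (B, N_pose, C) tokens from the frozen ControlNet
        # appearance_feat: (B, N_ref,  C) tokens from the reference-image encoder
        q = self.norm(pose_feat)
        attended, _ = self.attn(q, appearance_feat, appearance_feat)
        # Residual connection keeps the original pose features intact.
        return pose_feat + self.out_proj(attended)


if __name__ == "__main__":
    layer = CorrespondenceLayer(dim=320)
    pose = torch.randn(2, 64 * 64, 320)       # flattened pose feature map
    reference = torch.randn(2, 64 * 64, 320)  # flattened appearance feature map
    out = layer(pose, reference)
    print(out.shape)  # torch.Size([2, 4096, 320])
```

Because the output projection starts at zero, the block initially acts as an identity on the pose features, so training the correspondence layer does not disturb the frozen ControlNet's pretrained behavior.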