In this study, we introduce a methodology for human image animation that uses a 3D human parametric model within a latent diffusion framework to improve shape alignment and motion guidance in current human generative techniques. Our method adopts the SMPL model as the 3D human parametric model to provide a unified representation of body shape and pose, which makes it possible to capture intricate human geometry and motion characteristics from source videos and supports the generation of dynamic human videos. Specifically, we incorporate rendered depth images, normal maps, and semantic maps derived from the SMPL sequences, along with skeleton-based motion guidance, to enrich the conditions of the latent diffusion model with comprehensive 3D shape information and detailed pose attributes. We weight the shape and motion latent representations with self-attention in the spatial domain, and use a multi-layer semantic fusion of these representations as the conditioning signal of the latent diffusion model for human image animation. Extensive experiments on various datasets verify the effectiveness and versatility of our methodology, demonstrating its ability to generate high-quality human animations that accurately capture both pose and shape variations.
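The fusion of guidance conditions described above can be illustrated with a small sketch. It is not the implementation used in this work: it assumes each condition (depth, normal, semantic, skeleton) has already been encoded into a feature map flattened to a list of per-position vectors, applies a single-head spatial self-attention with identity projections to each map, and fuses the results by element-wise summation in a single layer. The function names `spatial_self_attention` and `fuse_guidance`, the identity projections, the summation rule, and the toy values are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_self_attention(tokens):
    """Single-head self-attention across spatial positions.

    tokens: list of N feature vectors (one per spatial position), each of length D.
    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in tokens:
        # Scaled dot-product score of this position against every position.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output vector is a convex combination of all position vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return attended


def fuse_guidance(guidance):
    """Attend over each condition's feature map, then sum the results.

    guidance: dict mapping a condition name to its N x D feature map.
    Assumption: fusion is a plain element-wise sum in a single layer.
    """
    maps = list(guidance.values())
    n, d = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * d for _ in range(n)]
    for feature_map in maps:
        attended = spatial_self_attention(feature_map)
        for i in range(n):
            for j in range(d):
                fused[i][j] += attended[i][j]
    return fused


# Toy feature maps: 4 spatial positions, 2 channels per condition (made-up values).
guidance = {
    "depth": [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.2, 0.2]],
    "normal": [[0.4, 0.0], [0.1, 0.1], [0.2, 0.3], [0.0, 0.1]],
    "semantic": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    "skeleton": [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.5, 0.5]],
}
fused = fuse_guidance(guidance)
```

Here `fused` has the same spatial size and channel count as each input map, so it could be added to the input features of a denoising network as a conditioning signal. A multi-layer variant would repeat the attention and fusion steps at several feature resolutions.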