Co-speech gesture video generation is an enabling technique for numerous digital human applications in the post-ChatGPT era. While substantial progress has been made in creating high-quality talking-head videos, existing hand gesture video generation methods remain largely limited by the widely adopted 2D skeleton-based gesture representation and still struggle to generate realistic hands. We propose a novel end-to-end audio-driven co-speech video generation pipeline that synthesizes human speech videos from a 3D human mesh-based gesture representation. Building on this representation, we present a mesh-grounded video generator that combines a mesh texture-map optimization step with a new conditional GAN-based network, and outputs photorealistic gesture videos with realistic hands. Experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over a baseline that uses a 2D skeleton-based representation.
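The abstract describes the generator only at a high level: a textured 3D human mesh rendering conditions a GAN-based network that produces the final photorealistic frame. As a rough illustration of that mesh-conditioned image-to-image translation step, the sketch below follows a generic pix2pix-style encoder-decoder design; the class name, layer choices, and resolutions are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the paper's exact network is not specified in the
# abstract, so this uses a generic pix2pix-style conditional generator.

class MeshConditionedGenerator(nn.Module):
    """Maps a rendered (textured) 3D mesh frame to a photorealistic frame."""

    def __init__(self, in_channels: int = 3, out_channels: int = 3, width: int = 64):
        super().__init__()
        # Downsample the mesh render into a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, width, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(width * 2),
            nn.ReLU(inplace=True),
        )
        # Upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1),
            nn.InstanceNorm2d(width),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized video frames
        )

    def forward(self, mesh_render: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(mesh_render))


if __name__ == "__main__":
    # A random tensor stands in for a rendered, texture-optimized mesh frame.
    fake_render = torch.randn(1, 3, 256, 256)
    frame = MeshConditionedGenerator()(fake_render)
    print(frame.shape)  # torch.Size([1, 3, 256, 256])
```

In such a design, the mesh render already encodes pose, hand articulation, and coarse appearance (via the optimized texture map), so the generator only has to bridge the remaining gap to photorealism, which is why mesh-based conditioning can yield more realistic hands than sparse 2D skeletons.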