We introduce the \textbf{M}ulti-\textbf{M}otion \textbf{D}iscrete \textbf{D}iffusion \textbf{M}odel (M2D2M), a novel approach for generating human motion from action descriptions that leverages the strengths of discrete diffusion models. M2D2M addresses the challenge of generating multi-motion sequences, ensuring seamless transitions between motions and coherence across a series of actions. Its strength lies in a dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, enabling nuanced and context-sensitive human motion generation. Complemented by a two-phase sampling strategy comprising independent and joint denoising steps, M2D2M generates long-term, smooth, and contextually coherent human motion sequences using a model trained only for single-motion generation. Extensive experiments show that M2D2M surpasses state-of-the-art benchmarks on text-to-motion generation tasks, demonstrating its ability to interpret language semantics and generate dynamic, realistic motions.
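To make the idea of proximity-adapted transition probabilities concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: it builds a row-stochastic discrete-diffusion transition matrix in which the corruption mass at step $t$ is spread over other motion tokens in proportion to their closeness in a hypothetical codebook embedding space. The function name, the Euclidean distance metric, the temperature `tau`, and the per-step rate `beta_t` are all assumptions introduced for illustration.

```python
# Illustrative sketch (assumed formulation, not the authors' exact method):
# a K x K transition matrix for a discrete diffusion step whose off-diagonal
# probability mass is biased toward "nearby" motion tokens, with proximity
# measured by Euclidean distance between hypothetical codebook embeddings.
import numpy as np

def proximity_transition_matrix(codebook: np.ndarray, beta_t: float, tau: float = 1.0) -> np.ndarray:
    """Return a K x K row-stochastic matrix Q_t for K motion tokens.

    Each row i keeps probability (1 - beta_t) on token i and spreads the
    remaining beta_t over the other tokens in proportion to exp(-dist(i, j) / tau).
    """
    K = codebook.shape[0]
    # Pairwise distances between codebook vectors (proxy for token proximity).
    dists = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=-1)
    weights = np.exp(-dists / tau)
    np.fill_diagonal(weights, 0.0)                     # off-diagonal mass only
    weights /= weights.sum(axis=1, keepdims=True)      # normalize each row
    Q = (1.0 - beta_t) * np.eye(K) + beta_t * weights  # row-stochastic by construction
    return Q

# Example: 8 motion tokens with 4-d embeddings, 10% corruption at this step.
rng = np.random.default_rng(0)
Q_t = proximity_transition_matrix(rng.normal(size=(8, 4)), beta_t=0.1)
assert np.allclose(Q_t.sum(axis=1), 1.0)
```

Under this sketch, tokens representing similar poses absorb most of the corruption probability, which is one way a "dynamic" transition schedule could encourage smoother, more context-sensitive denoising than a uniform corruption kernel.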