Recent generative models excel at creating high-quality single-human images but struggle in complex multi-human scenarios, where they fail to capture accurate structural details such as quantity, identity, layout, and posture. We introduce a novel approach, Chains, which expands an initial text prompt into detailed human conditions through a step-by-step process. Chains uses a series of condition nodes (text, quantity, layout, skeleton, and 3D mesh), each undergoing an independent diffusion process. This enables high-quality human generation and advanced scene layout management in diffusion models. We evaluate Chains on a new benchmark for complex multi-human scene synthesis, where it outperforms existing methods in both human quality and scene accuracy. Remarkably, Chains achieves this in under 0.45 seconds for a 20-step inference, demonstrating both effectiveness and efficiency.