An overarching goal for computer-aided perception systems is the holistic understanding of the human-centric 3D world, including faithful reconstructions of humans, scenes, and their global spatial relationships. While monocular 3D reconstruction has recently made strong progress on footage of either humans or scenes alone, jointly reconstructing both humans and scenes together with their global spatial relationships remains an unsolved challenge. To address this challenge, we introduce a novel and unified framework that achieves temporally and spatially coherent 3D reconstruction of a static scene with dynamic humans from a monocular RGB video. Specifically, we parameterize temporally consistent canonical human models and a static scene representation using two neural fields in a shared 3D space. We further develop a global optimization framework that enforces physical constraints arising from potential human-scene interpenetration and occlusion. Compared to independent reconstructions, our framework yields detailed and holistic geometry reconstructions of both humans and scenes. Moreover, we introduce a new synthetic dataset for quantitative evaluation. Extensive experiments and ablation studies on both real-world and synthetic videos demonstrate the efficacy of our framework for monocular human-scene reconstruction. Code and data are publicly available on our \href{https://lxxue.github.io/human-scene-recon/}{project page}.
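To make the interpenetration constraint concrete, the sketch below shows one common way such a penalty can be expressed: sample points on the human surface, query the scene's signed distance field (SDF), and penalize points that fall inside the scene (negative SDF). This is a minimal illustration under assumed names (`interpenetration_loss`, `scene_sdf`) and is not the paper's exact formulation.

```python
import numpy as np

def interpenetration_loss(human_points, scene_sdf):
    """Penalize human surface points that penetrate the scene geometry.

    human_points: (N, 3) array of points sampled on the human surface.
    scene_sdf: callable mapping (N, 3) points to signed distances,
               with negative values indicating points inside the scene.
    """
    d = scene_sdf(human_points)                 # (N,) signed distances
    # Hinge on penetration depth: only points with d < 0 contribute.
    return float(np.mean(np.maximum(-d, 0.0) ** 2))

# Toy scene: a ground plane at z = 0, whose SDF is simply the z-coordinate.
plane_sdf = lambda p: p[:, 2]
pts = np.array([[0.0, 0.0, 0.5],    # above the plane: no penalty
                [0.0, 0.0, -0.2]])  # 0.2 below the plane: penalized
loss = interpenetration_loss(pts, plane_sdf)
```

In a full pipeline, a differentiable version of this term (e.g. in PyTorch) would be added to the reconstruction objective so that gradients push the human geometry out of the scene during the global optimization.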