We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation problem. This task involves generating first-person (egocentric) view images from third-person (exocentric) images. Leveraging the photorealistic generation capabilities of diffusion models, we propose a transformer-based diffusion model that incorporates geometry priors through two proposed mechanisms: (i) egocentric prior rendering and (ii) 3D-aware rotary cross-attention. The former integrates egocentric layout cues through point cloud rasterization, while the latter incorporates exocentric semantic features by guiding attention between diffusion model feature maps and exocentric semantic features according to their geometric relationships. Our experiments on the challenging and diverse Ego-Exo4D multiview dataset demonstrate superior performance compared to state-of-the-art approaches. Notably, our approach exhibits robust generalization to novel environments not encountered during training. The code and pretrained models will be made public.
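To make the second mechanism concrete, below is a minimal, illustrative sketch (not the authors' code) of how rotary position embeddings derived from 3D point coordinates can modulate cross-attention between diffusion feature maps (queries) and exocentric semantic features (keys/values). The function names `rope_3d` and `geometry_aware_cross_attention`, the frequency schedule, and the single-head formulation are assumptions made for illustration only.

```python
# Hedged sketch of 3D-aware rotary cross-attention: queries and keys are
# rotated by angles derived from their associated 3D coordinates, so the
# attention logits depend on the relative geometry of the two viewpoints.
# All names and hyperparameters here are illustrative assumptions.
import math
import torch


def rope_3d(x: torch.Tensor, coords: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """Rotate feature pairs by angles proportional to 3D coordinates.

    x:      (B, N, D) token features, with D divisible by 6 (3 axes x sin/cos pairs)
    coords: (B, N, 3) 3D position associated with each token
    """
    B, N, D = x.shape
    d_axis = D // 3                      # feature dims allotted to each axis
    half = d_axis // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)

    out = []
    for axis in range(3):
        xa = x[..., axis * d_axis:(axis + 1) * d_axis]       # (B, N, d_axis)
        ang = coords[..., axis:axis + 1] * freqs              # (B, N, half)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xa[..., :half], xa[..., half:]
        out.append(torch.cat([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)


def geometry_aware_cross_attention(q_feat, q_xyz, kv_feat, kv_xyz):
    """Single-head cross-attention where queries (diffusion features) and keys
    (exocentric semantic features) are rotated by their 3D positions."""
    q = rope_3d(q_feat, q_xyz)
    k = rope_3d(kv_feat, kv_xyz)
    attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]), dim=-1)
    return attn @ kv_feat                # values are left unrotated


if __name__ == "__main__":
    B, Nq, Nk, D = 2, 64, 128, 96        # D divisible by 6
    ego = geometry_aware_cross_attention(
        torch.randn(B, Nq, D), torch.rand(B, Nq, 3),
        torch.randn(B, Nk, D), torch.rand(B, Nk, 3))
    print(ego.shape)                      # torch.Size([2, 64, 96])
```

In this sketch the rotation uses the common "split-half" rotary variant per coordinate axis; the paper's actual formulation, multi-head layout, and how 3D coordinates are obtained (e.g., from the rendered egocentric prior) may differ.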