Poster
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Heeseung Yun · Ruohan Gao · Ishwarya Ananthabhotla · Anurag Kumar · Jacob Donley · Chao Li · Gunhee Kim · Vamsi Krishna Ithapu · Calvin Murdock
# 281
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to the wearer’s behaviors. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of the wearer’s head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive image-to-sphere projections. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including active speaker localization in noisy conversations, audio-based spherical sound source localization, and behavior anticipation in everyday activities.
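The core idea of world-locking — re-expressing head-locked measurements in a frame that stays fixed as the wearer's head rotates — can be sketched in a few lines. This is a minimal illustration, not the paper's architecture: it assumes the head orientation is available as a rotation matrix (e.g. from the headset's IMU) and only handles rotation about the vertical axis; the function names are invented for this example.

```python
import numpy as np

def rot_z(yaw):
    """Rotation about the vertical axis by `yaw` radians (head turning left/right)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def world_lock(dirs_head, R_head):
    """Re-express head-locked unit direction vectors in a world-locked frame.

    dirs_head: (N, 3) directions observed in the wearer's head frame.
    R_head:    (3, 3) rotation mapping head-frame coordinates to world frame.
    Returns (N, 3) directions that remain constant as the head rotates,
    offsetting the apparent motion induced by the wearer's self-motion.
    """
    return dirs_head @ R_head.T  # row-wise R_head @ d

# A sound source fixed in the world appears at different head-locked
# directions as the wearer turns, but its world-locked direction is invariant.
source_world = np.array([[0.0, 1.0, 0.0]])
for yaw in (0.0, np.pi / 4, np.pi / 2):
    R = rot_z(yaw)
    observed_head = source_world @ R          # what the head-locked sensors see
    recovered = world_lock(observed_head, R)  # invariant to head rotation
    assert np.allclose(recovered, source_world)
```

In the paper's setting the same compensation is applied implicitly to multisensory embeddings on a sphere rather than to raw direction vectors, but the invariance property demonstrated here is what enables improved spatial synchronization between modalities.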