This paper explores the spatio-temporal video grounding (STVG) task, which aims to localize a particular object corresponding to a given textual description in an untrimmed video. Existing approaches mainly resort to object-level manual annotations as supervision for this challenging task, a paradigm that heavily constrains scalability to large-scale unlabeled data. To this end, we present a novel framework that grounds the target object relying only on video-sentence correspondence. Specifically, our model reformulates the original STVG task as two cross-modal alignment sub-problems: region-phrase and frame-sentence. Since ground-truth alignments are absent during training, we treat them as latent variables and learn to model the joint conditional distribution by reconstructing the interactions of entities in the video. The entire framework can be effectively optimized with the variational Expectation-Maximization (EM) algorithm, which alternates between two updating steps that progressively maximize the likelihood of the query sentence, thereby approximating the true cross-modal assignments. Extensive experiments on two video benchmarks (VidSTG and HC-STVG) demonstrate the effectiveness of the proposed method.
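To make the alternation concrete, the following is a minimal sketch of a variational EM loop over the two latent alignments, assuming pre-extracted region, frame, phrase, and sentence features. The module names, feature shapes, and the similarity-based surrogate objective are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch (assumptions, not the paper's implementation): latent region-phrase and
# frame-sentence alignments are inferred in the E-step, and the projection parameters are
# updated in the M-step under the fixed posteriors using only video-sentence correspondence.
import torch
import torch.nn.functional as F

D = 256                    # shared embedding dimension (assumed)
T, R, P = 20, 10, 4        # frames, region proposals per frame, phrases in the query (assumed)

proj_v = torch.nn.Linear(D, D)   # visual projection (parameters updated in the M-step)
proj_t = torch.nn.Linear(D, D)   # textual projection
opt = torch.optim.Adam(list(proj_v.parameters()) + list(proj_t.parameters()), lr=1e-4)

def e_step(region_feats, phrase_feats, frame_feats, sent_feat):
    """E-step: infer posteriors over the latent alignments given the current parameters."""
    with torch.no_grad():
        # region-phrase alignment: posterior over regions for each phrase in every frame
        rp = torch.einsum('trd,pd->tpr', proj_v(region_feats), proj_t(phrase_feats))
        q_region = F.softmax(rp, dim=-1)                    # (T, P, R)
        # frame-sentence alignment: posterior over frames for the whole sentence
        fs = proj_v(frame_feats) @ proj_t(sent_feat)        # (T,)
        q_frame = F.softmax(fs, dim=0)                      # (T,)
    return q_region, q_frame

def m_step(region_feats, phrase_feats, frame_feats, sent_feat, q_region, q_frame):
    """M-step: update parameters to increase the expected video-sentence likelihood,
    approximated here by a reconstruction-style similarity under the fixed posteriors."""
    attended = torch.einsum('tpr,trd->tpd', q_region, proj_v(region_feats))   # (T, P, D)
    phrase_score = (attended * proj_t(phrase_feats).unsqueeze(0)).sum(-1)     # (T, P)
    # only the video-sentence pair supervises: weight frames by the frame-sentence posterior
    loss = -(q_frame.unsqueeze(1) * phrase_score).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy alternation on random features standing in for detector / language-model outputs
region_feats = torch.randn(T, R, D)
phrase_feats = torch.randn(P, D)
frame_feats = region_feats.mean(dim=1)
sent_feat = phrase_feats.mean(dim=0)

for step in range(5):
    q_region, q_frame = e_step(region_feats, phrase_feats, frame_feats, sent_feat)
    m_step(region_feats, phrase_feats, frame_feats, sent_feat, q_region, q_frame)
```

In practice the M-step objective would be regularized, e.g., by contrasting matched against mismatched video-sentence pairs, rather than maximizing raw similarity as in this toy surrogate.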