Poster
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Zixu Cheng · Yujiang Pu · Shaogang Gong · Parisa Kordjamshidi · Yu Kong
# 64
Strong Double Blind |
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner, utilizing contrastive learning to achieve compositional generalization. However, these approaches have significant drawbacks in constructing negative queries: (1) Only dominant verbs and nouns are considered, overlooking the impact of other primitives such as prepositions and adverbs. (2) These negative samples are formed through random sampling and recombination, resulting in numerous semantically implausible combinations and forcing the model to learn unrealistic semantic differences. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks, Charades-CG and ActivityNet-CG, validate the effectiveness and generalizability of our proposed method. Our code will be available soon.
Live content is unavailable. Log in and register to view live content