

Poster

KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval

Xianwei Zhuang · Hongxiang Li · Xuxin Cheng · Zhihong Zhu · Yuxin Xie · Yuexian Zou

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Wed 2 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Existing video-text retrieval methods predominantly focus on designing diverse cross-modal interaction mechanisms between captions and videos. However, these approaches diverge from human learning paradigms, in which humans seek and associate knowledge from an open set rather than rote-memorizing all text-video instances. Motivated by this, we attempt to decouple knowledge from retrieval models through multi-grained knowledge stores and identify two significant benefits of our knowledge-decoupling strategy: (1) it ensures a balance between knowledge memorization and retrieval optimization, thereby improving retrieval performance; and (2) it facilitates the incorporation of diverse open-world knowledge to augment video-text retrieval. To efficiently integrate information from the knowledge stores, we further introduce a novel retrieval framework, termed KDProR, which is optimized with our proposed Expectation-Knowledge-Maximization (EKM) algorithm. Specifically, in the E-step, KDProR obtains relevant contextual semantics from the knowledge stores and achieves efficient knowledge injection through interpolation and alignment correction. In the K-step, KDProR computes a knowledge kNN distribution by indexing the Top-K retrieved knowledge entries to refine the retrieval distribution, and in the M-step, it optimizes the retrieval model by maximizing the likelihood objective. Extensive experiments on various benchmarks demonstrate that KDProR significantly outperforms previous state-of-the-art methods across all metrics. Remarkably, KDProR can uniformly and efficiently incorporate diverse open-world knowledge and is compatible with different interaction mechanisms and architectures.
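
To make the K-step concrete, the sketch below illustrates one plausible reading of the abstract: a model retrieval distribution over candidate videos is refined by a kNN distribution built from Top-K entries of a decoupled knowledge store, and the two are interpolated. This is not the authors' implementation; the function name, the `knowledge_store` layout, the store-to-candidate mapping, and the `lam`/`temperature` hyperparameters are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of refining a retrieval
# distribution with a Top-K knowledge kNN distribution.
import torch
import torch.nn.functional as F

def knn_refined_distribution(query_emb, candidate_embs, knowledge_store,
                             store_to_candidate, top_k=16, lam=0.5,
                             temperature=0.07):
    """Interpolate the model's retrieval distribution with a kNN distribution
    derived from a decoupled knowledge store.

    query_emb:          (d,)   L2-normalized caption embedding
    candidate_embs:     (N, d) L2-normalized candidate video embeddings
    knowledge_store:    (M, d) L2-normalized stored knowledge embeddings
    store_to_candidate: (M,)   hypothetical index mapping each store entry
                               to the candidate it supports
    """
    # Model retrieval distribution from cosine similarities.
    sims = candidate_embs @ query_emb                       # (N,)
    p_model = F.softmax(sims / temperature, dim=-1)

    # Top-K nearest knowledge entries to the query.
    k_sims = knowledge_store @ query_emb                    # (M,)
    topk_sims, topk_idx = k_sims.topk(top_k)

    # Scatter the Top-K knowledge weights onto candidates -> kNN distribution.
    weights = F.softmax(topk_sims / temperature, dim=-1)    # (top_k,)
    p_knn = torch.zeros_like(p_model).scatter_add_(
        0, store_to_candidate[topk_idx], weights)

    # Interpolated (refined) retrieval distribution, which an M-step could
    # then optimize with a likelihood objective.
    return lam * p_model + (1.0 - lam) * p_knn
```

Under this reading, `lam` trades off the parametric retrieval model against the non-parametric knowledge store, which is one way the framework could balance retrieval optimization with memorized open-world knowledge.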
