Video summarization plays a vital role in improving video browsing efficiency and has a range of applications in action recognition and information retrieval. To generate summaries that capture key information, existing works jointly explore the contributions of long-range and short-range temporal cues. However, they rarely consider the potential correspondence between temporal cues of different granularities within a video sequence, which limits detailed video understanding. To address this issue, we propose Bgm4Video, a novel video summarization framework based on a graph-matching mechanism that models the contextualized relationships across multi-granularity temporal cues. The framework consists of two main components: (i) a temporal encoder (TE) that captures both coarse-grained and fine-grained contextual information within videos, and (ii) a bidirectional graph transmission (BGT) module that models the interrelationships across multi-granularity temporal cues. By capturing this contextual correspondence, our method further refines the temporal representations to precisely pinpoint valuable segments. We demonstrate the advantage of each component through an extensive ablation study, and we evaluate the full approach on the video summarization task, showing improvements over the state-of-the-art on popular benchmarks.
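The abstract does not specify implementation details, so the following is only a minimal sketch of the two-component idea it describes: a temporal encoder producing fine-grained (frame-level) and coarse-grained (segment-level) cues, and a bidirectional module that exchanges information between the two granularities through a soft correspondence matrix. All module choices, dimensions, and the soft-matching formulation below are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Hypothetical TE: frame-level (fine) and segment-level (coarse) temporal cues."""

    def __init__(self, dim=256, segment_len=8):
        super().__init__()
        self.segment_len = segment_len
        self.fine = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.coarse = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frames):                      # frames: (B, T, dim)
        fine = self.fine(frames)                    # fine-grained, per-frame cues
        B, T, D = fine.shape
        segs = fine.view(B, T // self.segment_len, self.segment_len, D).mean(2)
        coarse, _ = self.coarse(segs)               # coarse-grained, per-segment cues
        return fine, coarse


class BidirectionalGraphTransmission(nn.Module):
    """Sketch of bidirectional cross-granularity message passing: a soft matching
    matrix between frame and segment nodes routes information in both directions."""

    def __init__(self, dim=256):
        super().__init__()
        self.fuse_fine = nn.Linear(2 * dim, dim)
        self.fuse_coarse = nn.Linear(2 * dim, dim)

    def forward(self, fine, coarse):
        # Soft correspondence between every frame node and every segment node.
        match = torch.softmax(
            fine @ coarse.transpose(1, 2) / fine.size(-1) ** 0.5, dim=-1
        )                                            # (B, T, S)
        fine_msg = match @ coarse                    # coarse -> fine messages
        coarse_msg = match.transpose(1, 2) @ fine    # fine -> coarse messages
        fine = self.fuse_fine(torch.cat([fine, fine_msg], dim=-1))
        coarse = self.fuse_coarse(torch.cat([coarse, coarse_msg], dim=-1))
        return fine, coarse


if __name__ == "__main__":
    frames = torch.randn(2, 64, 256)                 # toy batch of frame features
    te, bgt = TemporalEncoder(), BidirectionalGraphTransmission()
    fine, coarse = bgt(*te(frames))
    scores = torch.sigmoid(nn.Linear(256, 1)(fine)).squeeze(-1)  # per-frame importance
    print(scores.shape)                              # torch.Size([2, 64])
```

In this toy setup, the refined frame-level features are scored for importance and a summary would be assembled from the highest-scoring segments; how Bgm4Video actually performs matching, refinement, and segment selection is described in the full paper.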