

Poster

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Yue Fan · Xiaojian Ma · Rujie Wu · Yuntao Du · Jiaqi Li · Zhi Gao · Qing Li

Wed 2 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, improving over baselines by an average of 6.6% on NExT-QA and 26.0% on EgoSchema. The code will be released to the public.
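The two-part memory described above (temporal event descriptions plus object-centric tracking states) and the two tools that query it can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: all class, method, and tool names here are assumptions, and the keyword-matching "tool selection" stands in for the LLM's zero-shot tool-use step.

```python
from dataclasses import dataclass, field

@dataclass
class VideoMemory:
    # Temporal memory: (start_sec, end_sec, caption) per video segment.
    events: list = field(default_factory=list)
    # Object memory: object_id -> list of (frame_idx, bbox) tracking states.
    tracks: dict = field(default_factory=dict)

    def add_event(self, start, end, caption):
        self.events.append((start, end, caption))

    def add_track_state(self, obj_id, frame_idx, bbox):
        self.tracks.setdefault(obj_id, []).append((frame_idx, bbox))

    # Tool 1 (hypothetical): video segment localization.
    # Here a simple keyword match over captions stands in for the
    # learned retrieval the paper would use.
    def localize_segment(self, query):
        q = query.lower()
        return [(s, e, c) for (s, e, c) in self.events if q in c.lower()]

    # Tool 2 (hypothetical): object memory querying by object id.
    def query_object(self, obj_id):
        return self.tracks.get(obj_id, [])

def solve(memory, task_query):
    """Toy stand-in for the agent loop: an LLM would pick a tool
    zero-shot; here we route on a trivial heuristic instead."""
    if task_query.startswith("when "):
        return memory.localize_segment(task_query.split()[-1])
    return memory.query_object(task_query)

mem = VideoMemory()
mem.add_event(0, 5, "a person opens the fridge")
mem.add_track_state("cup_1", 12, (10, 20, 30, 40))
print(solve(mem, "when fridge"))   # segments whose caption mentions "fridge"
print(solve(mem, "cup_1"))         # tracking states for object cup_1
```

In the actual system, captions and tracking states would be produced by vision-language and tracking foundation models, and the routing decision would come from the LLM's tool-use reasoning rather than a keyword check.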
