Poster
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu · Hao Cheng · Haotian Liu · Hao Zhang · Feng Li · Tianhe Ren · Xueyan Zou · Jianwei Yang · Hang Su · Jun Zhu · Lei Zhang · Jianfeng Gao · Chunyuan Li
In this paper, we introduce LLaVA-Plus, an end-to-end training approach that systematically expands the capabilities of large multimodal models (LMMs) towards building general-purpose multimodal agents. LLaVA-Plus maintains a skill repository containing a wide range of pre-trained vision and vision-language models as multimodal tools. Given the user instruction and input image, the LMM is trained to activate the appropriate tools when needed, grasping skills on the fly and aggregating the tool execution results to complete real-world tasks in the wild. To help the model learn to use skills, we make the first attempt to build multimodal instruction-following data for tool use, covering skills in visual understanding, generation, external knowledge, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and adds many new ones. Compared with tool-use methods based on large language models (LLMs), LLaVA-Plus is distinct in that the query image is considered throughout the entire interaction, yielding higher multimodal tool-use performance and enabling new scenarios.
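To make the described pipeline concrete, below is a minimal sketch of the tool-use loop: the LMM reads the instruction together with the query image, optionally selects a tool from the skill repository, and aggregates the tool output into its final response. All names here (`SKILL_REPOSITORY`, `call_lmm`, `Request`) are hypothetical placeholders for illustration, not the actual LLaVA-Plus code or API.

```python
# Sketch of the LLaVA-Plus-style tool-use loop described in the abstract.
# Hypothetical names throughout; the real implementation differs.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Request:
    image: bytes          # query image, kept visible through the whole interaction
    instruction: str      # user instruction in natural language

# Skill repository: tool name -> callable wrapping a pre-trained vision /
# vision-language model (e.g. detection, segmentation, generation, retrieval).
SKILL_REPOSITORY: Dict[str, Callable[[bytes, str], str]] = {}

def call_lmm(image: bytes, prompt: str) -> dict:
    """Placeholder for the instruction-tuned LMM. Assumed to return either a
    direct answer or a structured tool call such as
    {"tool": name, "arguments": ..., "answer": ...}."""
    raise NotImplementedError  # stand-in only; not a real model call

def answer(request: Request) -> str:
    # 1. The LMM reads the image and instruction and decides whether a tool is needed.
    plan = call_lmm(request.image, request.instruction)
    if plan.get("tool") is None:
        return plan["answer"]                      # existing LLaVA-style capability

    # 2. Execute the selected skill on the *same* query image.
    tool = SKILL_REPOSITORY[plan["tool"]]
    tool_output = tool(request.image, plan["arguments"])

    # 3. The LMM aggregates the tool result with the image and instruction
    #    into the final response returned to the user.
    follow_up = f"{request.instruction}\n[tool output] {tool_output}"
    return call_lmm(request.image, follow_up)["answer"]
```

The key point the sketch illustrates is that the query image is passed to the LMM in both steps, before and after tool execution, rather than being summarized into text once and discarded as in LLM-only tool-use methods.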