Poster
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee · Yiran Luo · Tejas Gokhale · Yezhou Yang · Chitta R Baral
Poster #140 · Strong Double Blind
The rapid progression of Text-to-Image (T2I) and Multimodal Large Language Models (MLLMs) has resulted in their widespread adoption across multiple computer vision and natural language processing tasks. However, a common failure mode that persists across both classes of models is their inability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves and evaluates spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates a spatially accurate synthetic image given a textual prompt. REVISION is an extendable framework that currently supports 101 3D assets and 11 spatial relationships, all rendered with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also introduce the REVISION benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results indicate that rendering-based frameworks are an efficient approach for developing reasoning-aware generative models.
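To make the training-free guidance idea concrete, the sketch below shows one plausible instantiation under stated assumptions: a spatially correct reference image anchors the layout of an off-the-shelf image-to-image diffusion pass while the text prompt supplies appearance. `render_reference` is a hypothetical stand-in for the REVISION renderer (here it just draws two boxes in the stated relation), and the img2img setup is an assumption about how rendered images could serve as guidance, not necessarily the authors' exact procedure.

```python
# Minimal sketch: use a spatially accurate reference image as
# training-free guidance for a text-to-image diffusion model.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionImg2ImgPipeline

def render_reference(width: int = 512, height: int = 512) -> Image.Image:
    """Hypothetical stand-in for the REVISION renderer:
    draws object A to the left of object B."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.rectangle([60, 200, 200, 340], fill="gray")    # object A (left)
    draw.rectangle([310, 200, 450, 340], fill="brown")  # object B (right)
    return img

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat to the left of a chair"
reference = render_reference()  # spatially correct layout for the prompt
result = pipe(
    prompt=prompt,
    image=reference,
    strength=0.75,  # lower values preserve more of the reference layout
).images[0]
result.save("spatially_guided.png")
```

The `strength` parameter trades off fidelity to the rendered layout against freedom to follow the text prompt; whether REVISION uses exactly this mechanism is an assumption of this sketch.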