

Poster

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee · Yiran Luo · Tejas Gokhale · Yezhou Yang · Chitta R Baral

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

The rapid progression of Text-to-Image (T2I) and Multimodal Large Language Models (MLLMs) has resulted in their widespread adoption across multiple computer vision and natural language processing tasks. However, a common mode of failure that persists across both classes of models is their inability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves and evaluates spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images given a textual prompt. REVISION is an extensible framework that currently supports 101 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also introduce the REVISION benchmark to evaluate the spatial reasoning abilities of MLLMs and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an efficient approach for developing reasoning-aware generative models.
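For intuition, below is a minimal, hypothetical Python sketch of the core idea behind a rendering-based pipeline of this kind: parse a prompt for two objects and a supported spatial relationship, then compute placement coordinates that a 3D engine could use to render a spatially accurate guidance image. The relation set, offsets, and function names are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the REVISION-style idea: map a spatial-relation prompt
# to 3D asset placements for a renderer. Relations and offsets are illustrative.

RELATION_OFFSETS = {
    "to the left of": (-2.0, 0.0, 0.0),
    "to the right of": (2.0, 0.0, 0.0),
    "above": (0.0, 0.0, 2.0),
    "below": (0.0, 0.0, -2.0),
    "in front of": (0.0, -2.0, 0.0),
    "behind": (0.0, 2.0, 0.0),
}

def parse_prompt(prompt: str):
    """Split the prompt into (subject, relation, object) using a supported relation."""
    for relation in RELATION_OFFSETS:
        if relation in prompt:
            subject, obj = prompt.split(relation)
            return subject.strip(" .a"), relation, obj.strip(" .a")
    raise ValueError(f"No supported spatial relation found in: {prompt!r}")

def place_assets(prompt: str, anchor=(0.0, 0.0, 0.0)):
    """Return world-space coordinates for the two assets implied by the prompt."""
    subject, relation, obj = parse_prompt(prompt)
    dx, dy, dz = RELATION_OFFSETS[relation]
    subject_pos = (anchor[0] + dx, anchor[1] + dy, anchor[2] + dz)
    return {subject: subject_pos, obj: anchor}

if __name__ == "__main__":
    print(place_assets("a cat to the left of a dog"))
    # {'cat': (-2.0, 0.0, 0.0), 'dog': (0.0, 0.0, 0.0)}

The rendered output of such a scene could then serve as the additional guidance image for a T2I model in a training-free manner, for example via an image-conditioned generation step; the exact guidance mechanism used in the paper is described in the full text.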
