

Poster

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Xiaoyu Zhu · Hao Zhou · Pengfei Xing · Long Zhao · Hao Xu · Junwei Liang · Alexander G. Hauptmann · Ting Liu · Andrew Gallagher

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Traditional 3D scene understanding techniques rely on supervised learning from densely annotated 3D datasets. However, collecting and annotating 3D data is expensive and tedious, leading to a scarcity of labeled training data. In this paper, we investigate the use of diffusion models pre-trained on large-scale image-caption pairs for open-vocabulary 3D scene understanding. We propose a novel method, Diff2Scene, which leverages frozen representations from text-image discriminative and generative models, along with salient-aware and geometric-aware masks, for open-vocabulary scene understanding. Diff2Scene requires no labeled 3D data and effectively identifies objects, appearances, materials, locations, and their compositions in 3D scenes using a single model. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods on open-vocabulary 3D semantic segmentation tasks. In particular, Diff2Scene improves over the state-of-the-art method on ScanNet200 by 12%.
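The abstract describes assigning open-vocabulary labels using frozen text-image representations rather than a 3D-supervised classifier. The snippet below is a minimal sketch (not the authors' released code) of the generic open-vocabulary assignment step such pipelines rely on: pooled per-mask features that live in a text-aligned embedding space are matched to category text embeddings by cosine similarity. The shapes, feature dimension, and random tensors are illustrative placeholders; in Diff2Scene the features would come from frozen discriminative and generative text-image backbones and salient/geometric mask proposals.

```python
import torch
import torch.nn.functional as F


def classify_masks(mask_feats: torch.Tensor,
                   text_embeds: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Assign each mask proposal an open-vocabulary label by cosine similarity.

    mask_feats:  (M, D) frozen features pooled over each 3D mask proposal.
    text_embeds: (C, D) frozen text embeddings, one per candidate category name.
    Returns:     (M,) index of the best-matching category for each mask.
    """
    mask_feats = F.normalize(mask_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = mask_feats @ text_embeds.t() / temperature  # (M, C) similarity scores
    return logits.argmax(dim=-1)


if __name__ == "__main__":
    D = 512                       # embedding dimension (illustrative)
    masks = torch.randn(10, D)    # stand-in for pooled per-mask features
    prompts = torch.randn(200, D) # stand-in for ScanNet200 category prompts
    labels = classify_masks(masks, prompts)
    print(labels.shape)           # torch.Size([10])
```

Because both the mask features and the text embeddings are produced by frozen models, the category list can be changed at inference time without any retraining, which is what makes the segmentation open-vocabulary.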
