Poster

Think before Placement: Common Sense Enhanced Transformer for Object Placement

Yaxuan Qin ⋅ Jiayu Xu ⋅ Ruiping Wang ⋅ Xilin CHEN

Strong blind review: This paper was not made available on public preprint services during the review process

Strong Double Blind

2024 Poster

Project Page Paper PDF [ Slides] [ Poster] [ Supplemental]

Abstract

Object placement is a task to insert a foreground object into a background scene at a suitable position and size. Existing methods mainly focus on extracting better visual features, while neglecting common sense about the objects and background. It leads to semantically unrealistic object positions. In this paper, we introduce Think Before Placement, a novel framework that effectively combines the implicit and explicit knowledge to generate placements that are both visually coherent and contextually appropriate. Specifically, we first adopt a large multi-modal model to generate a descriptive caption of the background (Think), then output proper position and size of the object (Place). The caption serves as an explicit semantic guidance for the subsequent placement of objects. Using this framework, we implement our model named CSENet, which outperforms baseline methods on the OPA dataset in extensive experiments. Further, we establish the OPAZ dataset to evaluate the zero-shot transfer capabilities of CSENet, where it also shows impressive zero-shot performance across different foreground objects and scenes.

Chat is not available.