Poster

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu ⋅ Zhaoyang Zeng ⋅ Tianhe Ren ⋅ Feng Li ⋅ Hao Zhang ⋅ Jie Yang ⋅ Qing Jiang ⋅ Chunyuan Li ⋅ Jianwei Yang ⋅ Hang Su ⋅ Jun Zhu ⋅ Lei Zhang

2024 Poster


Abstract

In this paper, we develop an open-set object detector called Grounding DINO by marrying the Transformer-based detector DINO with grounded pre-training. The resulting model can detect arbitrary objects specified by human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector so that it generalizes to open-set concepts. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, language-guided query selection, and a cross-modality decoder. While previous works mainly evaluate open-set object detection on novel categories, we propose to also evaluate on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well in all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO, and sets a new record on the ODinW zero-shot benchmark with a mean AP of $26.1$.
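To make the language-guided query selection step more concrete, here is a minimal sketch in plain Python. It assumes that each image token is scored by its highest dot-product similarity to any text token and that the top-scoring tokens are kept as decoder queries; the abstract does not spell out this rule, so treat it as an illustration. The function name `select_queries`, its parameters, and the toy feature values are hypothetical and not taken from the authors' code.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumption: similarity is a plain dot product, and each image token is
    scored by its best match over all text tokens.
    """
    scores = []
    for img in image_feats:
        # Dot-product similarity between this image token and every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep the best match over all text tokens.
        scores.append(max(sims))
    # Sort indices by descending score; ties keep the lower index first.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:num_queries]


# Toy example: four image tokens and two text tokens, each 2-dimensional.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)
```

In a real detector the features would be batched tensors and the selected indices would initialize the decoder queries; the sketch only shows the scoring and top-k selection under the stated assumption.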
