Poster

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo ⋅ Ruichuan An ⋅ Bocheng Zou ⋅ Yiming Tang ⋅ Jiaming Liu ⋅ Shanghang Zhang

2024 Poster


Abstract

Subpopulation structure is a set of hierarchical relations among several subpopulations determined by certain criteria. Discovering such structure provides a comprehensive understanding of the dataset, which is beneficial to many downstream tasks, such as handling subpopulation shifts and slice discovery. Despite its importance, we find that no prior work systematically explores the subpopulation structure of datasets. Considering that solving this task requires a method with a broad understanding of various aspects of the datasets, in this work we leverage the world knowledge, summarization, and instruction-following capabilities of Large Language Models (LLMs) to explore the latent subpopulation structure of image datasets. Specifically, we propose a novel approach named Subpopulation Structure Discovery with Large Language Models (SSD-LLM), whose core idea is to generate and analyze informative image captions and then use an LLM to summarize the structural characteristics of the dataset based on that analysis. SSD-LLM consists of two novel prompt engineering components, Criteria Initialization and Criteria Self-Refinement, which ensure a token-efficient and reliable discovery process. SSD-LLM offers a unified paradigm to address multiple downstream tasks with simple task-specific prompt tuning, including dataset organization, long-tail attribute identification, slice discovery, and our proposed slice prediction. We validate the effectiveness of SSD-LLM through these subpopulation-related tasks. We hope to inspire the community to explore the potential of LLMs as dataset analysts.
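To make the idea of a hierarchical subpopulation structure more concrete, here is a minimal Python sketch. It is not the SSD-LLM implementation: the criteria, the captions, and every function name are hypothetical, and a simple keyword match stands in for the LLM step that assigns captions to attributes. It only illustrates a class → dimension → attribute hierarchy built from captions and how long-tail attributes could be read off from it.

```python
from collections import Counter, defaultdict

# Hypothetical criteria: dimension -> candidate attributes (assumed for illustration).
CRITERIA = {
    "color": ["black", "white", "brown"],
    "location": ["grass", "beach", "indoors"],
}

def assign_attributes(caption, criteria):
    """Stand-in for the LLM step: keyword-match a caption against the criteria."""
    words = set(caption.lower().replace(".", "").split())
    return {dim: [a for a in attrs if a in words] for dim, attrs in criteria.items()}

def build_structure(captions_by_class, criteria):
    """Build a class -> dimension -> attribute -> count hierarchy from captions."""
    structure = defaultdict(lambda: defaultdict(Counter))
    for cls, captions in captions_by_class.items():
        for caption in captions:
            for dim, attrs in assign_attributes(caption, criteria).items():
                structure[cls][dim].update(attrs)
    return structure

def long_tail_attributes(structure, cls, max_count=1):
    """Return (dimension, attribute) pairs observed at most max_count times."""
    return sorted(
        (dim, attr)
        for dim, counts in structure[cls].items()
        for attr, n in counts.items()
        if n <= max_count
    )

# Toy captions (made up); in practice these would come from a captioning model.
captions = {
    "dog": [
        "A black dog running on grass.",
        "A black dog lying on grass.",
        "A white dog on the beach.",
    ]
}
structure = build_structure(captions, CRITERIA)
rare = long_tail_attributes(structure, "dog")
```

In this toy data, "black" and "grass" each appear twice for the class "dog", while "white" and "beach" appear once and are therefore flagged as rare. Attributes that never occur in any caption (such as "brown") are not reported by this sketch.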
