Poster

Sapiens: Foundation for Human Vision Models

Rawal Khirodkar · Timur Bagautdinov · Julieta Martinez · Zhaoen Su · Austin T James · Peter Selednik · Stuart Anderson · Shunsuke Saito

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Wed 2 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

We present Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are easy to adapt to individual tasks by fine-tuning foundational models pretrained on over 300 million in-the-wild human images. Our key insight is that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts performance across a diverse set of human-centric tasks. We demonstrate that the resulting foundational models generalize remarkably well to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability: model performance across tasks improves significantly as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing complex baselines across various human-centric benchmarks. Specifically, we achieve significant improvements over the prior state of the art on COCO-Wholebody (pose) by 7.9 mAP, CIHP (part-seg) by 1.3 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
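
The reported gains mix two conventions: the pose and part-segmentation numbers are absolute differences in metric points (mAP, mIoU), while the depth and normal numbers are relative reductions in an error metric. A minimal sketch of the distinction follows, assuming the usual definition of relative reduction as (baseline - new) / baseline. The function names and all input values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def absolute_gain(baseline_score: float, new_score: float) -> float:
    """Gain in metric points for a higher-is-better score such as mAP or mIoU."""
    return new_score - baseline_score


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which an error metric (e.g. RMSE) drops relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical values, not results from the paper.
print(f"{absolute_gain(60.0, 65.0):.1f} points absolute gain")
print(f"{relative_reduction(0.50, 0.40):.1f}% relative error reduction")
```

Under this reading, a 22.4% relative RMSE improvement means the new error is roughly 0.776 times the baseline error, whatever the baseline's absolute value.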
