Existing 2D-to-3D pose lifting networks suffer from poor performance in cross-dataset benchmarks. Although representing a pose as 2D keypoints joined by "stick-figure" limbs is the dominant approach, stick figures do not preserve the occlusion information inherent in an image, resulting in significant ambiguities that would be ruled out if occlusion information were present. In addition, datasets with ground-truth (GT) 3D poses are much harder to obtain than comparable human-annotated 2D datasets. To address these issues, we propose to replace stick figures with abstract images: figures with opaque limbs that preserve occlusion information while implicitly encoding joint locations. We then break the pose estimation task into two stages: (1) generating an abstract image from a real image, and (2) recovering the pose from the abstract image. Crucially, given the GT 3D keypoints for a particular pose, we can synthesize an arbitrary number of abstract images of the same pose as seen from arbitrary cameras, even without a part map. This allows Stage (2) to be trained on an effectively unlimited dataset without overfitting, which in turn allows it to correctly interpret poses from arbitrary viewpoints not included in the original dataset. This unrestricted training of Stage (2) also generalizes well across datasets, as demonstrated by a significant improvement in cross-dataset benchmarks, while maintaining competitive performance in same-dataset benchmarks.
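To make the Stage (2) data synthesis concrete, the following is a minimal sketch of rendering an "abstract image" from GT 3D keypoints under a randomly sampled camera. The skeleton connectivity, camera sampling, pinhole projection parameters, and painter's-algorithm rendering of opaque limbs shown here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical limb connectivity over 17 joints (an assumption, not the paper's skeleton).
LIMBS = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10),
         (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16)]

def random_camera():
    """Sample a random yaw plus a small pitch as a 3x3 rotation matrix."""
    yaw = np.random.uniform(0, 2 * np.pi)
    pitch = np.random.uniform(-0.3, 0.3)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    return Rx @ Ry

def render_abstract_image(joints3d, size=256, focal=1000.0, depth=4000.0, limb_width=14):
    """Project 3D joints with a random camera and draw opaque limbs
    back-to-front, so nearer limbs occlude farther ones."""
    R = random_camera()
    cam = joints3d @ R.T
    cam[:, 2] += depth                       # push the skeleton in front of the camera
    uv = focal * cam[:, :2] / cam[:, 2:3]    # pinhole projection to image coordinates
    uv = uv - uv.mean(axis=0) + size / 2     # centre the figure on the canvas

    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    # Painter's algorithm: sort limbs by mean depth, draw the farthest first.
    order = sorted(range(len(LIMBS)),
                   key=lambda i: -cam[list(LIMBS[i]), 2].mean())
    for i in order:
        a, b = LIMBS[i]
        shade = 40 + 12 * i                  # distinct tone per limb so parts stay identifiable
        draw.line([tuple(uv[a]), tuple(uv[b])],
                  fill=(shade, shade, shade), width=limb_width)
    return np.asarray(img)

# Usage: synthesize any number of views of the same ground-truth pose.
joints3d = np.random.randn(17, 3) * 300      # placeholder pose; real GT keypoints would go here
views = [render_abstract_image(joints3d) for _ in range(8)]
```

Because each call samples a fresh camera, a single GT pose yields an unbounded stream of occlusion-aware training images for Stage (2); depth-ordered opaque limbs are what distinguish these renderings from a plain stick figure.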