Vision transformers have demonstrated promising results and are core components in many tasks. While existing works have explored diverse interaction or transformation modules to process image tokens, most of them still focus on context feature extraction, supplemented with spatial information injected through additional positional embeddings. However, the local positional information within each image token hinders effective spatial scene modeling, making the learned representation hard to adapt directly to downstream tasks, especially those that require high-resolution fine-tuning or 3D scene understanding. To address this challenge, we propose SpatialFormer, an efficient vision transformer architecture designed to facilitate adaptive spatial modeling for generalizable image representation learning. Specifically, we accompany the image tokens with a set of adaptive spatial tokens that represent the context and spatial information, respectively. Each spatial token is initialized with its positional encoding and augmented with learnable embeddings to introduce essential spatial priors that enhance the context features. We employ a decoder-only architecture to enable efficient interaction between the two types of tokens. Our approach learns transferable image representations with enhanced abilities for scene understanding. Moreover, the generated spatial tokens can serve as enhanced initial queries for task-specific decoders, facilitating adaptation to downstream tasks. Extensive experiments on standard image classification and downstream 2D and 3D perception tasks demonstrate the efficiency and transferability of the proposed SpatialFormer architecture.
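To make the dual-token idea concrete, the sketch below illustrates one plausible reading of the abstract: a set of spatial tokens, initialized from fixed positional encodings plus a learnable prior, is refined by decoder-style blocks that cross-attend to the image tokens. All names, dimensions, and the exact interaction pattern are assumptions for illustration only; the abstract does not specify the authors' implementation.

```python
# Hypothetical sketch of the abstract's dual-token design (not the authors' code).
import torch
import torch.nn as nn


def sincos_2d_positional_encoding(h, w, dim):
    """Standard 2D sine-cosine positional encoding over an h x w grid."""
    assert dim % 4 == 0
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    omega = 1.0 / (10000 ** (torch.arange(dim // 4, dtype=torch.float32) / (dim // 4)))
    return torch.cat(
        [
            torch.sin(xs.reshape(-1, 1) * omega),
            torch.cos(xs.reshape(-1, 1) * omega),
            torch.sin(ys.reshape(-1, 1) * omega),
            torch.cos(ys.reshape(-1, 1) * omega),
        ],
        dim=1,
    )  # shape: (h*w, dim)


class DecoderBlock(nn.Module):
    """One decoder-style block: spatial tokens self-attend, then cross-attend to image tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, spatial_tokens, image_tokens):
        x = self.norm1(spatial_tokens)
        spatial_tokens = spatial_tokens + self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm2(spatial_tokens)
        spatial_tokens = spatial_tokens + self.cross_attn(x, image_tokens, image_tokens, need_weights=False)[0]
        return spatial_tokens + self.mlp(self.norm3(spatial_tokens))


class SpatialTokenSketch(nn.Module):
    """Toy model: image tokens carry context, spatial tokens carry geometry."""

    def __init__(self, dim=256, grid=(7, 7), depth=4):
        super().__init__()
        h, w = grid
        # Spatial tokens start from fixed positional encodings plus a learnable
        # embedding, mirroring the "positional encoding + learnable prior" description.
        self.register_buffer("pos_enc", sincos_2d_positional_encoding(h, w, dim))
        self.learned_prior = nn.Parameter(torch.zeros(h * w, dim))
        self.blocks = nn.ModuleList(DecoderBlock(dim) for _ in range(depth))

    def forward(self, image_tokens):  # image_tokens: (B, N, dim) from any patch embedder
        b = image_tokens.shape[0]
        spatial = (self.pos_enc + self.learned_prior).unsqueeze(0).expand(b, -1, -1)
        for blk in self.blocks:
            spatial = blk(spatial, image_tokens)
        # The refined spatial tokens could then seed a task-specific decoder.
        return spatial


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)        # e.g. 14x14 patch tokens from a ViT backbone
    queries = SpatialTokenSketch()(tokens)   # (2, 49, 256) refined spatial tokens
    print(queries.shape)
```

In this reading, the returned spatial tokens play the role the abstract assigns them: enhanced initial queries that a downstream 2D or 3D task decoder can consume directly, while the image tokens continue to serve as the context representation.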