Modality alignment has been central to recent advances in multimodal learning, inspiring many innovations in multimodal networks and pre-training tasks. Single-stream networks can effectively leverage self-attention to facilitate modality interactions, but they suffer from high computational complexity and limited applicability to downstream retrieval tasks. Dual-stream networks address these issues but largely forgo explicit modality alignment. In this paper, we develop a multimodal learning method that brings the modality-alignment advantages of single-stream networks into a dual-stream network by introducing object-oriented anchors that bridge the image and text modalities. These anchors can be generated efficiently and circumvent the object detection boxes required by previous region-based approaches, while still preserving explicit semantics for modality interaction. Additionally, we design fine-grained token-level asymmetric alignment between modalities and multiview mining to further promote modality alignment. To the best of our knowledge, we are the first to apply object-oriented tokens in multimodal pre-training, which yields significant benefits. Extensive experimental results validate the effectiveness of our method, demonstrating that it outperforms most prior art on various downstream tasks, particularly at comparable data and model scales.
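To make the anchor-bridging idea concrete, the sketch below shows one plausible instantiation in PyTorch: a shared set of learnable anchor tokens cross-attends into each stream's token features, and a per-anchor contrastive loss aligns the two resulting views. This is a minimal sketch under stated assumptions; the names (`AnchorBridgedDualStream`, `token_alignment_loss`), dimensions, and loss form are illustrative and not the paper's actual implementation.

```python
# Minimal sketch of the anchor-bridged dual-stream idea described above.
# All module names, dimensions, and the contrastive loss form are
# illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnchorBridgedDualStream(nn.Module):
    """Dual-stream encoders whose outputs are aligned through a shared
    set of learnable object-oriented anchor tokens (hypothetical design)."""

    def __init__(self, dim=256, num_anchors=16, num_heads=8):
        super().__init__()
        # Shared anchors that both modalities attend from, standing in
        # for the object detection boxes of region-based approaches.
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.img_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, dim) patch features from an image encoder
        # txt_tokens: (B, Nt, dim) token features from a text encoder
        B = img_tokens.size(0)
        anchors = self.anchors.unsqueeze(0).expand(B, -1, -1)
        # Anchors query each modality, producing anchor-grounded summaries
        # without any cross-modal parameter sharing beyond the anchors.
        img_anchored, _ = self.img_attn(anchors, img_tokens, img_tokens)
        txt_anchored, _ = self.txt_attn(anchors, txt_tokens, txt_tokens)
        return img_anchored, txt_anchored


def token_alignment_loss(img_anchored, txt_anchored, temperature=0.07):
    """Fine-grained alignment: match each anchor's image view to the same
    anchor's text view across the batch (a plausible, not verbatim, loss)."""
    img = F.normalize(img_anchored, dim=-1)  # (B, K, dim)
    txt = F.normalize(txt_anchored, dim=-1)
    # Per-anchor similarity between every image/text pair in the batch,
    # averaged over anchors to yield a (B, B) logit matrix.
    sim = torch.einsum("ikd,jkd->ijk", img, txt).mean(-1) / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    # Two-direction InfoNCE (image-to-text and text-to-image terms).
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))


# Usage with random features standing in for encoder outputs:
model = AnchorBridgedDualStream()
img_feats, txt_feats = torch.randn(4, 196, 256), torch.randn(4, 32, 256)
loss = token_alignment_loss(*model(img_feats, txt_feats))
```

Because the anchors are the only point of cross-modal coupling, the two encoder streams can still be run independently at inference time, which is what preserves the dual-stream efficiency advantage for retrieval.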