Compositional Zero-Shot Learning (CZSL) aims to classify unseen state-object compositions using seen primitives. Previous methods commonly map the same primitive from different compositions to the same area within embedding space, aiming to establish primitive representation or assess decoding proficiency. However, relying solely on the intersection area of primitive concepts might overlook nuanced semantics due to conditional variance, thereby limiting the model's capacity to generalize to unseen compositions. In contrast, our approach constructs primitive representations by considering the union area of primitives. We propose a Multiple Representation of Single Primitive learning framework (termed MRSP) for CZSL, which captures composition-relevant features through a state-object-composition three-branch cross-attention architecture. Specifically, the input image feature cross-attends to multiple state, object, and composition features, and the prediction scores are adaptively adjusted by combining the output of each branch. Extensive experiments on three benchmarks in both closed-world and open-world settings showcase the superior effectiveness of MRSP.
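As a rough illustration of the three-branch idea described above, the following minimal sketch lets one image feature cross-attend to several candidate representations per state, object, and composition, then combines the three branch scores. It is not the MRSP implementation: the single-head dot-product attention, the dot-product scoring, the fixed branch weights (standing in for the adaptive adjustment), and all names and toy vectors are assumptions made only for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(query, candidates):
    """Single-head scaled dot-product cross-attention.

    The image feature is the query; the candidate representations of one
    primitive (or composition) act as both keys and values.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in candidates])
    return [sum(w * c[i] for w, c in zip(weights, candidates))
            for i in range(len(query))]

def branch_scores(image_feat, class_reps):
    """Score each label by similarity between the image feature and the
    attended mixture of that label's multiple representations."""
    return {label: dot(image_feat, cross_attend(image_feat, reps))
            for label, reps in class_reps.items()}

def combine(state_scores, object_scores, comp_scores, weights=(1.0, 1.0, 1.0)):
    """Combine the three branches. Fixed weights are an assumption; the
    abstract only says the scores are 'adaptively adjusted'."""
    w_s, w_o, w_c = weights
    return {(s, o): w_c * score + w_s * state_scores[s] + w_o * object_scores[o]
            for (s, o), score in comp_scores.items()}

# Hypothetical toy data: 2-D features, two representations per primitive.
image = [1.0, 0.0]
states = {"wet": [[1.0, 0.0], [0.8, 0.2]], "dry": [[0.0, 1.0], [0.1, 0.9]]}
objects = {"dog": [[0.9, 0.1], [1.0, 0.0]], "cat": [[0.0, 1.0], [0.2, 0.8]]}
comps = {("wet", "dog"): [[1.0, 0.0]], ("dry", "cat"): [[0.0, 1.0]],
         ("wet", "cat"): [[0.5, 0.5]], ("dry", "dog"): [[0.5, 0.5]]}

final = combine(branch_scores(image, states),
                branch_scores(image, objects),
                branch_scores(image, comps))
best = max(final, key=final.get)  # ("wet", "dog") for this toy input
```

Keeping several representations per primitive, rather than a single shared embedding, is what lets the attended mixture shift with the input image. In a trained model the representations and the branch weighting would be learned rather than fixed as they are here.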