

Poster

OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Yini Fang · Jingling Yu · Haozheng Zhang · Ralf van der Lans · Bertram E Shi

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Visual search is important in our daily life, and the efficient allocation of visual attention is critical to completing visual search tasks effectively. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than spatial coordinates. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractor objects. OAT uses an encoder-decoder architecture. The encoder captures information about the positions and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and on a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns than those of algorithms based on spatial attention, as measured by established metrics and a novel behavior-based metric. Our results demonstrate the generalizability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.
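To make the abstract's architecture description concrete, the following is a minimal PyTorch sketch of an object-level encoder-decoder of the general shape described above. It is not the authors' implementation: the class name OATSketch, the (x, y, w, h) box parameterization of object positions, the learned per-fixation query embeddings, the simple linear projection standing in for the paper's proposed positional encoding, and the dot-product scoring of decoder outputs against object tokens are all assumptions made for illustration.

import torch
import torch.nn as nn

class OATSketch(nn.Module):
    """Illustrative encoder-decoder sketch for object-level scanpath prediction.

    Each scene is a set of N object tokens (appearance features plus an
    embedding of the object's position); the search target is an extra token.
    The decoder emits, for each fixation step, a score distribution over the
    N candidate objects. All layer sizes and design details are assumptions.
    """

    def __init__(self, feat_dim=256, n_heads=8, n_layers=4, max_fixations=20):
        super().__init__()
        # Assumed positional encoding: linear projection of (x, y, w, h) boxes;
        # the paper proposes a different, relation-aware positional encoding.
        self.pos_proj = nn.Linear(4, feat_dim)
        # Learned query embedding per fixation step (assumption).
        self.step_embed = nn.Embedding(max_fixations, feat_dim)
        enc_layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(feat_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, obj_feats, obj_boxes, target_feat, n_steps):
        # obj_feats:   (B, N, feat_dim) appearance features of candidate objects
        # obj_boxes:   (B, N, 4) object positions as (x, y, w, h)
        # target_feat: (B, feat_dim) feature of the search target
        tokens = obj_feats + self.pos_proj(obj_boxes)
        tokens = torch.cat([target_feat.unsqueeze(1), tokens], dim=1)
        memory = self.encoder(tokens)                      # (B, N+1, feat_dim)
        B = obj_feats.size(0)
        steps = self.step_embed.weight[:n_steps].unsqueeze(0).expand(B, -1, -1)
        dec_out = self.decoder(steps, memory)              # (B, n_steps, feat_dim)
        # Score each fixation step against the object tokens; a softmax over
        # the last dimension gives a per-step distribution over objects.
        logits = dec_out @ memory[:, 1:].transpose(1, 2)   # (B, n_steps, N)
        return logits

# Example usage with random inputs: 2 scenes, 12 objects each, 6 fixations.
model = OATSketch()
logits = model(torch.randn(2, 12, 256), torch.rand(2, 12, 4),
               torch.randn(2, 256), n_steps=6)
print(logits.shape)  # torch.Size([2, 6, 12])

A scanpath could then be read out by taking the argmax over objects at each step; the actual decoding procedure, loss, and positional encoding used by OAT are described in the paper.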
