We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER ("Reader"). In contrast to the majority of previous methods, which represent each image as a grid-shaped tokens map, "Reader" perceives each image into sequential region-based tokens, with each token corresponding to an object or one part of an object in the image. Specifically, "Reader" comprises both an encoder and a decoder. The encoder can partition each image into an adaptive number of arbitrary-shaped regions and encode each region into a token. Subsequently, the decoder utilizes this adaptive-length token sequence to reconstruct the original image. Experimental results demonstrate that such region-based token representation possesses two main notable characteristics. Firstly, it achieves highly efficient image encoding. "Reader" can adaptively use more regions to represent complex areas and fewer regions in simpler ones, thus avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods, despite using significantly fewer tokens for each image. Secondly, the region-based manner enables manipulation on a local region without causing global changes. As a result, "Reader" inherently supports diverse image editing operations, including erasing, adding, replacing, and modifying shapes on the objects, and achieves excellent performance in the image editing benchmark of smile transferring. Codes will be provided for reproducibility.
Live content is unavailable. Log in and register to view live content