Abstract:
The unsupervised task of Joint Alignment (JA) of images is complicated by several hurdles such as high complexity, geometric distortions, and convergence to poor local optima or, due to undesired trivial solutions, even poor global optima. While Vision Transformer (ViT) features were recently shown to be useful for, among other things, JA, these features do not, by themselves, completely eliminate the aforementioned issues. Thus, even with ViT features, researchers opt to tackle JA using both 1) expensive models and 2) numerous regularization terms. Unfortunately, that approach leads to long training times and hard-to-tune, dataset-specific regularization hyperparameters. Our approach is different. We introduce the Spatial Joint Alignment Model (SpaceJAM), a lightweight method that simplifies JA considerably. Specifically, the method consists of a novel loss function, a Lie-algebraic parametrization, an efficient handling of flips, a small autoencoder, and a small recurrent Spatial Transformer Network. The entire model has only $\sim$16K trainable parameters. Of note, the regularization-free SpaceJAM obviates the need to explicitly maintain an atlas during training. Optionally, after solving the JA task, SpaceJAM can generate an atlas in a single forward pass. Evaluated on the SPair-71K and CUB datasets, and compared with existing methods, SpaceJAM achieves better alignment capabilities and does so with orders of magnitude fewer trainable parameters and at least a 10x speedup in training time. Our code will be released upon acceptance.
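To make the "Lie-algebraic parametrization" and "Spatial Transformer" ingredients concrete, below is a minimal, hedged sketch (not the authors' code) of how an affine warp can be parametrized in the Lie algebra and applied with a spatial transformer in PyTorch. The function names (`lie_to_affine`, `warp`) and the 6-dimensional parametrization are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: a Lie-algebraic parametrization of affine warps
# driving a spatial-transformer-style resampling step. Names and sizes here
# are hypothetical and do not reproduce SpaceJAM's actual architecture.
import torch
import torch.nn.functional as F


def lie_to_affine(v: torch.Tensor) -> torch.Tensor:
    """Map a (B, 6) Lie-algebra vector to a (B, 2, 3) affine matrix via the
    matrix exponential, so that v = 0 corresponds to the identity warp."""
    B = v.shape[0]
    A = torch.zeros(B, 3, 3, device=v.device, dtype=v.dtype)
    A[:, :2, :] = v.view(B, 2, 3)      # algebra element in the top 2x3 block
    T = torch.matrix_exp(A)            # exponential map: algebra -> group
    return T[:, :2, :]                 # affine_grid expects a (B, 2, 3) theta


def warp(images: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Resample `images` (B, C, H, W) with the affine warp parametrized by v."""
    theta = lie_to_affine(v)
    grid = F.affine_grid(theta, list(images.shape), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)


# Usage: zero Lie-algebra parameters give (approximately) the identity warp.
imgs = torch.rand(4, 3, 64, 64)
out = warp(imgs, torch.zeros(4, 6))
print((out - imgs).abs().max())        # should be numerically negligible
```

One appeal of this kind of parametrization is that unconstrained real-valued outputs of a small network always map to valid (invertible) transformations, with the identity at the origin, which is consistent with the lightweight, regularization-free design the abstract describes.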