Earth Observation (EO) presents a unique opportunity to explore self-supervised multimodal learning, given its access to vast and diverse data captured by various sensors. However, current multimodal EO datasets and models often consider modalities from a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the natural alignment between multiple EO modalities to learn expressive multimodal representations without labels. We augment an existing dataset with new modalities to demonstrate the advantages of combining modalities of different natures. We evaluate OmniSat and various state-of-the-art approaches on two relevant downstream tasks: forestry and land cover classification. Our results show that OmniSat can learn rich representations in an unsupervised manner, leading to performance improvements in the semi- and fully-supervised settings, even when only one modality is available at inference. Our code, weights, and dataset are available at https://github.com/gastruc/OmniSat.
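To make the idea of exploiting natural cross-modal alignment more concrete, here is a minimal sketch of a symmetric contrastive objective between two modality encoders. This is not the official OmniSat objective (see the repository above for that); the encoders, projection heads, and the assumption of one embedding per geo-referenced patch are illustrative choices.

```python
# Minimal sketch of cross-modal contrastive alignment. Assumes two hypothetical
# per-modality encoders that each produce one embedding per geo-referenced patch,
# with rows aligned by patch index across modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveAlignment(nn.Module):
    """Symmetric InfoNCE loss: pulls together embeddings of the same patch
    observed by two different sensors, pushes apart different patches."""

    def __init__(self, dim: int = 256, temperature: float = 0.07):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)   # projection head, modality A
        self.proj_b = nn.Linear(dim, dim)   # projection head, modality B
        self.temperature = temperature

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, dim); row i of each tensor describes the
        # same ground patch as seen by two different modalities.
        za = F.normalize(self.proj_a(feats_a), dim=-1)
        zb = F.normalize(self.proj_b(feats_b), dim=-1)
        logits = za @ zb.t() / self.temperature            # (batch, batch) similarities
        targets = torch.arange(za.size(0), device=za.device)
        loss_ab = F.cross_entropy(logits, targets)          # A -> B direction
        loss_ba = F.cross_entropy(logits.t(), targets)      # B -> A direction
        return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs.
    loss_fn = ContrastiveAlignment(dim=256)
    a = torch.randn(8, 256)   # e.g. aerial-image patch embeddings
    b = torch.randn(8, 256)   # e.g. satellite time-series patch embeddings
    print(loss_fn(a, b).item())
```

Because the loss requires no labels, only spatially co-registered observations, it can be trained on large unlabeled EO archives before fine-tuning on downstream tasks such as forestry or land cover classification.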