This paper explores self-supervised disentangled representation learning for sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that explicitly accounts for the causal relationship between the static and dynamic variables and improves model expressivity through additional normalizing flows. We give a formal definition of the factors; this formalism leads to the derivation of sufficient conditions under which the ground-truth factors can be identified, and to the introduction of a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into the framework. Experiments show that the proposed approach outperforms previous state-of-the-art techniques, which generalize poorly in more realistic scenarios where the dynamics of a scene are influenced by its content.
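The abstract gives no implementation details, so the following PyTorch sketch only approximates the kind of architecture described: a time-invariant latent s, per-frame dynamic latents z_t whose prior is conditioned on s (so that content can influence dynamics), and a planar normalizing flow adding expressivity to the dynamic posterior. All class and parameter names (`StaticDynamicVAE`, `PlanarFlow`, `s_dim`, `z_dim`, ...) are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn


class PlanarFlow(nn.Module):
    """One planar normalizing-flow layer: f(z) = z + u * tanh(w . z + b).
    Simplified sketch: the invertibility constraint on u is omitted."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                        # z: (N, dim)
        lin = z @ self.w + self.b                # (N,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det J| = log|1 + u . psi|, psi = (1 - tanh^2(lin)) * w
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log((1 + psi @ self.u).abs() + 1e-8)
        return f, log_det


class StaticDynamicVAE(nn.Module):
    """Sequential VAE with a static latent s and dynamic latents z_t whose
    prior is conditioned on s, modeling content influencing the dynamics."""
    def __init__(self, x_dim, s_dim, z_dim, h_dim=128):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.static_head = nn.Linear(h_dim, 2 * s_dim)   # q(s | x_{1:T})
        self.dyn_rnn = nn.GRU(h_dim, h_dim, batch_first=True)
        self.dyn_head = nn.Linear(h_dim, 2 * z_dim)      # q(z_t | x_{<=t})
        self.prior_rnn = nn.GRUCell(z_dim + s_dim, h_dim)
        self.prior_head = nn.Linear(h_dim, 2 * z_dim)    # p(z_t | z_{<t}, s)
        self.flow = PlanarFlow(z_dim)                    # extra expressivity
        self.dec = nn.Sequential(nn.Linear(s_dim + z_dim, h_dim),
                                 nn.ReLU(), nn.Linear(h_dim, x_dim))

    @staticmethod
    def _sample(stats):
        # reparameterized Gaussian sample from concatenated mean/log-variance
        mu, logvar = stats.chunk(2, -1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    def prior_params(self, z, s):
        """Roll the content-conditioned prior p(z_t | z_{<t}, s) over a clip."""
        B, T, zd = z.shape
        h = z.new_zeros(B, self.prior_rnn.hidden_size)
        z_prev = z.new_zeros(B, zd)
        stats = []
        for t in range(T):
            h = self.prior_rnn(torch.cat([z_prev, s], -1), h)
            stats.append(self.prior_head(h))
            z_prev = z[:, t]
        return torch.stack(stats, 1).chunk(2, -1)        # prior means, logvars

    def forward(self, x):                                # x: (B, T, x_dim)
        h = self.frame_enc(x)
        s, mu_s, logvar_s = self._sample(self.static_head(h.mean(1)))
        hz, _ = self.dyn_rnn(h)
        z0, mu_z, logvar_z = self._sample(self.dyn_head(hz))
        B, T, zd = z0.shape
        z, log_det = self.flow(z0.reshape(B * T, zd))    # flow-refined posterior
        z = z.reshape(B, T, zd)
        x_hat = self.dec(torch.cat([s.unsqueeze(1).expand(-1, T, -1), z], -1))
        return x_hat, (mu_s, logvar_s), (mu_z, logvar_z), log_det
```

A training loop built on this sketch would optimize a standard sequential ELBO: reconstruction loss plus KL terms between the posteriors and the s-conditioned prior (via `prior_params`), with the flow's log-determinant entering the posterior density, together with the disentanglement constraint the paper introduces (not specified in the abstract).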