Training a temporal action segmentation (TAS) model on long, untrimmed videos requires gathering framewise video annotations, which is very costly. We propose a two-stage active learning framework to efficiently learn a TAS model using only a small amount of video annotation. Our framework consists of three components that work together in each active learning iteration. 1) Using the currently labeled frames, we learn a TAS model and action prototypes with a novel contrastive learning method. Leveraging prototypes not only enhances model performance but also increases the computational efficiency of both video and frame selection for labeling, which are the next two components of our framework. 2) Using the currently learned TAS model and action prototypes, we select informative unlabeled videos for annotation. To do so, we find unlabeled videos that have low alignment scores to the action prototype sequences learned from labeled videos. 3) To annotate a small subset of informative frames in each selected video, we propose a video-aligned summary selection method and an efficient greedy search algorithm. Evaluating on four benchmark datasets (50Salads, GTEA, Breakfast, CrossTask), we show that our method significantly reduces annotation cost while consistently surpassing baselines. We further extend our framework to a semi-supervised active learning setting and show that we can match fully supervised performance using only a small number of labeled frames. To the best of our knowledge, this is the first work to study active learning for TAS.
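To make the video-selection step (component 2) concrete, below is a minimal Python sketch of selecting the unlabeled videos least explained by the current prototypes. All names (`alignment_score`, `select_videos`) and the scoring rule are illustrative assumptions: alignment is approximated here by per-frame nearest-prototype cosine similarity, not the paper's sequence-level alignment to prototype sequences in labeled videos.

```python
# Hypothetical sketch: pick unlabeled videos with low alignment to learned prototypes.
# Assumes each unlabeled video is a (T, D) array of frame embeddings and
# `prototypes` is a (K, D) array of learned action prototypes.
import numpy as np

def alignment_score(frames: np.ndarray, prototypes: np.ndarray) -> float:
    """Mean cosine similarity of each frame to its nearest prototype
    (a crude stand-in for the paper's sequence-level alignment score)."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T                      # (T, K) frame-to-prototype similarities
    return float(sim.max(axis=1).mean())

def select_videos(unlabeled: dict, prototypes: np.ndarray, budget: int) -> list:
    """Return the `budget` unlabeled video ids with the lowest alignment score,
    i.e. the videos least covered by the current action prototypes."""
    scores = {vid: alignment_score(x, prototypes) for vid, x in unlabeled.items()}
    return sorted(scores, key=scores.get)[:budget]

# Toy usage with random features.
rng = np.random.default_rng(0)
unlabeled = {f"video_{i}": rng.normal(size=(200, 64)) for i in range(10)}
prototypes = rng.normal(size=(8, 64))
print(select_videos(unlabeled, prototypes, budget=3))
```

The selected videos would then be passed to the frame-selection step (component 3), which chooses a small summary of frames within each video for annotation.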