Skeleton-based Temporal Action Segmentation (STAS) aims at densely segmenting and classifying human actions in long untrimmed skeletal motion sequences. Existing STAS methods primarily model the spatial dependencies among joints and the temporal relationships among frames to generate frame-level one-hot classifications. However, these research overlook the deep mining of semantic relations among joints as well as actions at a linguistic level, which limits the comprehensiveness of skeleton action understanding. In this work, we propose a Language-assisted Skeleton Action Understanding (LaSA) method, leveraging Large-scale Language Models (LLM) to assist in learning semantic relationships among joints and actions. Specifically, in terms of joint relationships, the Joint Relationships Establishment (JRE) module establishes correlations among joints in the feature sequence through attention between joint texts and embeds joint texts as position embeddings to differentiate distinct joints. Regarding action relationships, the Action Relationships Supervision (ARS) module enhances the discrimination across action classes through contrastive learning of single-class action-text pairs and temporally models the semantic associations of adjacent actions by contrasting mixed-class clip-text pairs. Performance evaluation on five public datasets demonstrates that LaSA has achieved state-of-the-art performance.
Live content is unavailable. Log in and register to view live content