Sign Language Translation (SLT) aims to translate sign videos into spoken sentences. While gloss sequences, the written approximations of sign videos, provide informative alignment supervision for visual representation learning in SLT, the high cost of gloss annotation hampers scalability. Recent works have yet to achieve satisfactory results without gloss annotations. In this study, we attribute the challenge to the flexible correspondence between visual and textual tokens, and we address it by constructing a gloss-like constraint from spoken sentences. Specifically, we propose a Visual Alignment Pre-training (VAP) scheme that exploits visual information by aligning visual and textual tokens in a greedy manner. The VAP scheme strengthens the visual encoder's ability to capture semantic-aware visual information and facilitates better adaptation to pre-trained translation modules. Experimental results across four SLT benchmarks demonstrate the effectiveness of the proposed method, which not only generates reasonable alignments but also significantly narrows the performance gap with gloss-based methods.
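To make the greedy visual-textual alignment idea concrete, the following is a minimal sketch, not the paper's actual formulation: it assumes frame features and text-token embeddings live in a shared space, uses cosine similarity as the matching score, performs a monotonic left-to-right greedy pass to obtain a gloss-like ordering, and applies a contrastive loss on the matched pairs. All function and variable names here are hypothetical.

```python
# Illustrative sketch of greedy visual-text token alignment (assumptions noted above).
import torch
import torch.nn.functional as F

def greedy_align(visual_feats: torch.Tensor, text_feats: torch.Tensor):
    """Greedily match each text token to a visual frame.

    visual_feats: (T_v, D) frame-level features from the visual encoder.
    text_feats:   (T_t, D) token embeddings of the spoken sentence.
    Returns one frame index per text token, chosen left to right so the
    alignment stays monotonic (a gloss-like ordering).
    """
    sim = F.normalize(text_feats, dim=-1) @ F.normalize(visual_feats, dim=-1).T  # (T_t, T_v)
    aligned, start = [], 0
    for t in range(text_feats.size(0)):
        # Only frames at or after the previous match are candidates.
        idx = start + sim[t, start:].argmax().item()
        aligned.append(idx)
        start = idx
    return aligned

def alignment_loss(visual_feats, text_feats, temperature: float = 0.07):
    """Contrastive loss pulling each text token toward its greedily matched
    frame and pushing it away from the remaining frames (an assumed objective)."""
    targets = torch.tensor(greedy_align(visual_feats, text_feats))
    logits = (F.normalize(text_feats, dim=-1)
              @ F.normalize(visual_feats, dim=-1).T) / temperature
    return F.cross_entropy(logits, targets)

# Toy usage: 50 frames and 12 text tokens in a 256-d shared space.
v = torch.randn(50, 256)
t = torch.randn(12, 256)
print(greedy_align(v, t))
print(alignment_loss(v, t))
```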