2026 · moment retrieval; zero-shot learning; cross-modal learning

Resilient Semantic Pseudo-Text Embedding for Zero-Shot Video Moment Retrieval

Zhang, Donglin and Shi, Weixiang and Wu, Xiao-Jun and Kittler, Josef

With the explosive growth of video data, video moment retrieval (VMR) has attracted increasing attention due to its ability to localize semantically relevant moments in untrimmed videos. However, existing VMR approaches usually rely on annotated video-text correspondences or temporal annotations, both of which require significant human effort and are costly to scale. Even worse, the inherent subjectivity in manual labeling often introduces inconsistencies into the training data, further complicating the issue. In this article, we investigate the problem of Zero-Shot Video Moment Retrieval (ZS-VMR) and develop a novel method, Resilient Semantic Pseudo-Text Modeling (RSPT). The core of RSPT is to construct semantically rich pseudo-text embeddings through visually guided perturbations. Specifically, RSPT first generates initial pseudo-texts by injecting random noise into visual features and then learns adaptive noise weights by modeling the correlations between these pseudo-texts and visual features. This enables the generation of diverse and semantically aligned representations from multiple perspectives. To ensure alignment with visual semantics and suppress irrelevant noise, RSPT introduces a quality-aware contrastive loss that regularizes the semantic boundaries of pseudo-texts. Extensive experiments on Charades-STA and ActivityNet-Captions show that RSPT outperforms existing competitive baselines, validating its efficacy. Code is available at .
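The abstract describes two concrete mechanisms: pseudo-text embeddings built by perturbing visual features with adaptively weighted noise, and a quality-aware contrastive loss that keeps pseudo-texts close to visual semantics. The sketch below is one plausible reading of that description, not the authors' released code; every name (PseudoTextGenerator, quality_aware_contrastive_loss, noise_scale, tau) and the exact form of the quality weighting are assumptions for illustration.

```python
# Hedged sketch of the abstract's two ideas in PyTorch. Not the RSPT
# implementation; module names, the noise-weighting head, and the quality
# score are hypothetical choices consistent with the abstract's wording.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoTextGenerator(nn.Module):
    """Builds pseudo-text embeddings by perturbing visual features with noise
    whose per-dimension weights are adapted to the visual/pseudo-text correlation."""

    def __init__(self, dim: int, noise_scale: float = 0.1):
        super().__init__()
        self.noise_scale = noise_scale
        # Maps the concatenated visual and pseudo-text features to noise weights.
        self.weight_head = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # 1) Initial pseudo-text: visual feature plus random Gaussian noise.
        noise = torch.randn_like(visual) * self.noise_scale
        init_pseudo = visual + noise
        # 2) Adaptive noise weights from the visual / pseudo-text pairing.
        corr = torch.cat([visual, init_pseudo], dim=-1)
        w = self.weight_head(corr)
        # 3) Re-perturb with the learned weights to get the final pseudo-text.
        return visual + w * noise


def quality_aware_contrastive_loss(pseudo: torch.Tensor,
                                    visual: torch.Tensor,
                                    tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over matched (pseudo-text, visual) pairs, down-weighting
    low-quality pseudo-texts by their similarity to the source visual feature."""
    p = F.normalize(pseudo, dim=-1)
    v = F.normalize(visual, dim=-1)
    logits = p @ v.t() / tau                          # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    quality = (p * v).sum(-1).clamp(min=0).detach()   # per-sample quality score
    return (quality * per_pair).mean()


# Usage example with random features standing in for video moment embeddings.
if __name__ == "__main__":
    gen = PseudoTextGenerator(dim=512)
    visual_feats = torch.randn(8, 512)
    pseudo_texts = gen(visual_feats)
    loss = quality_aware_contrastive_loss(pseudo_texts, visual_feats)
    loss.backward()
    print(loss.item())
```

The weighting of the InfoNCE term by a detached quality score is just one way to realize a "quality-aware" loss; the paper may define both the quality measure and the boundary regularization differently.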

Added 2026-04-21