2026 text-video retrieval;latent factors;granularity;computational efficiency.

Disentangled Concept Matching for Text-video Retrieval through Perception Imitation

Zhang, Wei and Jin, Peng and Zou, Chunyu and Peng, Han and Zhang, Ziyao and Chen, Jie and Gao, Wen

Text-video retrieval plays a pivotal role in cross-modal tasks, aiming to accurately match textual descriptions with the corresponding video content. Existing methods often employ fine-grained feature matching to improve retrieval accuracy, but such approaches consume extensive computational resources. Conversely, coarse-grained feature matching between entire sentences and videos is computationally efficient but may overlook the heterogeneous semantic concepts embedded within the data. To overcome these challenges, we develop the Disentangled Concept Matching (DCM) framework, designed to imitate human semantic perception. The framework uses Disentangled Representation Learning (DRL) to divide coarse-grained features into distinct semantic concepts represented as latent factors, effectively generating finer-grained features while reducing computational demands. To improve retrieval accuracy, we first propose the Composed Spatial-temporal Module (CSTM), which improves the quality of multimodal feature extraction. Using a branch-structured temporal modeling approach, CSTM enhances the DCM model’s comprehension of video content and temporal information, leading to the extraction of refined video features. Second, building on these refined features, we propose the Adaptive Pooling Module (APM), which measures the confidence of each latent factor match during concept decoupling. APM enhances the fidelity of text and video concepts, thereby further ensuring matching accuracy after decoupling. With CSTM and APM, DCM accurately matches latent factors in lower dimensions, achieving significant improvements in computing efficiency and retrieval performance.
Experiments on five standard datasets, namely MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that the DCM framework achieves state-of-the-art performance, with Recall@1 scores of 48.7\%, 25.6\%, 48.4\%, 45.0\%, and 48.6\%, respectively. Compared to our previous model, the DCM framework shows improvements of 2.54\%, 0.08\%, 2.11\%, 6.89\%, and 6.35\%, respectively.
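
The core idea of matching latent factors rather than whole embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how DRL forms the factors or how APM computes confidence. The sketch assumes that a coarse-grained embedding is split into k equal-sized chunks, that each text/video chunk pair is scored by cosine similarity, and that the confidence weights are a softmax over those per-factor similarities. The function names (split_factors, factor_similarity, retrieve) and the temperature parameter are hypothetical.

```python
import math


def split_factors(vec, k):
    """Split a flat embedding into k equal-sized chunks (assumed latent factors)."""
    if len(vec) % k != 0:
        raise ValueError("embedding length must be divisible by k")
    size = len(vec) // k
    return [vec[i * size:(i + 1) * size] for i in range(k)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def factor_similarity(text_vec, video_vec, k, temperature=0.1):
    """Confidence-weighted sum of per-factor cosine similarities.

    Assumption: the weights come from a softmax over the per-factor
    similarities themselves, standing in for a learned confidence module.
    """
    text_factors = split_factors(text_vec, k)
    video_factors = split_factors(video_vec, k)
    sims = [cosine(t, v) for t, v in zip(text_factors, video_factors)]
    weights = softmax(sims, temperature)
    return sum(w * s for w, s in zip(weights, sims))


def retrieve(text_vec, video_vecs, k):
    """Return the index of the video with the highest factor-wise score."""
    scores = [factor_similarity(text_vec, v, k) for v in video_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0, 1.0]
    videos = [
        [0.0, 1.0, 1.0, 0.0],  # mismatches on both factors
        [1.0, 0.0, 0.0, 1.0],  # matches on both factors
        [1.0, 0.0, 1.0, 0.0],  # matches on the first factor only
    ]
    print(retrieve(query, videos, k=2))
```

In this toy setting each factor comparison is a dot product over d/k dimensions, so scoring all k factors costs about the same as one coarse-grained comparison of dimension d. A retrieval counts toward Recall@1 when the top-ranked video is the ground-truth match for the query.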

Added 2026-04-21