2026 · person re-identification; vision-language learning; infrared person re-identification

Text-Guided Cross-Modal Alignment with Attribute and Contour Prototypes for Visible-Infrared Person Re-Identification

Yong Tao and Xinming Zhang

Visible-infrared person re-identification (VI-ReID) aims to match pedestrian images captured in the visible and infrared modalities, a task hampered by the significant domain discrepancy between the two. Existing approaches either synthesize cross-modal images or learn modality-invariant representations, yet they often suffer from semantic degradation or limited alignment capacity. Recent vision-language models leverage textual semantics to bridge the modalities; however, CLIP-based frameworks typically rely on learnable token proxies with limited expressiveness. In this article, we propose a semantic-driven framework that explicitly generates rich, modality-agnostic textual descriptions from images to serve as alignment cues. Specifically, we design a dual-branch Text Semantic Generation Module comprising (1) an Attribute-Aware Text Description Generation module that uses prompt-based templates to capture modality-invariant identity cues, and (2) a Contour-Aware Text Prompt module that supplies complementary structural information often missing from textual descriptions. To reconcile semantic heterogeneity, a Text Re-definition Module (TRM) fuses instance-level and class-level semantics into unified representations, enabling fine-grained alignment with image features. Furthermore, we construct category-level textual prototypes as global semantic anchors to enhance cross-modal consistency. Extensive experiments on two standard VI-ReID benchmarks demonstrate that our method achieves superior performance, validating the effectiveness of semantic-guided modality alignment.
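To make the prototype-anchoring idea concrete, below is a minimal PyTorch-style sketch, not the authors' code: it assumes attribute texts are built from a hypothetical prompt template, that instance-level text embeddings come from some CLIP-style text encoder (stubbed here as precomputed features), and it forms category-level prototypes by averaging embeddings per identity, then pulls image features from either modality toward their identity's prototype with an InfoNCE-style loss. The template fields, function names, and the temperature value are all illustrative assumptions.

```python
# Illustrative sketch of category-level textual prototypes as semantic anchors.
# Assumptions (not from the paper): prompt template fields, loss form, tau=0.07.
import torch
import torch.nn.functional as F

def build_attribute_prompt(attrs: dict) -> str:
    # Hypothetical attribute-aware template; the paper's templates are not shown.
    return (f"A {attrs['gender']} pedestrian wearing {attrs['upper']} "
            f"and {attrs['lower']}, carrying {attrs['carry']}.")

def class_prototypes(text_feats: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    # Average instance-level text embeddings per identity -> class-level prototypes.
    d = text_feats.size(1)
    protos = torch.zeros(num_classes, d, device=text_feats.device)
    counts = torch.zeros(num_classes, device=text_feats.device)
    protos.index_add_(0, labels, text_feats)
    counts.index_add_(0, labels, torch.ones_like(labels, dtype=torch.float))
    return F.normalize(protos / counts.clamp(min=1).unsqueeze(1), dim=1)

def prototype_alignment_loss(img_feats: torch.Tensor, protos: torch.Tensor,
                             labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # InfoNCE-style alignment: each image feature (visible or infrared) is
    # attracted to its identity's textual prototype, repelled from the rest.
    img_feats = F.normalize(img_feats, dim=1)
    logits = img_feats @ protos.t() / tau
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-ins for encoder outputs.
torch.manual_seed(0)
batch, dim, num_ids = 8, 512, 4
text_feats = F.normalize(torch.randn(batch, dim), dim=1)  # text-encoder outputs
img_feats = torch.randn(batch, dim)                       # image-encoder outputs
labels = torch.randint(0, num_ids, (batch,))
protos = class_prototypes(text_feats, labels, num_ids)
print(prototype_alignment_loss(img_feats, protos, labels).item())
```

Because the prototypes are computed from modality-agnostic text rather than from images, both visible and infrared features are drawn toward the same anchor per identity, which is one plausible way such anchors can enforce cross-modal consistency.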

Added 2026-04-21