Relationship-Experts Transformer for Image Captioning
Image captioning is a cross-modal text generation task that requires understanding the relationships among the objects in an image, and accurately modeling object–object relations remains a key bottleneck for transformer-based captioning. Prior methods usually inject semantic and geometric relations once and keep them fixed while only the visual features are updated, creating a mismatch (evolving visuals versus frozen relations) that weakens relational guidance and leads to feature entanglement. We propose the Relationship-Experts Transformer (RET), which treats semantic and geometric relations as learnable experts that guide the object visual features (the students) and co-evolve with them. First, we design the Relationship-Guided Feature Aggregation (RGFA) module, which is analogous to expert-guided student learning: the relationship kernel (the expert's knowledge) guides the learning of the object visual features (the students). Second, we develop the Experts Knowledge Updating (EKU) module, which continuously refines expert knowledge during training to strengthen the experts' guidance of the students. Finally, we design the Student Knowledge Selector (SKS) module, which, under the guidance of the semantic and geometric experts, adaptively selects object visual features enhanced with different relations to generate captions embodying both semantic and geometric knowledge. Experiments on the MSCOCO dataset demonstrate that our model achieves state-of-the-art performance. Code is available at .
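The expert/student metaphor above suggests attention in which a pairwise relation kernel biases how object features attend to one another. A minimal NumPy sketch under that assumption follows; the function and parameter names are illustrative, not the paper's actual implementation, and the relation kernel `R` stands in for either the semantic or the geometric expert:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_guided_attention(V, R, Wq, Wk, Wv):
    """One relation-guided aggregation step (RGFA-style sketch, hypothetical).

    V : (n, d) object visual features (the "students")
    R : (n, n) pairwise relation kernel (the "expert knowledge"),
               e.g. semantic or geometric relation scores
    Wq, Wk, Wv : (d, d) learnable projections
    """
    Q, K, Val = V @ Wq, V @ Wk, V @ Wv
    d = Q.shape[-1]
    # The relation kernel acts as an additive bias on the attention
    # logits, so the experts steer how students exchange information.
    logits = Q @ K.T / np.sqrt(d) + R
    A = softmax(logits, axis=-1)
    return A @ Val  # relation-enhanced object features, shape (n, d)

# Toy usage: 5 objects with 8-dimensional features.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))
R = rng.normal(size=(5, 5))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = relation_guided_attention(V, R, Wq, Wk, Wv)
```

In a full model, an EKU-style step would also update `R` from the evolved features each layer, so the experts co-evolve with the students rather than staying frozen.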
Added 2026-04-21