TextCoT: Zoom-In for Enhanced Multimodal Text-Rich Image Understanding
The advent of Large Multimodal Models (LMMs) has fueled extensive research into their sophisticated reasoning capabilities. However, for text-rich image understanding, existing methods struggle to fully exploit the potential of LMMs and to process high-resolution images effectively. To address this, we introduce TextCoT, a training-free Chain-of-Thought framework that improves text-rich image understanding by combining LMMs' captioning ability, for global context, with detailed local textual analysis. TextCoT comprises three stages: Global Context Generation, Macro-Scale Positioning, and Fine-Grained Visual Inspection, each contributing to the comprehensive understanding and precise information extraction needed for accurate question answering. Our method requires no additional training and offers immediate plug-and-play functionality. We demonstrate TextCoT's effectiveness and adaptability across various benchmarks. The source code is available at .
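The three stages above can be sketched as a simple prompting pipeline. This is a minimal illustration, not the paper's implementation: the `lmm(image, prompt)` callable, the `crop` helper, and all prompt strings are hypothetical stand-ins for whatever multimodal model and region-selection logic is actually used.

```python
def textcot_answer(image, question, lmm, crop):
    """Hedged sketch of the TextCoT stages with a hypothetical LMM interface."""
    # Stage 1: Global Context Generation -- caption the full image for context.
    caption = lmm(image, "Describe this image in detail.")
    # Stage 2: Macro-Scale Positioning -- locate the region relevant to the question.
    region = lmm(image, f"Context: {caption}\nQuestion: {question}\n"
                        "Return the bounding box of the relevant region.")
    # Stage 3: Fine-Grained Visual Inspection -- zoom into that region and answer.
    zoomed = crop(image, region)
    return lmm(zoomed, f"Context: {caption}\nQuestion: {question}")
```

In practice `lmm` would wrap an off-the-shelf multimodal model and `crop` an image library; because the pipeline only issues prompts, it stays training-free and plug-and-play, as the abstract describes.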