TextCoT: Zoom-In for Enhanced Multimodal Text-Rich Image Understanding
The advent of Large Multimodal Models (LMMs) has fueled extensive research into their sophisticated reasoning capabilities. However, for text-rich image understanding, existing methods struggle to fully exploit the potential of LMMs and to process high-resolution images effectively. To address this, we introduce TextCoT, a training-free Chain-of-Thought framework that improves text-rich image understanding by combining LMMs' captioning ability, for global context, with detailed local textual analysis. TextCoT comprises three stages: Global Context Generation, Macro-Scale Positioning, and Fine-Grained Visual Inspection, each contributing to the comprehensive understanding and precise information extraction needed for accurate question answering. Our method requires no additional training and offers immediate plug-and-play functionality. We demonstrate TextCoT's effectiveness and adaptability across various benchmarks. The source code is available at .
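The three stages above can be sketched as a simple prompting pipeline. This is a minimal illustration, not the paper's implementation: the `lmm(image, prompt)` callable, the `crop` helper, and all prompt strings are hypothetical stand-ins for whatever multimodal model and region-selection logic is actually used.

```python
def textcot_answer(image, question, lmm, crop):
    """Hedged sketch of the TextCoT stages with a hypothetical LMM interface."""
    # Stage 1: Global Context Generation -- caption the full image for context.
    caption = lmm(image, "Describe this image in detail.")
    # Stage 2: Macro-Scale Positioning -- locate the region relevant to the question.
    region = lmm(image, f"Context: {caption}\nQuestion: {question}\n"
                        "Return the bounding box of the relevant region.")
    # Stage 3: Fine-Grained Visual Inspection -- zoom into that region and answer.
    zoomed = crop(image, region)
    return lmm(zoomed, f"Context: {caption}\nQuestion: {question}")
```

In practice `lmm` would wrap an off-the-shelf multimodal model and `crop` an image library; because the pipeline only issues prompts, it stays training-free and plug-and-play, as the abstract describes.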