Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Abstract
Chain-of-Thought prompting in multimodal reasoning models degrades performance in visual spatial reasoning due to shortcut learning and hallucination of visual details from text alone.
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT)-based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
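To make the two effects described in the abstract concrete, below is a minimal sketch of how a CoT-versus-direct comparison and a no-image ablation could be scored. This is an illustration under assumptions, not the paper's evaluation harness: the `ask_model` callable, the prompt strings, and the substring-matching scorer are all hypothetical stand-ins.

```python
# Hypothetical sketch: score one prompting/ablation condition over a spatial QA set.
# `ask_model` is an assumed interface (prompt text, optional image path) -> answer string;
# the paper's actual prompts, models, and benchmarks are not reproduced here.
from typing import Callable, Optional

DIRECT_PROMPT = "Answer with only the option letter."
COT_PROMPT = "Think step by step, then give the option letter."

def evaluate(
    ask_model: Callable[[str, Optional[str]], str],
    questions: list[dict],   # each item: {"question": str, "image": str, "answer": str}
    use_cot: bool,
    use_image: bool,
) -> float:
    """Return accuracy under one condition (direct vs. CoT, with vs. without the image)."""
    instruction = COT_PROMPT if use_cot else DIRECT_PROMPT
    correct = 0
    for q in questions:
        prompt = f"{q['question']}\n{instruction}"
        image = q["image"] if use_image else None  # the no-image ablation withholds the image
        reply = ask_model(prompt, image)
        # naive scoring: count the item as correct if the gold answer appears in the reply
        correct += int(q["answer"].strip().lower() in reply.strip().lower())
    return correct / max(len(questions), 1)

# Comparing the four conditions surfaces the two reported effects: CoT degradation
# (with-image CoT vs. direct) and shortcut learning (above-chance accuracy even
# when the image is withheld).
```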
Community
This paper reveals a surprising finding: Chain-of-Thought reasoning actually hurts performance on visual spatial tasks, with a comprehensive evaluation of seventeen models across thirteen spatial benchmarks showing consistent degradation under CoT prompting. The "No-Image++" ablation exposes a deeper problem: models hallucinate visual details from textual priors rather than truly reasoning over images, making a strong case that text-only CoT is insufficient and that vision-centric reasoning paradigms are needed.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization (2026)
- PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment (2026)
- Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning (2026)
- PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues (2026)
- Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models (2026)
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs (2026)
- Learning Adaptive Reasoning Paths for Efficient Visual Reasoning (2026)