Beyond Words: New Study Reveals AI’s Surprising Strengths (and Failures) in Spatial Reasoning
Researchers from SenseTime Research and the S-Lab at Nanyang Technological University have presented a comprehensive technical report on the progress of multimodal models in spatial perception and reasoning. Their evaluation drew upon eight state-of-the-art benchmarks and consumed more than one billion tokens in testing.
To unify disparate experiments under a common standard, the authors proposed a framework of six fundamental spatial competencies: metric estimation, mental reconstruction, spatial relations, perspective shifting, deformation and assembly, and integrated reasoning.
This framework put evaluation on a consistent footing, allowing models to be compared on shared ground. Each category in the paper is accompanied by references to seminal concepts, including spatial intelligence, mental rotation, and chain-of-thought reasoning.
The study standardized system prompts, response templates, and metrics. For multiple-choice questions, the researchers employed Chance-Adjusted Accuracy (CAA) to offset the effect of random guessing. For numerical tasks, they introduced Mean Relative Accuracy (MRA), which credits a numeric answer according to how close it lands to the ground truth rather than demanding an exact match.
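To make the two metrics concrete, here is a minimal sketch of how such measures are commonly computed. The chance-correction formula, the tolerance grid, and the function names are illustrative assumptions, not the report's exact definitions.

```python
# Hedged sketch of the two metric families described above.
# The exact formulas in the report may differ; the chance correction and
# the tolerance grid below are assumptions for illustration.

def chance_adjusted_accuracy(raw_accuracy: float, num_options: int) -> float:
    """Rescale raw multiple-choice accuracy so that pure guessing scores 0
    and perfect accuracy scores 1 (standard chance correction, assumed here)."""
    chance = 1.0 / num_options
    return (raw_accuracy - chance) / (1.0 - chance)

def mean_relative_accuracy(pred: float, target: float,
                           thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                                       0.75, 0.80, 0.85, 0.90, 0.95)) -> float:
    """Score a numeric prediction by checking whether its relative error stays
    inside a sweep of tolerance bands, then averaging over the bands."""
    rel_error = abs(pred - target) / abs(target)
    return sum(rel_error < (1.0 - t) for t in thresholds) / len(thresholds)

# Example: 65% raw accuracy on 4-option questions, and a 4.2 m distance
# estimate against a 5.0 m ground truth.
print(chance_adjusted_accuracy(0.65, num_options=4))  # ~0.53
print(mean_relative_accuracy(4.2, 5.0))               # 0.7
```

Under this reading, a model that merely guesses on four-option questions scores zero CAA, and a numeric estimate earns partial credit that grows as it approaches the true value.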
Against its competitors, GPT-5 emerged as the clear leader. In subtasks involving distance estimation and the understanding of spatial arrangements, its performance reached near-human levels. The model significantly outperformed Gemini-2.5-Pro and the entire InternVL series. However, in more complex domains — such as mental object assembly, perspective transformations, or action simulations — the gap with human cognition remains substantial.
Intriguingly, in the most challenging cases, closed-source models like GPT-5 hold no decisive advantage over open-source rivals, marking these domains as particularly promising for the broader research community.
Special attention was given to the model’s modes of reasoning. The more “thinking tokens” the system expends, the more accurate its responses become — though only up to a threshold. Excessive deliberation often results in timeouts or truncated answers. The most balanced results were achieved with moderate reasoning depth.
The researchers also examined resilience to positional bias in multiple-choice testing. Under “strict circular reshuffling,” where the correct answer must be identified regardless of option placement, accuracy dropped sharply, highlighting a lingering sensitivity to positional effects.
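A plausible way to implement such a check is to re-ask each question with the answer options cyclically rotated and to count it as solved only if the model picks the correct answer under every arrangement. The sketch below follows that reading; the `ask_model` callable, the option labels, and the strict all-rotations criterion are assumptions rather than the authors' actual harness.

```python
# Hedged sketch of a strict circular-reshuffling check (assumed protocol):
# a question counts as solved only if the model picks the correct option
# under every cyclic rotation of the answer list.

from collections import deque
from typing import Callable, Sequence

LABELS = "ABCD"

def circular_strict_correct(question: str,
                            options: Sequence[str],
                            correct_option: str,
                            ask_model: Callable[[str], str]) -> bool:
    """Return True only if `correct_option` is chosen for every rotation."""
    rotated = deque(options)
    for _ in range(len(options)):
        prompt = question + "\n" + "\n".join(
            f"{LABELS[i]}. {opt}" for i, opt in enumerate(rotated)
        )
        chosen_label = ask_model(prompt).strip()      # e.g. "B"
        if chosen_label not in LABELS[:len(options)]:
            return False                              # malformed reply counts as a miss
        if rotated[LABELS.index(chosen_label)] != correct_option:
            return False
        rotated.rotate(1)                             # shift options for the next pass
    return True
```

Scoring this way strips out any benefit a model gains from favoring a particular answer slot, which is why accuracy falls when positional bias is present.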
Overall, the study underscores a pivotal shift: models now handle fundamental tasks requiring size estimation and spatial arrangement with confidence. Yet, when true three-dimensional imagination, mental reconstruction, and spatial logic are required, GPT-5 still falls short of human capability. Spatial intelligence thus remains one of the most formidable and fascinating frontiers for artificial intelligence.