Why LLMs Fail at Spatial Reasoning When Planning Navigation

Large language models generate fluent route descriptions but stumble when asked to plan those routes. Two papers posted on arXiv on May 29, 2026 give independent structural reasons why: the model’s internal representation of spatial search states is implicit at best, and when physical prediction gets hard, the model’s default move is to predict nothing happened. Both point to training-data distributions as the root cause, not prompt design, which means no amount of chain-of-thought scaffolding fixes the underlying gap.

LinTree (arXiv:2605.31492), submitted by Liwei Kang, tests LLMs across three reasoning environments, one of which is grid Navigation. The finding: giving a model raw access to its search history is not enough to reliably beat heuristic search. The reason is structural. When an LLM backtracks or switches branches during spatial planning, its reasoning trace does not explicitly mark which earlier search state it is revisiting. The trace is a flat sequence of tokens; the underlying search tree exists only implicitly in the model’s attention patterns.

This matters because navigation planning is tree-shaped. A working planner needs to know where it has been, which branches are exhausted, and which paths remain open. A language model emitting a linear sequence of “thought” tokens has no native mechanism for bookkeeping that tree structure. The model might luck into the right answer on simple grids, but as the branching factor grows, the implicit representation degrades.

The LinTree fix is straightforward: inject explicit parent pointers into the prompt so the model operates on a linearized tree rather than a flat history. Kang reports that in the Navigation environment this improved task performance and search efficiency over both implicit-reasoning models and LLM-heuristic-guided search.
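To make the idea of a linearized tree concrete, here is a minimal sketch in Python. It is not the paper's prompt format, which is not reproduced here. The `SearchNode` fields, the `linearize` and `path_to` helpers, and the one-line-per-node text layout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SearchNode:
    # All field names are hypothetical; the paper's schema may differ.
    node_id: int                 # position of this state in the linearized trace
    parent_id: Optional[int]     # explicit pointer to the state it was expanded from
    cell: Tuple[int, int]        # (row, col) on the grid
    status: str = "open"         # "open", "dead_end", or "goal"


def linearize(nodes: List[SearchNode]) -> str:
    """Render the search tree as one line per node, each naming its parent."""
    lines = []
    for n in nodes:
        parent = "root" if n.parent_id is None else f"#{n.parent_id}"
        lines.append(f"#{n.node_id} parent={parent} cell={n.cell} status={n.status}")
    return "\n".join(lines)


def path_to(nodes: List[SearchNode], node_id: int) -> List[Tuple[int, int]]:
    """Follow parent pointers back to the root to recover the path to a node."""
    by_id = {n.node_id: n for n in nodes}
    path = []
    current: Optional[int] = node_id
    while current is not None:
        node = by_id[current]
        path.append(node.cell)
        current = node.parent_id
    return path[::-1]


trace = [
    SearchNode(0, None, (0, 0)),
    SearchNode(1, 0, (0, 1), status="dead_end"),
    SearchNode(2, 0, (1, 0)),          # backtrack: expanded from #0, not from #1
    SearchNode(3, 2, (2, 0), status="goal"),
]
prompt_block = linearize(trace)
print(prompt_block)
```

In a flat token history, node #2 would simply follow node #1, and nothing would say that the model had returned to #0. With the pointer written out, the backtrack is part of the text the model reads, and the path to any state can be recovered mechanically.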

Stasis bias: when the model gives up

BilliardPhys-Bench (arXiv:2605.30900), submitted by Ben Wang, evaluates multimodal LLMs from the GPT, Claude, Gemini, and Qwen families on physical-reasoning tasks. Two findings are relevant here.

First, performance drops as simulation time increases and scene geometry grows more complex. This is unsurprising on its own, but the shape of the failure is not random. The paper documents a consistent failure mode it calls stasis bias: when the correct physical outcome is harder to infer, models tend to predict that no interaction occurs at all. The ball stays still. The collision never happens. The model defaults to “nothing changed.”

This is not an accuracy problem that averages out. It is a directional bias. A planner that systematically underestimates physical interaction will produce paths that assume the world is more static than it is. For embodied agents operating in dynamic environments (warehouses with moving obstacles, streets with pedestrians), stasis bias is a specific, dangerous failure mode, not a generic “the model got it wrong.”

Second, the paper frames this as evidence of missing physical inductive biases in multimodal architectures. The training data contains far more descriptions of static scenes than of precise physical dynamics, so the model’s prior favors stasis. This is the same mechanism the LinTree work exposes from a different angle: the model inherits its spatial and physical expectations from the distribution of its training text.

The training-data hypothesis

Both papers converge on the same explanation, though each states it in its own terms. The spatial and physical blind spots are not quirks of a particular architecture or a deficiency that more parameters will smooth away. They reflect what text looks like.

Written language describes space poorly compared to how spatial agents experience it. A training corpus heavy on Wikipedia articles and code comments contains millions of sentences about what things are and relatively few that precisely encode where things are relative to each other in a navigable coordinate system. The Wikipedia article on large language models itself notes that biased or inaccurate training data makes output less reliable, a general observation that applies with particular force to spatial relations.

OmniMatBench (arXiv:2605.29833) provides a converging data point from a different domain: the best multimodal LLM it tested scored 0.372 overall on expert-curated reasoning problems, with the paper citing “fixed reasoning heuristics and limited high-level knowledge application.” The domain is materials science, not navigation, but the pattern is the same: a model trained on general text hits a ceiling when the task demands domain-specific relational reasoning that the training distribution underrepresents.

What this means for embodied agents and routing systems

Teams wiring LLMs into embodied agents, warehouse robots, or routing planners need to treat the model’s spatial output as unreliable by default. Not because the model is bad at language, but because language is bad at space.

The practical implications are structural rather than incremental:

Do not expect prompt engineering to close the gap. If the spatial bias is baked into the training distribution, no amount of few-shot exemplars or chain-of-thought scaffolding creates a reliable spatial planner. The LinTree results support this: the only intervention that worked was changing the input representation (adding parent pointers), not changing the prompt instructions.
External geometric scaffolding is not optional. A separate spatial representation layer (a graph, a coordinate system, a physics simulator) needs to own the geometry. The LLM can operate on top of that structured input, but it cannot substitute for it.
Test for directional biases, not just accuracy. BilliardPhys-Bench’s stasis bias would not show up in an aggregate accuracy metric, which records how often the model is wrong but not in which direction. You need to specifically measure whether the model’s errors cluster in one direction (under-predicting interaction, over-predicting distance, systematically missing cardinal directions). An error budget that treats spatial failures as symmetric will miss the real risk.
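The last point can be sketched as a small evaluation helper. This is an illustration, not the benchmark's own metric: the function name `directional_error_report`, the boolean "interaction occurs" labeling, and the two toy models are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def directional_error_report(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """pairs holds (truth, prediction); True means 'an interaction occurs'."""
    counts: Counter = Counter()
    for truth, pred in pairs:
        if truth and not pred:
            counts["missed_interaction"] += 1    # stasis-type error
        elif pred and not truth:
            counts["phantom_interaction"] += 1   # error in the opposite direction
        else:
            counts["correct"] += 1
    total = sum(counts.values())
    errors = counts["missed_interaction"] + counts["phantom_interaction"]
    return {
        "accuracy": counts["correct"] / total if total else 0.0,
        # Share of all errors that are "predicted nothing happened".
        "stasis_share_of_errors": counts["missed_interaction"] / errors if errors else 0.0,
    }


# Toy data: five scenes with an interaction, five without.
truth = [True] * 5 + [False] * 5
model_a = [True, True, False, False, False] + [False] * 5               # only misses interactions
model_b = [True, True, True, False, False] + [True, False, False, False, False]  # mixed errors

report_a = directional_error_report(zip(truth, model_a))
report_b = directional_error_report(zip(truth, model_b))
print(report_a)
print(report_b)
```

Both toy models have the same aggregate accuracy, but every one of model A's errors is a missed interaction. A report that tracks only accuracy would rate the two as equivalent.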

The fix is structural, not linguistic

LinTree’s parent-pointer result is the most actionable finding here. By converting an implicit search history into an explicit tree structure that the model can read, the researchers got measurable improvements in both planning accuracy and search efficiency in the Navigation environment. The intervention does not require a different model, more parameters, or a longer prompt. It requires a different input format that makes the spatial structure legible to a system that processes tokens sequentially.

This is the pattern that teams should replicate. When the model’s training data lacks the representational density to support a task, the fix is to externalize the representation. For spatial planning, that means graph structures, coordinate transforms, and explicit state tracking. For physical prediction, it means coupling the LLM to a simulator rather than asking it to be one.
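One minimal way to externalize the representation is to let a grid own the geometry and check every model-proposed path against it before anything executes. The sketch below assumes a character grid where `#` is a wall and moves are single orthogonal steps. `validate_path` and the example paths are hypothetical stand-ins, not taken from either paper.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def validate_path(
    grid: Sequence[str], path: List[Cell], start: Cell, goal: Cell
) -> Tuple[bool, str]:
    """Check a proposed path against the grid; return (is_valid, reason)."""
    if not path or path[0] != start:
        return False, "path does not begin at start"
    if path[-1] != goal:
        return False, "path does not end at goal"
    rows, cols = len(grid), len(grid[0])
    for i, (r, c) in enumerate(path):
        if not (0 <= r < rows and 0 <= c < cols):
            return False, f"step {i} leaves the grid"
        if grid[r][c] == "#":
            return False, f"step {i} enters a wall"
        if i > 0:
            pr, pc = path[i - 1]
            if abs(r - pr) + abs(c - pc) != 1:
                return False, f"step {i} is not a single orthogonal move"
    return True, "ok"


GRID = [
    "..#",
    "..#",
    "...",
]
START, GOAL = (0, 0), (2, 2)

# Stand-ins for paths parsed from model output.
proposed_ok = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
proposed_bad = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]

print(validate_path(GRID, proposed_ok, START, GOAL))
print(validate_path(GRID, proposed_bad, START, GOAL))
```

The model still proposes the route, but the grid is the only authority on what is reachable. A rejected path comes back with a specific reason that can be fed into the next prompt.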

The two papers are less than a week old as of this writing. Neither has been through peer review. The direction they point is consistent and, for anyone building spatial agents, uncomfortable: the spatial reasoning gap is a property of the training-data distribution, and closing it requires external structure, not better prompts.

Frequently Asked Questions

Does the parent-pointer approach transfer to continuous domains like outdoor robotics?

LinTree validates parent pointers only on discrete grid Navigation, one of three tested environments, and does not report equivalent gains across the other two. Continuous domains like outdoor robotics introduce sensor noise, partial observability, and non-discrete state transitions that do not map to a linearized tree of nodes. Transferring the approach would require a fundamentally different state representation.

Is stasis bias a problem for text-only LLMs or only for multimodal models?

BilliardPhys-Bench evaluates multimodal LLMs exclusively, so stasis bias is confirmed only for models that process visual and physical-scene inputs alongside text. Whether a text-only LLM exhibits the same bias when reasoning about physics from verbal descriptions alone is untested. Teams choosing between a multimodal planner that ingests camera feeds and a text-only planner operating on symbolic scene descriptions face different risk profiles: the former has a documented directional bias, while the latter’s spatial error distribution is unknown.

What should teams benchmark LLM spatial planners against before trusting them in production?

LinTree’s finding that implicit-reasoning LLMs did not reliably beat heuristic search establishes a concrete floor. Before deploying an LLM-based spatial planner, teams should run it against A* search, potential-field planners, or sampling-based methods like RRT on domain-specific tasks. If the LLM does not clear that heuristic baseline, it adds latency and computational cost without accuracy gains.
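As a starting point for that comparison, here is a compact A* baseline on a four-connected grid with a Manhattan-distance heuristic. This is a sketch under stated assumptions: the grid encoding (`#` for walls), the `astar` function, and the `llm_path` stand-in are illustrative and are not drawn from the LinTree evaluation.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def astar(grid: Sequence[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest path on a 4-connected grid with unit step cost, or None."""
    def h(cell: Cell) -> int:
        # Manhattan distance is admissible for unit-cost orthogonal moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    frontier = [(h(start), 0, start)]          # entries are (f, g, cell)
    came_from: Dict[Cell, Optional[Cell]] = {start: None}
    cost: Dict[Cell, int] = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path: List[Cell] = []
            cur: Optional[Cell] = cell
            while cur is not None:             # walk parent links back to start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cell]:
            continue                           # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < cost.get(nxt, float("inf")):
                    cost[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None


GRID = [
    ".#.",
    ".#.",
    "...",
]
baseline = astar(GRID, (0, 0), (0, 2))

# Hypothetical stand-in for a path parsed from model output.
llm_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
extra_steps = (len(llm_path) - 1) - (len(baseline) - 1)
print(baseline, extra_steps)
```

Running the same task set through both planners gives a per-task gap in path length and success rate. That gap, rather than the model's standalone accuracy, is the number to look at.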

Would training LLMs on more physical-simulation data eliminate the spatial-reasoning gap?

OmniMatBench’s top score of 0.372 across 19 expert-curated materials-science subfields suggests that domain-specific training data alone does not close the reasoning-ability gap. The bottleneck is the model’s capacity to apply relational reasoning heuristics flexibly across unfamiliar problem structures, not merely exposure to domain text. Closing the spatial gap likely requires architectural changes such as explicit spatial reasoning modules or neurosymbolic hybrids rather than simply ingesting more simulation data.