Researchers at Percepta published a striking claim on March 11, 2026: they have embedded a working program interpreter directly inside transformer weights, capable of executing arbitrary C programs for millions of steps at over 30,000 tokens per second on a CPU. The mechanism—2D attention heads with logarithmic-time lookups—could eliminate a fundamental bottleneck in how AI agents perform deterministic computation.

What Does “Executing Programs Inside a Transformer” Actually Mean?

The standard mental model of an AI using tools to compute is: model generates code → runtime executes it externally → result returns to context. This round-trip adds latency (typically hundreds of milliseconds per call), breaks batch processing, and introduces cold-start penalties on serverless infrastructure.1

Percepta’s approach, described in their blog post “Can LLMs Be Computers?”, eliminates the external call entirely. Instead of treating computation as a separate tool, the transformer itself performs the execution—encoding a WebAssembly interpreter directly into its weights and running programs as a native inference pass.

The model converts arbitrary C code into tokens, then executes those tokens step by step, producing an execution trace that streams at over 30,000 tokens per second on commodity CPU hardware. The entire process remains internal to the forward pass.

The Attention Complexity Problem This Tries to Solve

To understand why the approach matters, start with what it replaces.

Standard transformer self-attention has O(n²) time and memory complexity in sequence length n.2 Every token attends to every other token. This is manageable for short contexts but becomes a wall for long ones—practical limits on current GPUs sit around 16K–32K tokens for training and 50K–100K for inference depending on model size.3
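To make the quadratic term visible, here is a minimal NumPy sketch of standard scaled dot-product attention. The shapes and values are illustrative; the point is that the score matrix is n × n, so time and memory grow quadratically with sequence length.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention (single head, no masking)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n) matrix -- the O(n^2) cost
    # Numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V             # every token's output mixes all n values

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)
```

Doubling n quadruples the size of `scores`, which is exactly the wall the article describes for long contexts.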

Now imagine executing a program step by step inside that context. Each execution step appends to the trace. After a million steps, the trace is a million tokens long. Under standard attention, each new lookup is a linear scan across the entire context—O(n) per step. With millions of steps, this degrades catastrophically.
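The cost of that linear scan is easy to see in a toy sketch. The trace format and register names below are hypothetical, but the lookup pattern — walk the whole history backwards to find a register's most recent value — is what standard attention effectively does over the context at every step.

```python
# Hypothetical execution trace: each step records (register, value).
trace = [("r1", 5), ("r2", 7), ("r1", 9), ("r3", 2)]

def last_value(trace, reg):
    """Naive lookup: scan the entire trace backwards -- O(n) per query.

    With a million-step trace, every single execution step pays this
    full scan, which is the degradation the article describes.
    """
    for r, v in reversed(trace):
        if r == reg:
            return v
    return None

print(last_value(trace, "r1"))  # 9 -- the most recent write wins
```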

The Percepta approach attacks this specific bottleneck. Their 2D attention heads convert those linear scans into logarithmic-time queries using convex hull optimization. Rather than scanning the full execution history to find a register’s last value, the lookup completes in O(log n) time regardless of trace length.

This is the precise sense in which the inference is “exponentially faster”—not that transformers generally became faster, but that the execution trace lookup complexity improved by an exponential factor relative to trace length. For programs with millions of steps, this difference is the gap between feasibility and impossibility.
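Percepta's convex hull mechanism is not public, so the following sketch is only an analogy for the complexity claim, not their implementation: if each register's writes are kept in step order, the most recent write at or before any step can be found by binary search in O(log n), regardless of total trace length.

```python
import bisect
from collections import defaultdict

# Hypothetical index: for each register, (step, value) pairs in step order.
index = defaultdict(list)

def record(step, reg, value):
    # Steps arrive in increasing order, so the list stays sorted.
    index[reg].append((step, value))

def last_value_before(reg, step):
    """Binary search for the most recent write at or before `step`.

    O(log n) per query -- the same asymptotic improvement the article
    attributes to the 2D attention heads, via a conventional data structure.
    """
    entries = index[reg]
    i = bisect.bisect_right(entries, (step, float("inf"))) - 1
    return entries[i][1] if i >= 0 else None

for s, (r, v) in enumerate([("r1", 5), ("r2", 7), ("r1", 9), ("r3", 2)]):
    record(s, r, v)

print(last_value_before("r1", 3))  # 9
print(last_value_before("r1", 1))  # 5
```

Going from a full scan to a binary search is the n versus log n gap the article calls "exponentially faster."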

How 2D Attention Heads Enable Logarithmic Execution

Standard attention heads operate in one dimension: sequence position. Each query matches against keys at every position, producing a softmax-weighted sum over values.

The Percepta architecture introduces a second dimension to the attention structure, enabling the model to index into program state—registers, stack frames, memory addresses—using geometric properties rather than exhaustive search. The convex hull optimization exploits the fact that for ordered data structures like a program’s execution stack, the relevant prior state is always bounded by a convex envelope in the 2D representation. This reduces the search space logarithmically.

The practical result is that the model can maintain correct execution state across millions of steps without the attention computation growing proportionally.

Comparison: Computation Approaches in Modern LLM Systems

| Approach | Execution Location | Latency Overhead | Batch-Friendly | Steps Supported | Auditability |
| --- | --- | --- | --- | --- | --- |
| External tool call (Python/REPL) | Sandboxed runtime | 100–500ms per call1 | No (breaks batching) | Unlimited | High |
| Code generation → run → return | External runtime | Medium (2+ roundtrips) | Partial | Unlimited | Medium |
| Chain-of-thought arithmetic | Inside context | None | Yes | Dozens4 | Low |
| Percepta internal execution | Inside transformer weights | None | Yes | Millions (claimed) | Medium |

The batch-friendliness column is where the Percepta approach has its strongest practical argument. Tool calling inherently interrupts the forward pass, preventing requests from being batched together efficiently. Embedding execution inside the model restores the continuous computation that batching requires. The Anyscale continuous batching research demonstrates that batching optimizations can deliver 10–23x throughput improvements on their own5—internal execution could amplify this by removing interruptions.
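A back-of-envelope comparison makes the latency side of this argument concrete. Every number below is an illustrative assumption: the per-call roundtrip is taken from the 100–500ms range cited above, the internal throughput is Percepta's claimed figure, and the tokens-per-step value is a guess.

```python
# Illustrative agent run: many small deterministic computations per task.
steps = 1000                 # tool invocations (or execution steps) per run
tool_roundtrip_s = 0.2       # ~200ms per external call (mid-range assumption)
tokens_per_step = 30         # assumed trace tokens emitted per execution step
internal_tps = 30_000        # claimed internal trace throughput (tokens/sec)

external = steps * tool_roundtrip_s                 # serial roundtrips add up
internal = steps * tokens_per_step / internal_tps   # one streaming forward pass
print(f"external: {external:.1f}s  internal: {internal:.1f}s")
```

Under these assumptions the external-tool loop spends 200 seconds on roundtrips alone, while the internal trace streams in about a second — and that is before counting the batching gains the Anyscale figures describe.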

The Theoretical Foundation: A Decade of Research Converging

This work doesn’t arrive in a vacuum. The question of whether transformers can perform arbitrary computation has been studied systematically since 2019.

A 2024 paper from Google DeepMind and the University of Alberta, “Autoregressive Large Language Models are Computationally Universal”, demonstrated that an unaided LLM can simulate a universal Turing machine using unbounded chain-of-thought reasoning. The catch: “purely theoretical… relies on several idealised assumptions.”6

A 2025 paper, “Ask, and it shall be given: On the Turing completeness of prompting”, showed that for any computable function, there exists a prompt making a finite-size transformer compute it—again a theoretical result.

“Chain of Thought Empowers Transformers to Solve Inherently Serial Problems” provides the most practically relevant prior work: CoT gives transformers the ability to perform inherently serial computation that they otherwise cannot. This directly underpins the execution trace approach—each step of program execution becomes a token in the chain.

What Percepta claims to add is the practical machinery to make this fast at scale.

Community Skepticism: What’s Actually Proven?

The Hacker News and Tildes discussions following the March 11 post surfaced real technical concerns that deserve treatment in any honest coverage.

The differentiability problem. The Percepta post claims the process “remains differentiable.” Community reviewers identified this as suspect—hard attention mechanisms of the kind required here are not differentiable with respect to keys and queries. The authors reportedly acknowledged their model isn’t truly differentiable, suggesting differentiable approximations “should work” without experimental evidence.7

No training details. The post describes an architecture that “compiles weights directly” rather than using gradient descent. This raises immediate questions: how do you define a loss function for partially correct program execution? What training data distribution enables this? The blog post does not address these questions.

External tools may win anyway. Several practitioners argued that even if internal execution works, external tools provide better observability, independent maintenance, access control, and state management. A well-maintained Python sandbox with efficient batching may beat a novel architecture that no one else can modify or audit.

Arithmetic remains hard for neural networks. A pointed observation in community discussion: “multiplying two 10-digit numbers takes forever” in neural networks. Specialized hardware and symbolic systems remain orders of magnitude more efficient for deterministic arithmetic. The question is whether the internal execution path is actually performing computation or memorizing patterns that look like computation.

Practical Implications for Practitioners

If even a fraction of the claims are reproducible, the implications are significant for specific workloads.

Agent systems with tight loops. Current agentic frameworks that invoke tools thousands of times—code interpreters, calculators, state machines—pay latency costs that stack. Internal execution would collapse these into single forward passes.

Interpretability. If a transformer executes programs with explicit register and stack representations, those intermediate states are visible inside the model. This offers more purchase for mechanistic interpretability than opaque weight activations.

World models and simulation. The Tildes discussion flagged simulation as a core application. A transformer that can execute physics rules or game logic internally could maintain long-horizon world states without context-window limitations.

What to watch for. The work becomes credible when: (a) a full technical paper with training procedures appears, (b) independent groups replicate the execution speed figures, and (c) the model demonstrates generalization—executing programs it was not specifically trained on.

The Bigger Picture: When Inference Becomes Computation

The 2017 attention mechanism paper—“Attention Is All You Need”8—introduced a primitive that turned out to be substrate for everything that followed. The comparison Percepta’s team implies is audacious, and probably premature.

But the underlying question is genuine and important: should AI systems call out to computation, or should computation be native to the inference process? The former is how the industry has built everything since 2020. The latter may be where a certain class of applications eventually settles—particularly as models are deployed at the edge with no reliable external runtime, or in latency-critical agentic loops where each tool call is too expensive.

The Percepta research, regardless of whether its specific claims hold, is a useful forcing function for the field to answer this question rigorously.


Frequently Asked Questions

Q: Does this make all transformer inference faster? A: No. The speedup is specific to programs executing inside the transformer over millions of steps—the logarithmic improvement applies to execution trace lookups, not to NLP or general language tasks.

Q: How is this different from giving a model a Python REPL tool? A: External tools break batching, add network/cold-start latency, and require a separate runtime. Internal execution runs within the forward pass, preserving batch throughput and eliminating roundtrip overhead—at the cost of architectural complexity and auditability.

Q: Can I use this today? A: Not as a production system. The work is currently a blog post with no public code release, no peer-reviewed paper, and no independent replication. It describes a research direction rather than a deployable architecture.

Q: What makes the 2D attention heads different from standard multi-head attention? A: Standard attention heads scan all positions in one dimension (sequence position). The 2D variant adds a second index dimension—program state structure—enabling convex hull optimization to reduce lookup complexity from O(n) to O(log n) in execution trace length.

Q: What’s the most credible reason to take this seriously? A: The batching efficiency argument. Tool calling interrupts continuous inference, and the overhead compounds at scale. A mechanism that keeps execution inside the model restores the throughput advantages that continuous batching delivers—and that’s a well-documented, real-world problem worth solving.



Footnotes

  1. Jha, Amartya. “How Poor Tool Calling Behavior Increases LLM Cost and Latency.” DEV Community, 2025.

  2. Duman Keles, F., et al. “On The Computational Complexity of Self-Attention.” PMLR, 2023.

  3. Brenndoerfer, Michael. “Quadratic Attention Bottleneck: Why Transformers Struggle with Long Sequences.” 2024.

  4. Feng, Guhao, et al. “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems.” OpenReview, 2024.

  5. Anyscale. “Achieve 23x LLM Inference Throughput & Reduce p50 Latency.” Anyscale Blog, 2023.

  6. Synced. “Unlocking Turing Completeness: How Large Language Models Achieve Universal Computation Without Assistance.” November 2024.

  7. Community discussion. “Executing programs inside transformers with exponentially faster inference.” Hacker News, March 2026.

  8. Vaswani, Ashish, et al. “Attention Is All You Need.” NeurIPS, 2017.
