LACE replaces the assumption that parallel reasoning threads should run in isolation with a 2-D lattice attention mechanism that lets them share intermediate results mid-inference. Yang Li et al.’s April 16 arXiv submission demonstrates accuracy gains of 3.4 to 10 points on math and reasoning benchmarks, but capturing those gains in production requires inference engines to abandon the independent-sequence batching model that vLLM, SGLang, and similar runtimes are built around[^1].
The Independent-Thread Assumption That vLLM and SGLang Are Built On
Modern inference engines optimize throughput by treating each parallel generation thread as an independent sequence. Batching kernels, attention memory layouts, and scheduling heuristics all assume that tokens generated in thread A never influence thread B before the final voting or ranking stage. This design choice maximizes GPU utilization and keeps memory access patterns regular, but it also hard-codes a specific theory of how reasoning should work: generate many candidates in isolation, then aggregate. LACE contradicts that theory directly.
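The contrast between the two designs is easiest to see in the attention mask. The sketch below illustrates the independence assumption as a block-diagonal causal mask, next to a hypothetical lattice-style mask in which a token may also attend to strictly earlier positions in sibling threads. The exact cross-thread visibility rule is an assumption for illustration, not the paper's specification.

```python
import numpy as np

def batched_causal_mask(num_threads: int, seq_len: int) -> np.ndarray:
    """Standard independent-sequence batching: each thread attends
    causally within itself and never to another thread."""
    n = num_threads * seq_len
    mask = np.zeros((n, n), dtype=bool)
    for t in range(num_threads):
        lo, hi = t * seq_len, (t + 1) * seq_len
        mask[lo:hi, lo:hi] = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return mask

def lattice_mask(num_threads: int, seq_len: int) -> np.ndarray:
    """Hypothetical lattice variant: causal along the token axis, but
    open across the width axis -- a token also sees tokens at strictly
    earlier positions in other threads."""
    n = num_threads * seq_len
    mask = np.zeros((n, n), dtype=bool)
    for qt in range(num_threads):
        for qp in range(seq_len):
            for kt in range(num_threads):
                for kp in range(seq_len):
                    visible = kp <= qp if kt == qt else kp < qp
                    mask[qt * seq_len + qp, kt * seq_len + kp] = visible
    return mask
```

Note that the lattice mask is no longer block-diagonal, which is precisely what breaks the per-sequence memory layouts that batching kernels assume.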
How LACE’s 2-D Lattice Attention Works
The paper generalizes standard 1-D causal attention by adding a width dimension, creating what the authors call 2-D Lattice Attention[^2]. Instead of encoding only token position, the model splits each head’s dimensions between token position (d_t) and block or thread index (d_b) using 3D RoPE. The result is that concurrent reasoning paths can observe and correct each other during inference rather than only at the end.
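The head-dimension split can be sketched as two rotary embeddings applied to disjoint slices of each head: the first d_t dims rotated by token position, the remaining d_b dims by thread index. This is a simplified two-axis reading of the paper's 3D RoPE; the actual partitioning and angle bases are the authors', and the function names here are illustrative.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angle table: one angle per (position, frequency) pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # shape (len(positions), dim // 2)

def rotate(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def lattice_rope(q: np.ndarray, token_pos: np.ndarray,
                 thread_idx: np.ndarray, d_t: int) -> np.ndarray:
    """Split the head dim: the first d_t dims encode token position,
    the remaining d_b dims encode the thread (block) index."""
    d_b = q.shape[-1] - d_t
    out = q.copy()
    out[..., :d_t] = rotate(q[..., :d_t], rope_angles(token_pos, d_t))
    out[..., d_t:] = rotate(q[..., d_t:], rope_angles(thread_idx, d_b))
    return out
```

Because the rotation is norm-preserving per pair, two tokens at the same position in different threads differ only in their d_b slice, which is how attention scores can distinguish same-thread from cross-thread context.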
The architecture preserves the standard causal-attention backbone but adds reshape operations and 3D RoPE that would require custom kernel support in existing inference engines[^2]. The authors implemented the system in bfloat16 on an RTX PRO 6000 Blackwell 97GB GPU[^2].
The Accuracy Gains (and Where They Vary)
The accuracy improvements are substantial but not uniform across benchmarks. On AIME 2024, LACE-4B scored 20.0% against 10.0% for Independent+Voting, a gain of 10.0 points[^2]. On AIME 2025 the margin narrowed to 16.7% versus 13.3%, a 3.4-point improvement[^2]. LiveBench showed a 5.0-point gain, with LACE-4B at 33.0% versus the baseline’s 28.0%[^2].
The Isolated Parallel baseline, which uses the same format but no lattice layers, collapsed to 0.0% after supervised fine-tuning and only reached 3.3-14.0% after reinforcement learning[^2]. The authors use this collapse to argue that the lattice architecture is essential, though it may also reflect that the synthetic training data is specifically designed for lattice attention[^2].
The Overhead: Latency, Memory, and Training Cost
The gains come with measurable costs. Parameter overhead is less than 11% of the original model; FLOPs overhead is under 1.3%[^2]. Step latency increases more sharply: 38.5% for four threads at 1.7B parameters and 31.2% at 4B parameters[^2]. Memory overhead for four threads falls between roughly 15% and 22%[^2]. A single GPU fits 128 threads at approximately 12.3 GB for the 1.7B model or 22.5 GB for the 4B model[^2].
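A quick back-of-envelope check puts the capacity figures in per-thread terms, which is the number a capacity planner actually needs. This is simple arithmetic on the paper's reported totals, nothing more:

```python
# Memory per resident thread when 128 lattice threads share one GPU,
# derived from the paper's reported totals (12.3 GB and 22.5 GB).
threads = 128
reported_gb = {"1.7B": 12.3, "4B": 22.5}
per_thread_mb = {model: gb * 1024 / threads for model, gb in reported_gb.items()}
print(per_thread_mb)  # roughly 98 MB (1.7B) and 180 MB (4B) per thread
```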
Training data requirements are small relative to typical pre-training: 800 supervised fine-tuning questions and 1,959 reinforcement learning questions for the 1.7B model; 2,409 SFT and 6,474 RL questions for the 4B model[^2].
Why Inference Frameworks Will Need to Care
If LACE or similar architectures gain traction, inference schedulers face a fundamental redesign. Current batching systems in vLLM and SGLang assume that sequences in a batch do not attend to each other. Lattice attention breaks that assumption by design. Supporting it would require new kernels that handle the 3D RoPE and reshape operations, along with scheduling logic that keeps related reasoning threads resident on the same GPU simultaneously so they can share state[^2].
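One concrete consequence for schedulers is that admission becomes all-or-nothing per group of related threads: admitting only some of a request's threads would leave cross-thread attention reading state that is not resident. The sketch below is a hypothetical scheduling rule illustrating that constraint; it is not an existing vLLM or SGLang API, and the names are invented.

```python
from dataclasses import dataclass

@dataclass
class LatticeGroup:
    """One request's parallel reasoning threads. Under lattice attention
    they must be co-resident on the same GPU, so the scheduler treats
    the group as an indivisible unit."""
    request_id: str
    num_threads: int

def admit_groups(pending: list[LatticeGroup], free_slots: int) -> list[LatticeGroup]:
    """All-or-nothing admission: a group is admitted only if every one
    of its threads fits. Contrast with today's per-sequence schedulers,
    which can admit and evict sequences individually."""
    admitted = []
    for group in pending:
        if group.num_threads <= free_slots:
            admitted.append(group)
            free_slots -= group.num_threads
    return admitted
```

Eviction and preemption would need the same group granularity, which is a larger change to continuous-batching schedulers than the kernel work alone suggests.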
The paper leaves this gap deliberately. It is a research artifact demonstrating what is possible, not a framework patch showing how to deploy it. For teams maintaining inference infrastructure, the open question is whether a 3-to-10-point accuracy improvement justifies the engineering cost of custom kernels and altered memory layouts.
Bottom Line for Teams Running Self-Hosted Inference
LACE proves that parallel reasoning threads can productively share state mid-inference, which challenges the throughput-first isolation model baked into modern inference engines. The accuracy gains are real on math benchmarks, the overhead is bounded and documented, and the training data requirements are modest. What is missing is the bridge to production: custom kernel support, framework integration, and a clear answer to whether the same lattice structure transfers to reasoning domains beyond the paper’s math-heavy evaluation set.
Frequently Asked Questions
Do I need to retrain my model to use LACE, or can I add it to an existing checkpoint?
LACE requires a three-stage training pipeline: continuous pre-training on multi-thread data, supervised fine-tuning with random thread shuffling, and reinforcement learning via Lattice GRPO with thread-aggregated rewards. It cannot be applied to an existing checkpoint without retraining.
How does LACE’s accuracy compare to simply running parallel threads and voting on the outputs?
On AIME 2024, LACE-4B scored 20.0% versus 10.0% for Independent+Voting, a 10.0-point gain. However, the Isolated Parallel baseline that uses the same format without lattice layers collapsed to 0.0% after supervised fine-tuning, suggesting the lattice structure is essential for the approach.
What infrastructure changes would be needed to run LACE in a vLLM or SGLang deployment?
Inference engines would need custom kernels to handle 3D RoPE and reshape operations, along with scheduling logic that keeps related reasoning threads resident on the same GPU so they can share state. This breaks the independent-sequence batching assumption both frameworks currently rely on.
What are the latency and memory overheads of running LACE with multiple threads?
Step latency increases by 38.5% for four threads at 1.7B parameters and 31.2% at 4B parameters. Memory overhead for four threads is roughly 15-22%, and a single GPU can fit 128 threads at approximately 12.3 GB for the 1.7B model or 22.5 GB for the 4B model.
Is LACE available as a production-ready library or plugin?
No. The paper is a research artifact with no framework integration or production-ready kernel implementation as of April 2026. The authors’ implementation runs on an RTX PRO 6000 Blackwell 97GB GPU using custom bfloat16 code.