
Does Tree-of-Thought Reasoning Scale to Billion-User Modeling?

ScaleToT distills tree-of-thought reasoning into a profile encoder that serves billion-user recommenders without per-user LLM calls, lifting LT30 by 6.738% in an A/B test.


Low-activity users are the segment every recommender quietly gives up on. ScaleToT, posted to arXiv on 2026-06-23, uses tree-of-thought reasoning to reconstruct their latent state, but runs it on only about 7% of the population and distills the result into a lightweight encoder that serves the remaining users with no LLM call. In a billion-scale advertising deployment it lifted 30-day lifetime value (LT30) by 6.738% in a randomized online A/B test. The cost claim, though, is comparative rather than absolute: the paper does not publish per-user latency, QPS, or GPU-hours.

What does ScaleToT actually do for low-activity users?

It ports tree-of-thought reasoning into recommender systems for the exact users where conventional rankers fail: those with little or no usable interaction history.

The problem is structural. Collaborative filtering needs a user’s past behavior to predict preferences, and low-activity users have little or none. According to the ScaleToT paper, an LLM can infer a latent user state from a static profile, but that reasoning becomes unreliable when the profile is sparse, and applying an LLM to billions of users is prohibitively expensive. Two constraints collide in the same place: the users who most need richer inference are the ones with the least signal, and the inference itself is too expensive to run across the whole population.

Tree-of-thought reasoning is the tool ScaleToT reaches for. Tree of Thoughts (Yao et al., NeurIPS 2023) generalizes chain-of-thought by exploring multiple reasoning paths with self-evaluation, lookahead, and backtracking, rather than committing to a single chain. The headline result from that work: on Game of 24, ToT lifted GPT-4 from 4% under chain-of-thought to 74%. Self-evaluation and lookahead matter here precisely because there is no ground-truth behavior to anchor on. Where a dense-history user can be modeled by counting what they clicked, a low-activity user has to be reasoned about: the model proposes plausible states, tests them against the thin profile, and keeps the ones that fit. ToT’s branching search is, in effect, that proposal-and-test pass done before any encoder ever serves.
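The proposal-and-test loop can be sketched as a breadth-limited tree search. This is a toy illustration, not the paper's procedure or the Yao et al. implementation: `propose` and `score` are hypothetical stand-ins for LLM calls, the profile and fit table are made up, and backtracking is omitted for brevity.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # a partial chain of inferred user-state hypotheses


def tree_of_thought(
    propose: Callable[[State], List[str]],
    score: Callable[[State], float],
    depth: int = 2,
    beam: int = 2,
) -> State:
    """Expand every kept chain, score the candidates, keep the best `beam`."""
    frontier: List[State] = [()]
    for _ in range(depth):
        candidates = [chain + (step,) for chain in frontier for step in propose(chain)]
        if not candidates:
            break
        # Self-evaluation step: rank candidates, then prune the weaker branches.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]


# Hypothetical thin static profile and a made-up fit table keyed on device.
PROFILE = {"age_band": "25-34", "device": "android", "region": "urban"}
FIT_BY_DEVICE = {
    "android": {"commuter": 0.8, "gamer": 0.6, "new_parent": 0.4, "student": 0.3},
}


def propose(chain: State) -> List[str]:
    """Stand-in for an LLM proposing next hypotheses not already in the chain."""
    options = ["commuter", "student", "gamer", "new_parent"]
    return [o for o in options if o not in chain]


def score(chain: State) -> float:
    """Stand-in for LLM self-evaluation: how well the chain fits the profile."""
    fit = FIT_BY_DEVICE[PROFILE["device"]]
    return sum(fit[s] for s in chain)


best_chain = tree_of_thought(propose, score)
```

With these made-up scores the search keeps the two best single hypotheses after the first level and extends only those, which is the pruning behavior the paragraph describes.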

How does it amortize reasoning cost across a billion users?

The architectural move is to run expensive ToT reasoning on roughly 7% of users and serve the other ~93% from a distilled encoder with no LLM inference.

The pipeline has three stages. First, ScaleToT constructs typed user-state chains using a bounded, entropy-guided tree-of-thought refinement procedure over a small LLM-processed subset. Second, those teacher-curated chains train a student model on static profiles through supervised fine-tuning and OSIPO, which the paper defines as Outcome-Driven Segment-Aware Implicit Reward Policy Optimization. Third, the student’s learned representations transfer to a lightweight profile encoder that serves the remaining users with no LLM inference at request time.
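The shape of that pipeline can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: `teacher_reason` is a hypothetical stand-in for the offline ToT pass, the student and encoder stages are collapsed into a single lookup table (the paper uses supervised fine-tuning and OSIPO, which are not reproduced here), and the user data is synthetic. The point is only that the expensive call is counted during training and never on the request path.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List

Profile = Dict[str, str]

llm_calls = 0  # counts hypothetical teacher invocations


def teacher_reason(profile: Profile) -> str:
    """Stand-in for offline ToT reasoning; a real teacher would call an LLM."""
    global llm_calls
    llm_calls += 1
    return "value_seeker" if profile["region"] == "urban" else "occasional"


def distill(subset: List[Profile]) -> Dict[str, str]:
    """Stages 2 and 3 collapsed: fit a lookup 'encoder' on teacher labels."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for p in subset:
        votes[p["region"]][teacher_reason(p)] += 1
    return {region: c.most_common(1)[0][0] for region, c in votes.items()}


def serve(encoder: Dict[str, str], profile: Profile) -> str:
    """Request-time path: a dictionary lookup, no teacher call."""
    return encoder.get(profile["region"], "unknown")


rng = random.Random(0)
users = [{"region": rng.choice(["urban", "rural"])} for _ in range(10_000)]
subset = rng.sample(users, k=732)  # 7.32% of the synthetic population
encoder = distill(subset)
calls_after_training = llm_calls
states = [serve(encoder, u) for u in users]  # every user served, no new calls
```

After `distill` runs, `llm_calls` stays fixed no matter how many users pass through `serve`, which is the decoupling the three stages are designed to achieve.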

The last step is the one that matters. The LLM and its multi-step reasoning exist only during training; at serving time, the encoder carries their distilled output. That is the only mechanism by which ToT reaches billion-user populations at all. ToTRL makes a related structural point: ToT’s advantage over chain-of-thought is partly that it evaluates multiple branches in parallel and prunes unproductive paths, which can cut the verbose introspection that makes trial-and-error CoT expensive in tokens. ScaleToT bets that the same pruning, once distilled, also shrinks the inference footprint at serving time.

What did the A/B test show?

In a billion-scale advertising deployment, ScaleToT lifted 30-day lifetime value (LT30) by 6.738% in a randomized online A/B test on a lifetime-value prediction task.

That is an industrial result, not a public benchmark. The ScaleToT paper reports the LT30 figure from a randomized online test in a live advertising system, which is a stronger signal than an offline metric but harder for an outside team to reproduce or audit. There is no second deployment, no public dataset, and no independent replication reported as of 2026-06-26. Treat 6.738% as a single-vendor, single-system number: evidence that the approach clears a production bar, not a transferable accuracy figure.

The framing matters too. LT30 is a forward-looking value estimate, so the test measured whether reconstructed states produced users worth more over the following month, not whether they clicked more in the session the model scored. A randomized online A/B test is the right evidence bar for that question, because offline metrics on lifetime value tend to be dominated by label leakage and selection effects. A 6.738% LT30 lift is consistent with the paper’s thesis that the missing signal lives in the low-activity tail. It does not, on its own, say anything about click-through rate, retention, or short-horizon ranking, where the distilled encoder competes against baselines that are already tuned.
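For readers less used to A/B reporting, a relative lift of this kind is conventionally the difference in arm means divided by the control mean. The sketch below assumes that standard definition; the paper's exact estimator is not described here, and the per-user values are made-up numbers chosen only to make the arithmetic easy to follow.

```python
from statistics import mean
from typing import Sequence


def relative_lift(control: Sequence[float], treatment: Sequence[float]) -> float:
    """Relative lift of the treatment arm's mean over the control arm's mean."""
    base = mean(control)
    if base == 0:
        raise ValueError("control mean is zero; relative lift is undefined")
    return (mean(treatment) - base) / base


# Hypothetical per-user 30-day values for two tiny arms (illustrative only).
control_lt30 = [0.0, 1.0, 2.0, 5.0]     # mean 2.0
treatment_lt30 = [0.0, 1.5, 2.0, 5.0]   # mean 2.125
lift = relative_lift(control_lt30, treatment_lt30)  # 0.0625, i.e. 6.25%
```

A real analysis would also report a confidence interval on the lift; none is assumed or computed here.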

Is the cost claim verifiable from the paper?

No. The paper’s cost claim is comparative, and the absolute throughput is not published.

According to the ScaleToT abstract, offline reasoning covered only 7.32% of the potential population, “greatly reducing compute cost compared with full-population reasoning.” That is a claim against a naive baseline of running the LLM’s ToT reasoning across every user. It is not a claim about per-user latency, queries per second, GPU-hours, or the encoder’s serve footprint once fleet replication and tail-latency budgets are included. None of those figures appear in the abstract.
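The comparative claim reduces to simple arithmetic. The sketch below assumes a uniform per-user reasoning cost and a round one-billion population, neither of which is a figure from the paper; it counts only teacher reasoning calls and deliberately ignores student training and encoder serving, which is exactly the part the abstract leaves unquantified.

```python
def reasoning_cost_ratio(coverage: float) -> float:
    """Full-population reasoning cost divided by subset reasoning cost.

    Assumes every user costs the same to reason over (an assumption,
    not a number reported by the paper).
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return 1.0 / coverage


population = 1_000_000_000   # hypothetical round figure for "billion-scale"
coverage = 0.0732            # offline reasoning share, from the abstract

reasoned_users = round(population * coverage)      # 73,200,000
encoder_only_users = population - reasoned_users   # 926,800,000
ratio = reasoning_cost_ratio(coverage)             # about 13.66
```

Under those assumptions the offline reasoning bill is roughly one fourteenth of the naive baseline, but that says nothing about absolute latency, QPS, or GPU-hours.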

This is the gap a team budgeting around the result has to close. An independent infrastructure analysis, Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective (arXiv:2506.04301), finds that multi-step and test-time reasoning improves accuracy with more compute but suffers rapidly diminishing returns, widening latency variance, and what it calls unsustainable infrastructure costs. Agents, it notes, often perform dozens of inference calls per request. That is the failure mode ScaleToT’s distillation has to escape. The offline reasoning pass is bounded by the 7.32% subset; the risk is any residual per-request reasoning that leaks into the serve path.

What does it mean for cold-start and sparse-signal ranking?

If the distillation holds under real serving load, ScaleToT is a template for applying expensive structured LLM reasoning to cold-start and sparse-signal users without paying per-user inference cost.

The pattern, not the specific encoder, is what is portable. Run structured reasoning where it is affordable, on a bounded subset; distill it into a cheap model; serve the cheap model everywhere. That decouples the cost of multi-step inference from the size of the user base, which is the coupling that has kept LLM reasoning out of recommender systems in the first place. For cold-start users specifically, where collaborative filtering has no signal and content-based rankers thin out, a profile encoder trained on reconstructed states is a structurally different lever than another ranking feature bolted onto an existing model.

Most published ToT work stays in the reasoning-benchmark lane, on puzzles like Game of 24 or creative-writing scoring, or on cultivating ToT behavior inside a single model via reinforcement learning, as ToTRL does. ScaleToT is unusual in moving ToT into an industrial user-modeling stack. There the binding constraint is fleet-wide affordability, not isolated reasoning quality, and the distillation pattern is what makes that translation possible. It is the part most worth lifting.

The structural caveat comes from the same infrastructure work. arXiv:2506.04301 flags diminishing returns and widening latency variance for naive multi-step reasoning, and ToTRL notes that even well-behaved ToT only “potentially” reduces token cost versus chain-of-thought. ScaleToT sidesteps the per-request version of that problem by distilling, but the claim that it does so cheaply rests on the 7.32% subset figure and an unverified serving footprint. The honest read is conditional: if the encoder’s representations survive the transfer to a live fleet, the cold-start cost problem looks solvable; if they degrade, the system is back to paying for LLM reasoning on the users who can least justify it.

Frequently Asked Questions

How is ScaleToT’s use of tree-of-thought different from ToTRL’s?

ToTRL cultivates tree-of-thought behavior inside a single reasoning model, Qwen3-8B, using reinforcement learning on puzzle tasks, so the model itself reasons better at inference. ScaleToT does the opposite: it runs ToT once, offline, over a bounded subset, then discards the reasoning model and serves a distilled encoder that never branches at request time.

Does the distillation pattern help users who already have rich interaction histories?

The gain is concentrated where it is designed to be. OSIPO is segment-aware, so its reward policy is tuned per user segment and the lift lands in the low-activity segments rather than in users whose histories already give standard collaborative filters enough signal. For high-activity users, reconstruction adds overhead with little missing signal left to recover.

What happens if the student-to-encoder transfer degrades under serving load?

Because OSIPO is outcome-driven, rewarded on 30-day lifetime value rather than clicks, a degraded encoder can still emit scores that look internally consistent while carrying no usable signal, and there is no per-session ground truth to catch it. The team typically learns the transfer failed only when LT30 regresses, roughly a month after the encoder went live.

What input does ScaleToT need even for users with zero interaction history?

A populated static profile, not a bare identifier. The teacher’s entropy-guided ToT refinement reasons over profile features, so even zero-history users must carry attributes for the encoder to reconstruct a state. Populations with large anonymous or signed-out segments that collect no profile attributes fall outside what the pattern can serve.

Has the 6.738% LT30 lift been replicated in a second deployment?

No replication exists as of 2026-06-26; the figure comes from a single vendor’s single advertising system. The more useful extrapolation is that a second team’s LT30 delta should track the share of low-activity users in its own population, because the encoder was distilled for the cold-start segment, so fleets with fewer sparse-signal users have less tail to recover and should expect a smaller lift.
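That extrapolation can be made concrete with a back-of-envelope model. The sketch assumes the lift occurs only inside the low-activity segment and weights that segment by its share of baseline LT30; user share is only a proxy for that value share. Both inputs and the 20% segment lift are hypothetical, not figures from the paper.

```python
def blended_lift(value_share_low_activity: float, segment_lift: float) -> float:
    """Fleet-level relative lift if only the low-activity segment improves.

    value_share_low_activity: fraction of baseline LT30 contributed by
        low-activity users (hypothetical input, in [0, 1]).
    segment_lift: relative lift inside that segment (hypothetical).
    """
    if not 0.0 <= value_share_low_activity <= 1.0:
        raise ValueError("share must be in [0, 1]")
    return value_share_low_activity * segment_lift


# Same hypothetical 20% in-segment lift, three different fleet compositions.
for share in (0.10, 0.25, 0.50):
    print(f"share={share:.2f} -> fleet lift={blended_lift(share, 0.20):.3f}")
```

The fleet-level number scales linearly with the segment's weight, so a team with a thin low-activity tail should budget for a proportionally smaller delta.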
