Meituan's General 365 Benchmark: Top Models All Score Under 63%

On June 21, 2026, Meituan’s LongCat team released General 365, a manually curated reasoning benchmark. Across the 26 frontier models it evaluated, the leader, Gemini 3 Pro, reaches only 62.8% accuracy. The vast majority of the field failed to clear the benchmark’s self-defined 60% passing line, a result that sits well below what saturated public leaderboards have conditioned buyers to expect from the same class of model.

That a brand-new eval leaves the entire frontier in the low 60s and below, instead of the comfortable range these models post on established reasoning evals, is the headline. The mechanism behind the deflation is the part worth reading closely.

How General 365 isolates reasoning from memorization

General 365’s defining design choice is restricting the required knowledge to K-12 scope, so the benchmark scores reasoning rather than how much a model has memorized. That single restriction is the main reason the numbers come in low.

The benchmark ships with 365 manually crafted, deliberately diverse seed problems, expanded into 1,095 variants by altering surface semantics or constraints while preserving the core reasoning skill being tested. The seeds span 8 challenging categories, detailed in Section 2.1 of the paper. “General knowledge” is explicitly defined as common sense, fundamental linguistics, and basic subject matter, with university-level academic knowledge excluded. The stated goal is to decouple a model’s reasoning capability from its knowledge dependence, which in turn reduces the advantage that pure memorization confers on saturated evals.

That framing is the methodological lever worth paying attention to. Most reasoning benchmarks quietly conflate two questions: can the model reason, and does the model already know the relevant domain fact. When a benchmark is old and widely crawled, the second question dominates and scores inflate, because retrieval is being graded as if it were inference. General 365 takes the second question off the table by keeping the required knowledge trivial, so the reasoning has to carry the entire load.

The variant expansion reinforces the same point from the other direction. Expanding 365 seeds into 1,095 items by varying surface form while holding the underlying reasoning fixed is a direct counter to pattern-matching. A model that has memorized a specific solution template for one phrasing has to re-derive it when the phrasing shifts, which is closer to what “reasoning” is supposed to mean than reproducing a known answer.
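One way to make use of the seed-and-variant structure is to score per seed as well as per item. The sketch below is illustrative only: it assumes each result carries a seed identifier (the field names are hypothetical), and it is not presented as the metric the LongCat team reports. Note that 1,095 / 365 works out to an average of three items per seed; whether every seed has exactly three is an assumption.

```python
from collections import defaultdict

def seed_consistency(results):
    """Compute item-level and strict seed-level accuracy.

    results: iterable of (seed_id, is_correct) pairs, one per variant.
    The (seed_id, is_correct) layout is a hypothetical format for this sketch.
    """
    by_seed = defaultdict(list)
    for seed_id, is_correct in results:
        by_seed[seed_id].append(bool(is_correct))

    total_items = sum(len(v) for v in by_seed.values())
    # Item-level accuracy: every variant counts once.
    item_accuracy = sum(sum(v) for v in by_seed.values()) / total_items
    # Strict accuracy: a seed counts only if all of its variants are correct.
    strict_accuracy = sum(all(v) for v in by_seed.values()) / len(by_seed)
    return item_accuracy, strict_accuracy

# Toy data, invented for illustration: two seeds, three variants each.
results = [
    ("seed-001", True), ("seed-001", True), ("seed-001", False),
    ("seed-002", True), ("seed-002", True), ("seed-002", True),
]
item_acc, strict_acc = seed_consistency(results)
print(f"item-level: {item_acc:.3f}, strict seed-level: {strict_acc:.3f}")
```

A large gap between the two numbers would suggest a model solves some phrasings of a problem but not others, which is the pattern-matching behavior the variants are meant to expose.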

Who passed the 60% line?

Only Gemini 3 Pro cleared 60%, at 62.8%; the rest of the 26-model field fell below the passing threshold the LongCat team set for the benchmark.

That 60% line is the team’s own designation, not an externally agreed standard, and it is doing real rhetorical work in the writeup. Framing the result as “most models fail” depends entirely on where you draw the line, and a benchmark author gets to draw it wherever it flatters the story. Strip the “pass/fail” framing out, though, and the underlying data is still striking: the strongest model in the field sits less than three points above the self-declared pass mark, and most of the field sits below it. On a benchmark whose knowledge ceiling is K-12, that is a pointed result about reasoning, not recall.

Per-model scores for the remaining 25 models could not be confirmed against the primary report, and the public leaderboard does not yet surface reproducible figures for them. The rank order behind Gemini 3 Pro is, for now, Meituan’s own claim.

The compressed spread is still the more durable observation. A benchmark where the top model sits at 62.8% and most of the field falls below 60% is telling you something different from a benchmark where the top model clears 90% and tenth place trails by 30 points.
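A rough sense of how much sampling noise sits behind a score like 62.8% helps when reading a spread this tight. The sketch below computes a Wilson score interval. It assumes, without confirmation from the report, that the headline figure is computed over all 1,095 items and that items are independent. Variants of the same seed are almost certainly correlated, so the true uncertainty is probably wider than this.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)
    ) / denom
    return center - half_width, center + half_width

# Assumption: 1,095 scored items; the reported accuracy is from the article.
low, high = wilson_interval(0.628, 1095)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval spans roughly three points either side of the point estimate, which is about the same size as the leader’s margin over the 60% line.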

Why a fresh benchmark deflates every score

A newly built, held-out eval almost always scores models lower than the public leaderboards do, and General 365 is a textbook case of that effect working as intended.

The contamination control is concrete rather than rhetorical. Meituan released only half of the total questions publicly and kept the remaining half as a held-out test set specifically to track contamination creeping into the open-source portion. That is the right instinct. Public benchmarks degrade as their items get scraped into pretraining and fine-tuning corpora; a held-out slice is the cheapest credible defense, and it is the reason a fresh eval’s scores are more informative than a stale one’s even when the older benchmark is nominally harder. General 365 even builds in a way to detect its own decay, which is more than most public evals offer.
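The report does not spell out how the held-out half is used to flag contamination, but a plausible check is to compare a model’s accuracy on the public and held-out halves and ask whether the gap is larger than chance would explain. The sketch below uses a standard two-proportion z-test. The function name and all the counts are invented for illustration.

```python
import math

def contamination_gap(public_correct, public_total,
                      heldout_correct, heldout_total):
    """Return (accuracy gap, z statistic) for public vs. held-out accuracy.

    A large positive z means the model does better on the public half
    than sampling noise alone would suggest.
    """
    p_pub = public_correct / public_total
    p_held = heldout_correct / heldout_total
    pooled = (public_correct + heldout_correct) / (public_total + heldout_total)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / public_total + 1 / heldout_total))
    z = (p_pub - p_held) / se if se > 0 else 0.0
    return p_pub - p_held, z

# Hypothetical counts: 80% on the public half, 65% on the held-out half.
gap, z = contamination_gap(400, 500, 325, 500)
print(f"gap: {gap:.3f}, z: {z:.2f}")
```

A gap that stays near zero across model releases would support the claim that the public half is still clean. A gap that grows over time is the decay signal the held-out design is meant to surface.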

The wider context is how saturated the leaderboard market has become. Third-party aggregators now track hundreds of frontier models across hundreds of public benchmarks, and when a field that crowded trains against the same few evals, those evals stop measuring capability and start measuring exposure to those specific evals. General 365’s compressed spread, with the entire frontier packed into a narrow band, is exactly what you expect to see when you strip the memorization advantage back out.

The deflation is not a bug in the models. It is a correction in the ruler, and a reminder that a leaderboard’s age and exposure level are load-bearing parts of any score it reports.

Are these numbers trustworthy?

General 365 is self-reported by the benchmark’s own creators, so every number is a vendor-side claim until a third party re-runs it. That caveat carries more weight than usual here because the benchmark comes from Meituan’s own LongCat team, which has a natural interest in results that make a new evaluation look necessary.

Two things work in favor of credibility. First, the scoring pipeline is engineered for reproducibility: a hybrid rule-based and model-based scoring algorithm with a manually verified scoring accuracy of 99.6%. The dataset and code ship publicly, so nothing about the method is secret; anyone with API access can re-run the public half. Second, the held-out contamination set gives the creators a mechanism to detect, after release, whether the public half is leaking into training data, which is a structural safeguard most public benchmarks lack.
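To make “hybrid rule-based and model-based scoring” concrete, here is a minimal sketch of the general pattern: try cheap deterministic rules first, and fall back to a model judge only when the rules cannot decide. This is not the LongCat team’s pipeline. Every function name is hypothetical, and the `judge` callable stands in for whatever LLM grader a real harness would call.

```python
import re
from typing import Callable, Optional

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return re.sub(r"\s+", " ", text.strip().lower()).rstrip(".")

def rule_based_match(prediction: str, reference: str) -> Optional[bool]:
    """Return True/False when a rule can decide, None when it cannot."""
    pred, ref = normalize(prediction), normalize(reference)
    if pred == ref:
        return True
    try:
        # Both parse as numbers: compare numerically ("42.0" vs "42").
        return abs(float(pred) - float(ref)) < 1e-9
    except ValueError:
        return None  # Not a pure numeric pair; defer to the judge.

def hybrid_score(prediction: str, reference: str,
                 judge: Callable[[str, str], bool]) -> bool:
    verdict = rule_based_match(prediction, reference)
    return verdict if verdict is not None else judge(prediction, reference)

# Stand-in judge for illustration only; a real one would call a model.
def toy_judge(prediction: str, reference: str) -> bool:
    return reference.lower() in prediction.lower()

print(hybrid_score("42.0", "42", toy_judge))
print(hybrid_score("The answer is Paris", "paris", toy_judge))
```

The design trade-off is that rules are reproducible but brittle, while a model judge handles free-form answers at the cost of its own error rate, which is why a manually verified accuracy figure for the scorer matters.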

The honest assessment is that General 365 looks like a well-built benchmark from a credible industrial lab, and well-built benchmarks from credible labs still deserve independent confirmation before their rank order gets quoted in a procurement document. The 99.6% scoring-accuracy figure, in particular, is self-verified by the same team that produced the scores it is meant to validate, which is the kind of circularity a neutral reproducer exists to break.

What does this mean for procurement?

For teams buying models, the lesson is structural rather than model-specific: never anchor a decision on a single headline score, and weight fresh, contamination-guarded evals more heavily than saturated public leaderboards, especially when the spread between the top models collapses to a few points.

A compressed spread is itself the actionable signal. When the top of the field lands within a handful of points on a clean eval, reasoning quality is no longer a useful differentiator at the frontier for this category of task. Procurement should pivot to the dimensions that actually separate these models: latency, price per million tokens, context handling, tool-use reliability, and the legal and operational profile of the provider. A benchmark gap of just a few points should not survive contact with a pricing sheet.

The deeper point General 365 makes, intentionally or not, is that the public reasoning-leaderboard complex has been quietly overcrediting the field. When the same models that dominate saturated leaderboards drop into the low 60s under K-12-only knowledge and a held-out test set, the gap between those two numbers is a rough indication of how much of the old score was reasoning and how much was recall; rough, because the two evals also differ in difficulty. For anyone whose job is to pick a model on the basis of what it can actually do rather than what it has already seen, that gap is the most useful number in the whole release, and it stays useful long after the specific June 2026 rank order has gone stale.

Frequently Asked Questions

How does General 365’s narrow result band compare to the spread on public aggregators like BenchLM?

BenchLM currently tracks 238 models across 223 benchmarks, and on those saturated public evals the top three providers, Anthropic, OpenAI, and Google, separate by margins wide enough to drive vendor choice. By Meituan’s own reported figures, which have not been independently confirmed, General 365 compresses the entire 26-model frontier into roughly the 58 to 63 percent band, which is what makes its rank order uncomfortable rather than confirmatory for anyone reading both leaderboards side by side.

What breaks if the held-out half of General 365 itself leaks into training data?

The held-out set only flags contamination in the public half; nothing in the design re-checks the held-out items once they have served that detection purpose. General 365 therefore inherits the same finite shelf life as the saturated benchmarks it criticizes, and its contamination signal weakens each time the held-out slice is reused to validate a new model release.

Can an internal team reproduce the full General 365 leaderboard today?

Only half of the items ship on GitHub, so a third-party re-run can score models on the released portion but cannot reproduce the exact headline figures, which are computed over the combined public and held-out sets. Matching the reported 62.8 percent for Gemini 3 Pro specifically requires the unreleased test set that Meituan retains.

Does General 365 measure coding ability the way SWE-bench or HumanEval do?

No. General 365 restricts required knowledge to K-12 scope, which excludes the production-codebase and framework-specific context that SWE-bench evaluates against, so its categories capture general reasoning rather than software-engineering proficiency. Teams screening models for coding tasks should treat General 365 as a reasoning sanity check rather than a readiness signal for repository-level work.

How long can the K-12 scope stay discriminating as reasoning models improve?

The K-12 knowledge ceiling caps how hard any single item can get without crossing into university-level material, so as reasoning models climb toward the high 90s on the current seeds, General 365 will need fresh, harder problems to stay discriminating. Adding those without reintroducing knowledge dependence is the structural tension the design sets up, and it is the most likely reason a future v2 would drift away from a strict K-12 line.