A model that leads SWE-bench or LiveCodeBench can still fail to produce a valid parametric CAD program, and CADBench is the first unified multimodal benchmark to make that gap measurable. Posted as v2 on arXiv on June 18, 2026, it evaluates eleven systems across 18,000 samples and concludes that code-generating vision-language models are “far from reliable CAD program reconstruction,” even under idealized inputs.
What does CADBench actually test?
CADBench tests whether a model can emit a parametric CAD program from a 3D input: the editable history of sketches, operations, and constraints that builds the part, not just the finished solid. That distinction is the whole point of the benchmark. A model that outputs a watertight mesh has produced geometry; a model that outputs a parametric program has produced something an engineer can revise, re-parameterize, and feed into a downstream manufacturing or simulation process. CADBench scores the latter, and the difference is where most of the difficulty lives. Reproducing a shape is an inverse-rendering problem; reconstructing the sequence of design intent behind it, in a form the CAD kernel will accept and a human will later edit, is a program-synthesis problem layered on top of a geometry problem.
The benchmark assembles 18,000 evaluation samples across six families built on five source datasets: DeepCAD, Fusion 360, ABC, MCB, and Objaverse. Samples deliberately span five input modalities, because real CAD ingestion rarely starts from a clean artifact:
- clean meshes
- noisy meshes
- single-view renders
- photorealistic renders
- multi-view renders
Six metrics cover three dimensions. Geometric fidelity asks whether the output matches the target shape. Executability asks whether the program actually runs and produces a valid solid rather than a kernel error. Program compactness asks whether the reconstruction is the minimal, editable program or a bloated approximation. That matters because a model can reproduce a part faithfully while emitting a degenerate program that lists every face as a separate operation instead of the original sketch, extrude, and fillet sequence. STEP-based sample families are additionally stratified by B-rep face count so results can be sliced by part complexity, and every family is diversity-sampled to keep objects from clustering on easy cases.
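The compactness point can be made concrete with a toy example. The sketch below is a minimal illustration, not CADBench's program format or its compactness metric: the `Op` record, the operation names, the face count, and the `compactness_ratio` proxy are all assumptions chosen for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One step in a hypothetical parametric history (not CADBench's format)."""
    kind: str            # e.g. "sketch", "extrude", "fillet", "face"
    params: tuple = ()   # illustrative parameters only


# The designer's history: three editable operations.
reference = [
    Op("sketch", ("rect", 40.0, 20.0)),
    Op("extrude", (10.0,)),
    Op("fillet", (2.0,)),
]

# A degenerate reconstruction: one operation per face of the finished solid.
# The face count is made up for illustration.
degenerate = [Op("face", (i,)) for i in range(14)]


def compactness_ratio(candidate, reference):
    """Illustrative proxy: candidate length relative to reference length.

    1.0 means the same number of operations; larger values mean more bloat.
    """
    return len(candidate) / len(reference)


print(compactness_ratio(reference, reference))
print(compactness_ratio(degenerate, reference))
```

Both programs could describe the same solid, so a fidelity score alone would not separate them. Only the three-step history lets an engineer change the fillet radius by editing one number.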
Why doesn’t a strong code benchmark predict CAD ability?
General coding benchmarks like SWE-bench, LiveCodeBench, and HumanEval measure software-engineering competence inside textual codebases: patch a bug, implement a function, resolve a failing test. CAD program generation is structurally different. The output has to describe a closed, manufacturable solid, respect sketch constraints, and preserve topology when a fillet, shell, or boolean operation lands on it. Valid syntax is the floor, not the ceiling.
The deeper mismatch is in the oracle. SWE-bench has a single, discrete check: does the test suite pass. CAD program generation has no equivalent. Executability, geometric fidelity, and program compactness can all disagree on the same output, which is why CADBench needs six metrics where SWE-bench needs one. A program that runs can still describe the wrong shape; a shape that matches can still come from a bloated, unmaintainable program; a compact program can fail to execute. None of these failures is caught by a line-level test harness, because none of them is a code-correctness failure in the sense a software benchmark measures.
A model that writes clean Python has been tested on none of the geometry. Whether a sketch loop closes, whether a dimension over-constrains the part, whether a pad references a face that a later feature deletes: these are geometric and topological failures. The correlation between “writes good code” and “writes good CAD programs” is an assumption teams make by default, not something a coding benchmark measures. The practical consequence is that a procurement decision which sorts coding assistants by SWE-bench position is selecting on a signal the benchmark never validated for this domain. The same logic reaches EDA tooling, where generating a constraint-correct placement or a valid netlist is a geometric and constraint-satisfaction problem dressed up as code.
What can the arXiv preprint verify, and what can’t it?
The abstract documents the benchmark design and two qualitative verdicts, and those are the only results that can be checked today. The first verdict: specialized mesh-to-CAD models “substantially outperform” code-generating VLMs under idealized inputs. The second: the VLMs themselves are unreliable at CAD program reconstruction. Both are the authors’ characterization of their own results, not an independently reproduced comparison, and neither arrives with per-model numbers, named baselines, or a published scoring procedure for partial geometric correctness.
Two implications follow. The generation scale, more than 1.4 million programs across eleven systems, is a throughput figure, not an accuracy guarantee; a large run can still produce a leaderboard that the full paper revises when per-model numbers land. And the partial-geometric-correctness procedure is the specific detail to watch when the full release arrives, because that is where authors have the most latitude to define what counts as “good enough,” and a score computed under one definition is not comparable to a score computed under another.
What do the three failure modes imply for model choice?
CADBench surfaces three recurring failure modes, and each one is a selection criterion in disguise.
Complexity sensitivity. Reconstruction quality degrades as geometric complexity rises: as the target program gains B-rep faces and operations, the model’s output drifts further from the reference. A demo on a simple bracket or plate proves nothing about performance on a multi-body assembly, and the STEP-based face-count stratification is what turns this from an anecdote into a measurable axis. Complexity sensitivity is also the hardest failure to patch with more training data, because it tracks the combinatorial structure of the program rather than a single missing capability.
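Slicing results by face count is straightforward to reproduce on a team's own parts. The following is a rough sketch under stated assumptions: the bucket edges, the `stratified_means` helper, and the scores are all hypothetical, since the text above does not give the benchmark's actual strata or numbers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bucket edges; the benchmark's actual strata are not stated here.
BUCKET_EDGES = [(0, 10), (10, 50), (50, 200)]


def bucket_for(face_count):
    """Map a B-rep face count to a labelled complexity bucket."""
    for lo, hi in BUCKET_EDGES:
        if lo <= face_count < hi:
            return f"{lo}-{hi - 1}"
    return f"{BUCKET_EDGES[-1][1]}+"


def stratified_means(results):
    """results: iterable of (face_count, score). Returns mean score per bucket."""
    groups = defaultdict(list)
    for faces, score in results:
        groups[bucket_for(faces)].append(score)
    return {bucket: mean(scores) for bucket, scores in groups.items()}


# Made-up (face_count, fidelity_score) pairs for illustration only.
results = [(6, 0.9), (8, 0.8), (30, 0.6), (120, 0.3), (400, 0.1)]
print(stratified_means(results))
```

A single average over these five made-up parts would hide the downward trend that the per-bucket means make visible.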
Modality-shift brittleness. CAD-specialized models, trained on clean geometry, falter when handed a photorealistic render or a noisy scan. General-purpose VLMs tolerate input variation better but remain unreliable, per the authors. There is a trade-off between specialization and robustness, and CADBench exposes it by holding the target fixed while perturbing the input. If your real workflow starts from scans or renders rather than clean meshes, the specialist that tops the clean-input leaderboard may be the wrong pick.
Metric-dependent rankings. A model can lead on executability (its programs run) and trail on geometric fidelity (the shape is wrong) or compactness (the program is bloated). A single leaderboard position therefore misrepresents the trade space. The metric that matters depends on the use case: a runnable-but-approximate program may be acceptable for a rough concept model; an accurate shape is required for a manufacturing-ready part; a compact, editable program is required when the output must be maintained and re-parameterized.
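A toy table shows how the ordering can change with the metric. The systems and scores below are invented for illustration and do not correspond to any model or result in the preprint. Higher is assumed to be better on every axis.

```python
# Hypothetical scores for three made-up systems; higher is better on every axis.
scores = {
    "model_a": {"executability": 0.95, "fidelity": 0.60, "compactness": 0.70},
    "model_b": {"executability": 0.80, "fidelity": 0.85, "compactness": 0.50},
    "model_c": {"executability": 0.70, "fidelity": 0.75, "compactness": 0.90},
}


def rank_by(metric):
    """Return system names ordered best-first on a single metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


for metric in ("executability", "fidelity", "compactness"):
    print(metric, rank_by(metric))
```

Each made-up system leads on exactly one axis, so a single leaderboard position would mean nothing until the metric is named.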
How should a team vet a coding assistant for CAD or EDA work?
Treat aggregate coding scores as non-transferable to domain-specific program generation. Pressure-test candidates against the actual target geometry and the actual constraint format, and settle the evaluation axis before benchmarking: whether you need an editable parametric history or just a mesh, and whether your inputs arrive as clean geometry or as noisy real-world captures. Those two answers determine which CADBench metric and which modality slice map to your workload.
Until CADBench publishes per-model numbers, its most useful output for a buying decision is not a leaderboard but the testing protocol it implies. General coding benchmarks were never going to answer this question; they measure a different skill, and CADBench is the reminder that the difference is load-bearing when the output has to be geometry an engineer can actually edit.
Frequently Asked Questions
Does CADBench test generative design or only CAD reconstruction?
CADBench is a reconstruction benchmark. A target part already exists, and the model must recover the sketch, operation, and constraint sequence that produced it. It does not test forward design from a functional specification, so a model that scores well has shown it can invert an existing part, not that it can originate a novel one from a blank drawing or performance requirement.
How do the five source datasets behind CADBench differ?
The sources carry different acquisition biases. DeepCAD contributes command-sequence logs of designer modeling sessions, ABC contributes roughly a million parametric B-rep models mined from public engineering repositories, Fusion 360 contributes parts reconstructed inside Autodesk’s tool, MCB (the Mechanical Components Benchmark) contributes annotated meshes of mechanical components, and Objaverse contributes non-engineered synthetic 3D assets rather than machined parts. A model that tops one family may have matched that source’s distribution instead of solving the underlying geometry problem, which is why CADBench diversity-samples across all five.
Which scoring decision will most affect how the CADBench leaderboard reads?
The partial-geometric-correctness metric. Three-dimensional reconstruction usually scores shape match with Chamfer distance for average point error, IoU for volumetric overlap, or Hausdorff distance for worst-case error, and each rewards a different kind of approximation. A model can rank well under Chamfer while failing Hausdorff, because a single bad fillet barely moves the average but dominates the worst case, so two fidelity scores are not comparable until the paper names its metric.
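The Chamfer-versus-Hausdorff disagreement is easy to reproduce on toy point sets. The sketch below uses plain Euclidean nearest-neighbour distances and a symmetric Chamfer defined as the mean of the two directional means. Other Chamfer variants (squared distances, summed directions) exist, and nothing here is taken from the preprint's scoring procedure. The point sets are made up.

```python
import math


def directed_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer(a, b):
    """Symmetric Chamfer distance: mean of the two directional mean distances."""
    ab = directed_distances(a, b)
    ba = directed_distances(b, a)
    return 0.5 * (sum(ab) / len(ab) + sum(ba) / len(ba))


def hausdorff(a, b):
    """Symmetric Hausdorff distance: worst nearest-neighbour distance either way."""
    return max(max(directed_distances(a, b)), max(directed_distances(b, a)))


# Reference: 100 points along a straight edge (made-up geometry).
reference = [(float(i), 0.0) for i in range(100)]

# Candidate A: every point is slightly off by the same small amount.
uniform_error = [(x, 0.1) for x, _ in reference]

# Candidate B: exact everywhere except one badly misplaced point.
single_outlier = list(reference)
single_outlier[50] = (50.0, 5.0)

for name, cand in (("uniform", uniform_error), ("outlier", single_outlier)):
    print(name, round(chamfer(cand, reference), 3), round(hausdorff(cand, reference), 3))
```

On these made-up points, the single-outlier candidate scores better on Chamfer (about 0.03 versus about 0.1) and far worse on Hausdorff (5.0 versus about 0.1). The two metrics would rank the candidates in opposite order.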
What infrastructure does a team need to run a CADBench-style test on its own parts?
An executable CAD kernel. Checking executability and geometric fidelity means instantiating every candidate program in a boundary-representation engine, whether a licensed commercial kernel such as Parasolid, ACIS, or CGM, or an open-source one such as OpenCASCADE. Each evaluation is therefore far costlier than a text benchmark that compiles in milliseconds, and the kernel choice can itself shift executability scores when a model’s output relies on an operation the kernel implements differently.
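The kernel dependence can be sketched without committing to any real kernel's API. In the example below, `make_stub_kernel` is a hypothetical stand-in for a binding to an actual B-rep engine, the programs are reduced to lists of operation names, and the supported-operation sets are invented. The only thing it demonstrates is that the same candidate programs can yield different executability rates under different kernels.

```python
def make_stub_kernel(supported_ops):
    """Build a stand-in executor; a real harness would call a CAD kernel here."""
    def execute(program):
        for op in program:
            if op not in supported_ops:
                raise ValueError(f"unsupported operation: {op}")
        return True  # stand-in for "produced a valid solid"
    return execute


def executability_rate(programs, execute):
    """Fraction of programs that run without error and report a valid result."""
    if not programs:
        return 0.0
    ok = 0
    for program in programs:
        try:
            if execute(program):
                ok += 1
        except Exception:
            pass  # any kernel error counts as a non-executable program
    return ok / len(programs)


# Made-up candidate programs, reduced to operation names.
programs = [
    ["sketch", "extrude"],
    ["sketch", "extrude", "fillet"],
    ["sketch", "shell"],
    ["sketch", "extrude", "fillet"],
]

kernel_full = make_stub_kernel({"sketch", "extrude", "fillet"})
kernel_min = make_stub_kernel({"sketch", "extrude"})

print(executability_rate(programs, kernel_full))
print(executability_rate(programs, kernel_min))
```

In a real setup the `execute` callable would wrap whichever kernel the team's downstream toolchain actually uses, since that is the one whose acceptance matters.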
What published result would most change the CADBench takeaway?
A per-model breakdown sliced by B-rep face count. The headline that specialized models beat code-generating VLMs holds under idealized inputs, but complexity sensitivity means the gap could narrow or even invert on high-face-count parts. If the full paper reports face-count-stratified scores showing VLMs closing or flipping the gap as complexity rises, the procurement implication changes with it: the specialist becomes the niche pick and the generalist the default.