When LLM-Generated CUDA Kernels Pass Tests but Get the Math Wrong

By mid-2026, CUDA kernel generation has reached a point where a language model can write code that compiles, launches successfully on an H100, returns numbers in the right shape, and is still mathematically wrong in ways that surface weeks later as degraded training loss or a model that underperforms in production.

A June 2026 arXiv paper (arXiv:2606.20128) makes this concrete by examining whether the correctness checks used by KernelBench, TritonBench, and GEAK would actually catch bugs in generated kernels. The answer is that the standard approach (allclose-style numerical proximity checks on fixed shapes) misses what the authors call transcription-style bugs. Their protocol, op-schema-aware seeded fuzzing against an fp64 CPU reference with per-(op, dtype) absolute tolerances, caught 10 of 10 buggy kernels (“illusions”) while passing 16 of 16 correct controls across 26 operations on five GPUs (RTX 3060, A10, L40S, A100 SXM4, H100 NVL).

Why “it ran without crashing” is not a correctness signal

The gap between a clean launch and a correct result is a structural artifact of the CUDA execution model, not an edge case. Kernel launches are asynchronous: the host thread does not block waiting for the GPU to finish, and cudaGetLastError() can return errors from previous asynchronous launches, not necessarily the one you just issued. A validation loop that checks whether a kernel crashes is checking the wrong event at the wrong time. A launch can return to the host cleanly while the GPU is still computing incorrect results, or while the kernel silently accumulates floating-point rounding error through a reduction order that works on the test shape but diverges on others.
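The reduction-order effect can be reproduced without a GPU. The sketch below is a pure-Python illustration, not CUDA: it simulates fp16 accumulation by round-tripping every partial sum through the struct module's half-precision format, and the helper names (f16, sum_sequential_f16, sum_pairwise_f16, compare_orders) are invented for this example. It sums n ones in two different orders and compares both against an fp64 reference.

```python
import math
import struct

def f16(x):
    """Round a Python float (fp64) to the nearest fp16 value (ties to even)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def sum_sequential_f16(xs):
    """Left-to-right accumulation, rounding every partial sum to fp16."""
    acc = 0.0
    for x in xs:
        acc = f16(acc + x)
    return acc

def sum_pairwise_f16(xs):
    """Tree-shaped accumulation, similar in spirit to a parallel reduction."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return f16(sum_pairwise_f16(xs[:mid]) + sum_pairwise_f16(xs[mid:]))

def compare_orders(n):
    """Return (sequential fp16, pairwise fp16, fp64 reference) for n ones."""
    xs = [1.0] * n
    return sum_sequential_f16(xs), sum_pairwise_f16(xs), math.fsum(xs)

for n in (1024, 4096):
    print(n, compare_orders(n))
# 1024 (1024.0, 1024.0, 1024.0)
# 4096 (2048.0, 4096.0, 4096.0)
```

At n = 1024 both orders agree with the reference, so a test fixed at that shape passes. At n = 4096 the sequential order stalls at 2048, because fp16 cannot represent 2049 and the tie rounds back down, while nothing crashes and the output still looks like a plausible number.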

StartupFortune’s May 2026 analysis describes these errors surfacing as degraded benchmarks, unstable loss curves, or subtly wrong production systems, none of which trigger a runtime exception that would have caught the problem at test time.

What existing benchmarks actually measure

The paper, arXiv:2606.20128, does not measure LLM output quality directly. It measures whether the evaluation frameworks that other papers use would catch bugs if an LLM introduced them. KernelBench, TritonBench, and GEAK, as evaluated in that paper, use fixed-shape, small-sample numerical proximity checks: run the kernel on a representative input and check whether the output falls within tolerance of a reference. That is adequate for obvious failures, such as a kernel that returns zeros or NaN, but it misses transcription-style bugs that produce plausible numbers on the benchmark’s test shapes while failing on the shape distribution a production model actually encounters.
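A minimal sketch of how a fixed-shape check can pass a broken kernel follows. Everything in it is hypothetical: tiled_sum is a pure-Python stand-in for a generated kernel with an injected tail-handling bug, and close mirrors the usual allclose criterion for a single scalar. It is not taken from any of the benchmarks named above.

```python
import math

def tiled_sum(xs, tile=4):
    """Stand-in for a generated kernel: sums in tiles but forgets the tail."""
    total = 0.0
    full = len(xs) - len(xs) % tile  # BUG: elements past the last full tile are skipped
    for start in range(0, full, tile):
        total += sum(xs[start:start + tile])
    return total

def reference_sum(xs):
    """fp64 reference; math.fsum returns a correctly rounded sum."""
    return math.fsum(xs)

def close(got, ref, rtol=1e-5, atol=1e-8):
    """Scalar form of the allclose criterion: |got - ref| <= atol + rtol * |ref|."""
    return abs(got - ref) <= atol + rtol * abs(ref)

def check(n):
    """Run the stand-in kernel on the inputs 1..n and compare to the reference."""
    xs = [float(i + 1) for i in range(n)]
    return close(tiled_sum(xs), reference_sum(xs))

print(check(8))   # True: the fixed test shape is a multiple of the tile size
print(check(10))  # False: the last two elements never reach the sum
```

A benchmark that only ever evaluates n = 8 reports this kernel as correct; the defect is only visible once the shape varies.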

The paper’s fuzzing protocol changes two things: it varies the input shapes across the full operator schema, and it compares against an fp64 CPU reference rather than a same-precision GPU baseline. Those two changes caught all 10 of 10 illusions in the extended corpus across five GPU types. The initial corpus showed the same pattern: 9 of 9 buggy variants detected, 15 of 15 correct controls passing.
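The two changes can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the paper’s implementation: the kernel is a pure-Python RMSNorm stand-in with an optional injected bug (the row count used where the hidden size belongs), lower precision is simulated with struct round-trips, and SCHEMA, ATOL, and all function names are invented for the example.

```python
import math
import random
import struct

def round_to(dtype, x):
    """Simulate storing x in fp16 or fp32 by round-tripping through struct."""
    fmt = {"fp16": "e", "fp32": "f"}[dtype]
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def rmsnorm_ref(x, eps=1e-6):
    """fp64 CPU reference: scale each row by 1 / sqrt(mean(row**2) + eps)."""
    out = []
    for row in x:
        scale = math.sqrt(math.fsum(v * v for v in row) / len(row) + eps)
        out.append([v / scale for v in row])
    return out

def rmsnorm_kernel(x, dtype, swapped_dims=False, eps=1e-6):
    """Stand-in for a generated kernel; swapped_dims injects the bug."""
    n = len(x) if swapped_dims else len(x[0])  # bug: row count used as hidden size
    out = []
    for row in x:
        scale = math.sqrt(math.fsum(v * v for v in row) / n + eps)
        out.append([round_to(dtype, v / scale) for v in row])
    return out

# Hypothetical op schema and assumed per-dtype absolute tolerances.
SCHEMA = {"rows": (1, 8), "hidden": (1, 64), "dtypes": ("fp16", "fp32")}
ATOL = {"fp16": 4e-3, "fp32": 1e-6}

def max_abs_err(got, ref):
    """Largest elementwise absolute difference between two 2-D lists."""
    return max(abs(g - r) for g_row, r_row in zip(got, ref) for g, r in zip(g_row, r_row))

def fuzz(kernel, trials=50, seed=0):
    """Draw shapes and dtypes from SCHEMA; return the failing cases."""
    rng = random.Random(seed)  # seeded so any failure can be replayed
    failures = []
    for trial in range(trials):
        rows = rng.randint(*SCHEMA["rows"])
        hidden = rng.randint(*SCHEMA["hidden"])
        dtype = rng.choice(SCHEMA["dtypes"])
        x = [[round_to(dtype, rng.uniform(-1.0, 1.0)) for _ in range(hidden)]
             for _ in range(rows)]
        err = max_abs_err(kernel(x, dtype), rmsnorm_ref(x))
        if err > ATOL[dtype]:
            failures.append((trial, rows, hidden, dtype, err))
    return failures

def buggy(x, dtype):
    """The stand-in kernel with the dimension mix-up switched on."""
    return rmsnorm_kernel(x, dtype, swapped_dims=True)

# A square fixed shape hides the bug, because rows == hidden there.
square = [[0.5, -0.25, 1.0, 0.75] for _ in range(4)]
print(max_abs_err(buggy(square, "fp32"), rmsnorm_ref(square)) <= ATOL["fp32"])  # True
print(len(fuzz(rmsnorm_kernel)), len(fuzz(buggy)) > 0)
```

Because the generator is seeded, each recorded failure (trial index, shape, dtype) can be replayed exactly, which is what makes a fuzzing failure usable as a regression test.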

A correctness score earned on standard benchmark evaluations is best read as an upper bound on actual correctness, because kernels that pass a fixed-shape check can still fail under broader testing. The gap between benchmark-correct and production-correct is what the paper is quantifying.

What HuggingFace’s kernel-builder actually validates

HuggingFace’s published speedup figures are runtime measurements; the correctness validation behind them is a separate question. The kernel-builder agent skill, driven by Claude Opus 4.5, produced kernels for LTX-Video and Qwen3-8B, with reported average RMSNorm speedups of 1.88x and 1.94x on H100 versus PyTorch baselines.

The upskill validation framework uses assertion-based test cases: input/output pairs that check whether the generated build configuration contains expected values such as GPU compute capabilities (“9.0” for H100) or required CUDA headers. The improvement on Sonnet after skill injection, per the upskill evaluation, is a pass rate on those assertions, so it reflects whether the model generates contextually appropriate build configurations. There is no documented fp64 reference comparison in the upskill loop. The two numbers describe different properties: 1.88x is a runtime measurement; the pass-rate score is a code-generation conformance measure.

Practitioners who adopt this workflow to ship custom kernels inherit a validation methodology optimized for generation speed, not for numeric fidelity at the operator level.

What the CUDA-LLM “correctness guarantee” actually covers

The CUDA-LLM FSR framework’s correctness guarantee means passing functional test cases, not demonstrating numeric equivalence against a high-precision reference. Published in June 2025, the CUDA-LLM FSR paper jointly optimizes for compilation correctness, functional correctness, and runtime latency, with the authors reporting that generated kernels “consistently guarantee correctness rates” and speedups of up to 179x over human-written code.

Both claims are consistent with arXiv:2606.20128’s findings once you resolve the vocabulary. “Correctness” in the FSR framework means the same kind of fixed-shape, pass/fail evaluation that the June 2026 paper shows is insufficient for catching transcription-style bugs. A kernel can score “consistently guaranteed” by that definition and still carry errors that only op-schema-aware fuzzing surfaces. Speed and numeric correctness are orthogonal properties in GPU kernel evaluation; the 179x figure is a runtime metric, not a statement about floating-point fidelity across the full input domain.

What rigorous verification actually requires

According to arXiv:2606.20128, catching transcription-style bugs requires three things: per-operator schema awareness (so test inputs cover the actual dtype and shape distribution the kernel will encounter), an fp64 CPU reference (not a same-precision GPU baseline), and per-(op, dtype) absolute tolerances calibrated to the precision of the operation. All three are necessary; any two leave the gap open.
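The source does not give the paper’s tolerance values or its calibration procedure, so the following is only one plausible way to build a per-(op, dtype) table. It assumes a tolerance of the form safety factor times unit roundoff times the largest expected output magnitude; OP_PROFILE, its numbers, and within_tolerance are all hypothetical.

```python
# Unit roundoff for round-to-nearest: fp16 has 11 significand bits, fp32 has 24.
UNIT_ROUNDOFF = {"fp16": 2.0 ** -11, "fp32": 2.0 ** -24}

# Assumed per-op profile: largest expected |output| and a safety factor.
OP_PROFILE = {
    "softmax": {"max_abs": 1.0, "factor": 4.0},
    "rmsnorm": {"max_abs": 8.0, "factor": 4.0},
}

# One absolute tolerance per (op, dtype) pair.
ATOL = {
    (op, dtype): prof["factor"] * u * prof["max_abs"]
    for op, prof in OP_PROFILE.items()
    for dtype, u in UNIT_ROUNDOFF.items()
}

def within_tolerance(op, dtype, got, ref):
    """Compare one output to the fp64 reference; unknown pairs raise KeyError."""
    return abs(got - ref) <= ATOL[(op, dtype)]

# The same 1e-3 deviation is rounding noise in fp16 but a real defect in fp32.
print(within_tolerance("softmax", "fp16", 0.251, 0.250))  # True
print(within_tolerance("softmax", "fp32", 0.251, 0.250))  # False
```

A single global tolerance loose enough for fp16 would wave the fp32 deviation through, which is the case a per-(op, dtype) table is meant to prevent. Raising KeyError on an unlisted pair forces a deliberate calibration decision instead of a silent default.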

StartupFortune’s analysis also references KernelBench-X, a May 2026 benchmark covering 176 GPU-kernel tasks that flags numerical precision as an area where LLM-generated kernels still fall short, and Model2Kernel, a March 2026 study that found hundreds of previously unknown bugs in kernels from production model-serving environments. Both are described secondhand in that piece, and their specific metrics would need primary-source verification before treating them as firm numbers.

The practical hierarchy for teams shipping AI-generated kernels:

Crash absence is not a signal. A clean cudaGetLastError() right after a launch does not cover that kernel’s execution, which may not have finished, and it says nothing about whether the results are right.
Fixed-shape allclose checks are necessary but not sufficient. They catch obvious failures; they miss transcription-style bugs that only appear outside the benchmark’s test shapes.
Compare against an fp64 CPU reference. Implement the same operation in NumPy or SciPy at double precision and compare with per-(op, dtype) tolerances.
Vary shapes and dtypes systematically. An op-schema-aware fuzzer that exercises the kernel across the actual input distribution is the only path to surfacing bugs before production does.

The agent loops and skill-injection workflows that produce the headline speedups as of mid-2026 are optimized for throughput. Adding a reference-output comparison pass is a separate build step, and it is the one that closes the gap between “passed the smoke test” and “correct.”

Frequently Asked Questions

Is this silent numerical drift problem limited to LLM-generated CUDA kernels, or can hand-written kernels have the same issue?

It is not limited to LLM output. Model2Kernel, published in March 2026, verified memory safety for CUDA kernels taken from real model-serving environments and reported hundreds of previously unknown bugs, including ones in production kernels written by people. KernelBench-X flags numerical precision as an unresolved failure mode across the broader set of generated and benchmarked kernels it tests.

How does the CUDA-LLM FSR framework define correctness compared with the June 2026 arXiv verification protocol?

FSR jointly optimizes for compilation success, passing functional test cases, and runtime latency, reporting speedups of up to 179x over human-written code. The June 2026 protocol instead uses op-schema-aware seeded fuzzing against an fp64 CPU reference with per-(op, dtype) absolute tolerances. One asks whether the kernel runs and matches expected values on fixed shapes; the other asks whether it is numerically equivalent across the full operator input domain.

What is the lowest-friction way for a team to add the missing verification step to an existing kernel CI pipeline?

Add a reference-output comparison pass. Implement the same operator in NumPy or SciPy at fp64, parameterize the test over representative shapes and dtypes that cover the op schema, and compare the GPU output using per-(op, dtype) absolute tolerances. The reference itself runs on CPU, the approach needs no dedicated fuzzer to start, and it can catch transcription-style bugs within the tested shapes and dtypes before they reach production.

Can a kernel pass both fixed-shape allclose checks and an fp64 reference comparison and still be unsafe?

Yes. Model2Kernel found memory-safety defects, such as out-of-bounds accesses and race conditions, in kernels whose outputs could still look numerically plausible. Because CUDA launches are asynchronous, those defects can corrupt memory silently rather than crash. Numeric equivalence checks and memory-safety verification are separate gates.

Why would autonomous kernel-building agents resist adding op-schema-aware fuzzing by default?

The current agent workflows are optimized for generation throughput and headline speedups, such as HuggingFace’s reported 1.88x and 1.94x RMSNorm speedups on H100. Adding schema-aware fuzzing with fp64 reference comparisons increases validation wall-clock time and requires maintaining a CPU reference for every operator, which conflicts with the publish rate these autonomous loops are designed to sustain.