When Vibe-Coded Software Is Safety-Critical, Who Verifies It?

A June 2026 arXiv preprint argues that vibe coding, the practice of accepting LLM-generated source from a natural-language prompt with minimal review, has no path to certification under the standards that govern aviation, automotive, and industrial functional safety. Its proposed fix is to stop asking the model to verify itself and instead extract formal models from the generated code that a human can audit and sign against (arXiv:2606.22413).

The timing is why this matters now. Minimal-review, high-velocity AI coding is already a shipped product category: Mistral’s “Vibe”, formerly Le Chat, is marketed as an agent where “async agents write, test, and open PRs while your team is offline.” The term has crossed into consumer framing; Merriam-Webster’s usage tracker cites a Verge report that Apple is inviting users to “essentially vibe-code their own extensions.” None of that software is safety-critical today. The preprint’s concern is the trajectory: the hands-off drafting pattern that ships a browser extension now is the same pattern that will be tempted into a dosing pump later.

Where does vibe coding run into a certification wall?

Vibe coding is fine for a browser extension or a landing page. It has no certification route into the software that controls a brake pedal, a dosing pump, or a flight surface, because the standards that govern those systems do not accept “the model wrote it” as evidence of anything.

The paper defines vibe coding as accepting LLM-generated source from natural-language intent with minimal review, and argues it is adequate for low-criticality consumer software but offers “no path to certification” for systems governed by DO-178C, IEC 61508, or ISO 26262. Those three standards share a premise that is the inverse of vibe coding: each requirement must be met, each line of code tied to a requirement, and each verification activity recorded with a result. Minimal review is the practice they exist to forbid.
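The shared premise of those standards can be illustrated with a small traceability check. The sketch below is hypothetical: the type names, the requirement IDs, and the `gaps` method are invented for illustration and come neither from the paper nor from the text of any standard. It only shows the shape of the chain a certifier expects to be complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names and IDs are illustrative, not taken from any standard.
final class TraceCheck {
    // One code unit (for example a method) and the requirement IDs it claims to implement.
    record CodeUnit(String name, Set<String> requirementIds) {}

    // One recorded verification activity and its outcome.
    record VerificationRecord(String requirementId, String activity, boolean passed) {}

    // Returns human-readable gaps; an empty list means every link in the chain is present.
    static List<String> gaps(Set<String> requirements,
                             List<CodeUnit> code,
                             List<VerificationRecord> records) {
        List<String> gaps = new ArrayList<>();
        for (CodeUnit unit : code) {
            if (unit.requirementIds().isEmpty()) {
                gaps.add("code without requirement: " + unit.name());
            }
            for (String id : unit.requirementIds()) {
                if (!requirements.contains(id)) {
                    gaps.add("unknown requirement " + id + " cited by " + unit.name());
                }
            }
        }
        // Sorted copy so the report order is deterministic.
        for (String req : new TreeSet<>(requirements)) {
            boolean implemented = code.stream()
                    .anyMatch(u -> u.requirementIds().contains(req));
            boolean verified = records.stream()
                    .anyMatch(r -> r.requirementId().equals(req) && r.passed());
            if (!implemented) gaps.add("requirement not implemented: " + req);
            if (!verified) gaps.add("requirement without passing verification record: " + req);
        }
        return gaps;
    }

    public static void main(String[] args) {
        Set<String> requirements = Set.of("REQ-1", "REQ-2");
        List<CodeUnit> code = List.of(
                new CodeUnit("clampDose", Set.of("REQ-1")),
                new CodeUnit("helper", Set.of()));   // traced to nothing
        List<VerificationRecord> records = List.of(
                new VerificationRecord("REQ-1", "deductive proof", true));
        gaps(requirements, code, records).forEach(System.out::println);
    }
}
```

Code accepted with minimal review tends to look like `helper` here: present in the build, traced to nothing, and backed by no recorded result.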

The conflict is structural rather than incidental. A certification authority under DO-178C does not care how cleverly the prompt was phrased. It asks for verification evidence tied to objectives, and it asks a human to sign that the evidence satisfies them. Vibe coding optimizes for shipping velocity by compressing the review step that certification exists to expand.

How does Forge close the verification loop?

Forge, named in the paper as the Formal method Oriented Refinement loop for GEnerated code, is a closed-loop pipeline that generates Java through vibe coding and then uses model-driven engineering to extract formal artifacts checked by three complementary verifiers.

The architecture gives the LLM a single, honest role: draft generator. The model-driven engineering chain is the discriminator. Generated Java is fed through an extraction step that produces three kinds of formal artifact, each handed to a tool built to refuse unproven claims:

Dafny performs deductive verification, checking that code satisfies stated preconditions and postconditions.
CSP refinement is discharged by the FDR4 Failures-Divergences Refinement checker, which asks whether a concurrent implementation is a valid refinement of an abstract specification.
Z-Machines models are checked through theorem proving in Isabelle.

Every verification failure is converted into a structured correction prompt that drives the next code-generation iteration. The paper’s pitch to the developer is that they “never have to read the formal models.” The loop is meant to absorb the formal-methods expertise a working developer lacks, while still producing the artifacts a certification process wants.
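The loop described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the paper’s implementation: `Verifier`, `Verdict`, `correctionPrompt`, and `run` are hypothetical names, the extraction step is folded into each `Verifier` for brevity, and the iteration budget is an addition of this sketch rather than something the abstract describes.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a generate-extract-verify loop; none of these types come from the paper.
final class RefinementLoop {
    // A verifier's verdict: passed, or failed with a diagnostic message.
    record Verdict(String verifier, boolean passed, String diagnostic) {}

    // Stand-in for one verifier (for example a wrapper around Dafny, FDR4, or Isabelle).
    // In this sketch, model extraction is assumed to happen inside check().
    interface Verifier {
        Verdict check(String javaSource);
    }

    // Turns failing verdicts into a structured correction prompt for the next iteration.
    static String correctionPrompt(String intent, List<Verdict> failures) {
        StringBuilder sb = new StringBuilder("Intent: ").append(intent).append('\n');
        for (Verdict v : failures) {
            sb.append("- ").append(v.verifier())
              .append(" failed: ").append(v.diagnostic()).append('\n');
        }
        return sb.append("Revise the Java so that every check above passes.").toString();
    }

    // Runs the loop until all verifiers pass or the iteration budget is spent.
    static Optional<String> run(String intent,
                                Function<String, String> generator,
                                List<Verifier> verifiers,
                                int maxIterations) {
        String prompt = intent;
        for (int i = 0; i < maxIterations; i++) {
            String source = generator.apply(prompt);
            List<Verdict> failures = verifiers.stream()
                    .map(v -> v.check(source))
                    .filter(v -> !v.passed())
                    .toList();
            if (failures.isEmpty()) {
                return Optional.of(source);      // every verifier accepted this draft
            }
            prompt = correctionPrompt(intent, failures);
        }
        return Optional.empty();                 // budget spent, nothing to sign
    }

    public static void main(String[] args) {
        // Toy generator: a buggy draft first, a fixed one after any correction prompt.
        Function<String, String> generator = prompt ->
                prompt.contains("failed") ? "return Math.max(0, x);" : "return x;";
        Verifier nonNegative = src -> new Verdict(
                "toy-checker", src.contains("Math.max(0"), "result may be negative");
        System.out.println(run("return a non-negative value", generator,
                List.of(nonNegative), 3));
    }
}
```

The point of the shape is that the generator never sees a pass/fail decision it made itself; it only sees diagnostics produced by the checkers.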

The notable design choice is the division of labor. The model does what it is good at, drafting plausible code from intent, and is explicitly denied the job it is bad at, judging whether that code is correct. Judgement is offloaded to tools whose entire purpose is to refuse to lie.

Why not just make the LLM write Dafny or Lean?

The obvious alternative is to skip extraction and ask the model to emit verifiable code directly in Dafny, Verus, or Lean. The paper argues there is a structural reason that does not work, and it is not a model-capability complaint.

Verification-aware languages like Dafny, Verus, and Lean are, in the paper’s words, “scarce in pretraining data and absent from industrial toolchains.” A model asked to prove its own work in a language it has barely seen tends to produce proof scripts that look right and fail to discharge. Extraction sidesteps the problem: the model writes ordinary Java, which dominates its training distribution, and the formal models are derived from the output rather than demanded of it.

This is the part of the argument that reads as durable regardless of how the empirical claims hold up. If the bottleneck on verifiable AI code were model cleverness, scale would solve it. If the bottleneck is the distribution of verification languages in training data and toolchains, scale does not obviously solve it, and extraction is a reasonable workaround. That distinction is worth holding onto the next time a larger model is claimed to write provable Dafny out of the box.

If the model writes the code, who signs off on it?

This is where the paper becomes more interesting than its mechanics. The accountability consequence it sets up, which is this article’s read of the framing rather than a stated finding, is that the auditable object shifts from the prompt that generated the code to the verification artifact extracted from it.

Under a traditional certification regime, the human signs a body of evidence: requirements, design, code, tests, traceability. Under a vibe-coding regime without something like Forge, there is no clean thing to sign. The prompt is not a spec; the generated code is not authored in any reviewable sense; the model cannot be deposed. By routing verification through extracted formal models, the paper’s framing hands the human a discrete artifact to sign against: the discharged proofs, the refinement checks, the theorem-prover outputs.
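One way to picture such a discrete artifact is as a bundle that binds verifier results to the exact source they were produced for. The sketch below is an assumption of this article, not a format defined in the paper: `EvidenceBundle`, `Result`, and the field names are hypothetical, and a real evidence package would carry far more (tool versions, extracted models, proof logs).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

// Hypothetical sketch of a signable evidence bundle; field names are illustrative only.
final class EvidenceBundle {
    // One verifier result, tied to the requirement it is offered as evidence for.
    record Result(String requirementId, String tool, boolean discharged) {}

    private final String sourceDigest;   // SHA-256 of the exact source that was verified
    private final List<Result> results;

    EvidenceBundle(String javaSource, List<Result> results) {
        this.sourceDigest = sha256(javaSource);
        this.results = List.copyOf(results);
    }

    // Offered for sign-off only if there is evidence and every obligation was discharged.
    boolean readyForSignOff() {
        return !results.isEmpty() && results.stream().allMatch(Result::discharged);
    }

    // True only if the given source is byte-for-byte the source the evidence refers to.
    boolean covers(String javaSource) {
        return sourceDigest.equals(sha256(javaSource));
    }

    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String source = "int clamp(int x) { return Math.max(0, x); }";
        EvidenceBundle bundle = new EvidenceBundle(source,
                List.of(new Result("REQ-1", "Dafny", true)));
        System.out.println("ready: " + bundle.readyForSignOff());
        System.out.println("covers edited source: " + bundle.covers(source + " "));
    }
}
```

The digest matters because regenerated code is new code: evidence for one draft says nothing about the next one, so the signature has to attach to a specific source text.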

That reframing raises the cost of shipping AI-written code into regulated domains rather than lowering it, because it demands a verification artifact where none was previously produced. It also exposes a gap in how current AI-coding tooling assigns responsibility for failures. An async agent that writes, tests, and opens a PR while the team is offline (Mistral’s Vibe) is not marketed as producing any certification artifact. If the generated function controls a brake, the tooling does not answer who signs off or against what spec. The preprint’s contribution is to make that question concrete by producing something to point at.

What does the paper actually prove?

The honest reading is narrower than the framing. The paper reports that the pipeline “produces standards-relevant verification evidence for LLM-generated Java,” which the authors characterize as “a step toward certification.” Both phrases are theirs, and both are hedged on purpose.

The abstract reports no quantitative pass rates, no benchmark scale, no loop-iteration counts, and no runtime cost. The work is a 38 KB preprint, posted as arXiv:2606.22413 in versions dated 21 June and 23 June 2026, submitted by Ran Wei, and not peer-reviewed. It is one group’s framing claim about a direction, not a measured result about a system that has converged on industrial-size problems.

That is not a dismissal. A short preprint that proposes a clean generator-versus-discriminator split and names the pretraining-data gap as the structural obstacle is a useful contribution even without pass rates. It is a contribution about the shape of a problem, not a claim that the problem is solved.

What does this mean for a regulated shop, and where does vendor hype outrun evidence?

For engineering leaders in aviation, automotive, and industrial control, the usable takeaway is a pattern, not a product: let the LLM draft Java, extract formal models automatically, let verifiers rather than the model act as the discriminator, and treat the verification artifact rather than the prompt as the object you audit against DO-178C, ISO 26262, or IEC 61508.

The pattern matters precisely because the surrounding market is moving the other direction. Vibe coding is sold as a velocity feature for teams that want async agents shipping code unattended (Mistral’s Vibe), and the term is migrating into consumer contexts with real attack surface (vibe-coded extensions). The certification gap is widening exactly as the drafting pattern spreads.

Two disambiguations are worth stating plainly, because the names collide. The paper’s Forge is not Mistral’s “Forge” fine-tuning service, and it is not Mistral’s “Vibe” coding agent. Three distinct things share two names. They are unrelated; the convergence is lexical, not technical.

The realistic near-term outcome is not that safety-critical systems start shipping vibe-coded, certified code. It is that the question of who verifies generated code, and against what artifact, moves from hypothetical to tractable. The preprint offers one answer: the verifier does, against an extracted formal model, and the human signs the result. Whether certifiers will accept that answer is a question the preprint does not, and cannot, settle.

Frequently Asked Questions

Why does Forge target Java when certified firmware runs C, C++, and Ada?

Software certified under DO-178C and ISO 26262 is written predominantly in C, C++, or Ada, not Java. The extractor targets generated Java because the model drafts it competently, but that sidesteps the harder problem of producing formal models for the languages actual certified firmware uses, including SPARK, the formally verifiable Ada subset already used in Level A avionics projects.

How does Forge differ from seL4 and CompCert, which already verified safety-critical code?

seL4 and CompCert each took years of expert effort, with proofs hand-written in Isabelle and Coq respectively for codebases that are small by industrial standards. Forge inverts the labor: it extracts formal models automatically from LLM-generated code, hands them to the verifiers, and feeds each failure back as a correction prompt. The tradeoff is that it verifies generated code of unknown quality rather than hand-written code experts controlled line by line.

Under DO-178C, would the Forge tool chain itself need qualification?

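Yes. DO-178C Section 12.2 requires qualification of any tool used to eliminate, reduce, or automate a process when the tool’s output is not itself verified. The tool qualification level runs from TQL-1 to TQL-5 and depends on the tool’s role and the software level, with the qualification objectives set out in the DO-330 supplement; verification tools generally fall at TQL-4 or TQL-5, while tools whose output becomes part of the airborne software face the more demanding levels. The extractors and the Dafny, FDR4, and Isabelle provers would each need qualification evidence before a certifier accepted their output, so the artifact Forge produces does not by itself close the certification loop.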

What happens if the correction loop fails to converge?

The abstract reports no iteration counts, so convergence is an open question on the available evidence. The provers bring known termination problems of their own: Z3, the SMT solver behind Dafny, can time out or return “unknown” on quantified verification conditions, which are undecidable in general, and some Isabelle obligations require human insight to discharge. A failing proof that produces a vague correction prompt could cycle the generator indefinitely, spending model calls without producing an artifact to sign.
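A pipeline of this kind would plausibly need a guard against that failure mode. The sketch below is one hypothetical approach, not something the paper describes: `LoopGuard` and its `Decision` values are invented names, and treating a repeated set of diagnostics as a stall is a heuristic, not a proof of non-convergence.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical guard against a non-converging correction loop; not described in the paper.
final class LoopGuard {
    enum Decision { CONTINUE, STOP_BUDGET, STOP_REPEATED_FAILURE }

    private final int maxIterations;
    private final Set<List<String>> seenFailureSets = new HashSet<>();
    private int iterations = 0;

    LoopGuard(int maxIterations) {
        this.maxIterations = maxIterations;
    }

    // Call once per failed iteration with that iteration's verifier diagnostics.
    Decision afterFailure(List<String> diagnostics) {
        iterations++;
        // Sort a copy so the same diagnostics in a different order count as a repeat.
        List<String> key = diagnostics.stream().sorted().toList();
        if (!seenFailureSets.add(key)) {
            return Decision.STOP_REPEATED_FAILURE;   // same failures seen before: likely cycling
        }
        if (iterations >= maxIterations) {
            return Decision.STOP_BUDGET;             // model-call budget exhausted
        }
        return Decision.CONTINUE;
    }

    public static void main(String[] args) {
        LoopGuard guard = new LoopGuard(5);
        System.out.println(guard.afterFailure(List.of("postcondition not proven")));
        System.out.println(guard.afterFailure(List.of("postcondition not proven")));
    }
}
```

Either stop condition should escalate to a human rather than silently retry, since at that point there is no artifact to sign.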

Does Forge remove the need for formal-methods expertise on the team?

No. The paper lets developers skip reading the formal models, but someone must still write the specification those models check against, and a proof only shows the code meets the spec, not that the spec captures the real safety requirement. Authoring a correct formal spec is the expensive part of formal methods, and Forge relocates that burden rather than removing it.
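A toy example makes the gap between spec and requirement concrete. Everything in it is hypothetical: the dosing scenario, the 50 ml limit, and the method names are invented for illustration, and the runtime predicates merely stand in for what a prover would establish statically.

```java
// Hypothetical dosing example: the names and limits are invented for illustration.
final class DoseSpec {
    static final int MAX_DOSE_ML = 50;

    // Implementation under check: caps the requested dose at the maximum.
    static int clampDose(int requestedMl) {
        return Math.min(requestedMl, MAX_DOSE_ML);
    }

    // The written specification: the result never exceeds the maximum.
    static boolean meetsWrittenSpec(int result) {
        return result <= MAX_DOSE_ML;
    }

    // The real safety requirement also forbids negative doses, which the written spec omits.
    static boolean meetsRealRequirement(int result) {
        return result >= 0 && result <= MAX_DOSE_ML;
    }

    public static void main(String[] args) {
        int result = clampDose(-5);
        System.out.println("result: " + result);
        System.out.println("meets written spec: " + meetsWrittenSpec(result));
        System.out.println("meets real requirement: " + meetsRealRequirement(result));
    }
}
```

A verifier given only the written spec would accept `clampDose` for every input, including a negative request. The proof would be sound and the pump would still be wrong, which is why the person who writes the spec carries the expertise burden.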