groundy

50 Years of Aviation Certification Expose a Structural Gap in AI Governance

A June 2026 arXiv paper argues that AI governance lacks the epoch limits and proof surfaces that aviation certification built in over 50 years, and that without them a one-time approval stamp cannot stay valid.


Christiaan Zietsman’s “Fifty Years of Specification Completeness” (arXiv:2606.25120), submitted 23 June 2026, argues that since 1992 civil aviation has certified safety-critical software against three structural requirements: a traceable link between governing specifications and operational evidence, a validity bound tied to operational context, and an objective definition of what counts as sufficient proof. No AI governance framework currently requires any of the three of the documents that govern a model. Aviation codifies these properties today in DO-178C and DO-330, and the paper argues that their absence in AI governance is the structural reason a deployment-time certification stamp cannot stay valid.

What has aviation required of safety-critical software since 1992?

Since 1992, civil aviation authorities, today the FAA and EASA, have certified airborne software against three structural requirements embedded in the DO-178 lineage and, more recently, its tool-qualification companion DO-330. The paper treats them as the load-bearing spine of the whole regime. The first is structured governance linkage: every requirement flowing from a governing specification must trace to concrete operational evidence that the system satisfies it. The second is context-bounded validity: a certification holds only within the operational envelope it was approved under, and a change to that envelope triggers requalification. The third is an objective evidence architecture, a shared and auditable definition of what proof means and what makes it sufficient to close a requirement.
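The first requirement, structured governance linkage, can be illustrated with a small sketch. It is not taken from the paper or from DO-178C: the `Requirement` schema, the evidence IDs, and the example requirements are all hypothetical, and the check only shows what "every requirement traces to evidence" might mean mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One requirement from a governing specification (hypothetical schema)."""
    req_id: str
    text: str
    evidence_ids: list[str] = field(default_factory=list)


def untraced(requirements: list[Requirement], evidence_store: set[str]) -> list[str]:
    """Return the IDs of requirements that no existing evidence item supports."""
    missing = []
    for req in requirements:
        # A link only counts if it points at evidence that is actually on file.
        if not any(eid in evidence_store for eid in req.evidence_ids):
            missing.append(req.req_id)
    return missing


# Invented example data: three requirements, two evidence items on file.
reqs = [
    Requirement("R1", "Agent must not write outside the workspace", ["E1"]),
    Requirement("R2", "Agent must ask before deleting files", ["E9"]),
    Requirement("R3", "Agent must log every tool call"),
]
evidence = {"E1", "E2"}

print(untraced(reqs, evidence))  # ['R2', 'R3']
```

R2 fails because its only link points at evidence that does not exist, and R3 fails because it has no link at all. Both count as open requirements under a traceability rule of this kind.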

These are enforced properties, not aspirations: they are operationalized through the certification process itself. The point is not that aviation solved safety. Aviation built a self-consistent loop in which a requirement, the evidence for it, and the conditions under which that evidence stays valid are all tracked together. According to the paper, aviation’s own standard-setting bodies have publicly acknowledged that their frameworks break down for AI systems, and AI governance has none of that machinery applied at the artifact level. That is the gap the paper spends the rest of its nine pages naming.

Why don’t aviation’s requirements transfer directly to an AI system?

They do not transfer to the system, and the paper is explicit about why. DO-178C’s apparatus assumes you can trace a requirement to deterministic behavior and re-execute it to reproduce the evidence. A large language model or a learned policy is non-deterministic: the same input need not produce the same output, so the traceability-to-evidence loop that closes a requirement in flight-control software does not close for a model’s behavior. Requalification on change, the other pillar, also falters when “change” is continuous and largely unobservable at the weight level.

Zietsman’s move is to relocate the argument one layer down. The governance artifact, whether a system prompt, an AGENTS.md file, a governance policy, or a task envelope, is a static, deterministic artifact. Its structural properties can be evaluated independently of the stochastic system it governs. Traceability, context bounds, and objective evidence are properties a document can have even when the system it constrains cannot. That is the core conceptual move of the paper, and it is worth separating from the stronger claim it sits next to.

The stronger claim, that you certify an AI the way you certify a flight-control computer, is the one the paper gives up. EASA and FAA have already said their regimes fracture for non-deterministic systems, which is why a naive transfer does not survive scrutiny. The weaker claim, that you certify the governance document the way you certify a software requirements baseline, sidesteps that objection rather than answering it. The paper’s contribution lives in the weaker claim, and so does its main limitation: it governs the document, not the model, and a well-structured prompt can still preside over a misbehaving system.

What revalidation mechanism does AI governance lack?

The paper coins two terms, epoch limits and proof surfaces, to name the revalidation mechanism AI governance has no analogue for. An epoch limit is DO-330’s requalification trigger recast at the document layer: a governance document is valid only for an epoch, a bounded window of operational context, and when that context drifts the document must be revalidated. A proof surface is the feedback channel, the set of objective evidence a governance document must produce to demonstrate it still holds.
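One way to picture an epoch limit is as a comparison between the context a document was approved against and the context it is running in now. The sketch below is an assumption about how that might look, not the paper's definition: the `OperationalContext` fields mirror the model, tool set, and data environment mentioned in this article, and every name and value is hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OperationalContext:
    """Context a governance document was approved against (illustrative fields)."""
    model_version: str
    tool_set: frozenset[str]
    data_environment: str


@dataclass(frozen=True)
class Approval:
    document_id: str
    approved_context: OperationalContext


def drifted_fields(approval: Approval, current: OperationalContext) -> list[str]:
    """List the context fields that differ from the approved context."""
    return [
        f.name
        for f in fields(OperationalContext)
        if getattr(current, f.name) != getattr(approval.approved_context, f.name)
    ]


def epoch_expired(approval: Approval, current: OperationalContext) -> bool:
    """In this sketch the epoch ends as soon as any pinned field drifts."""
    return bool(drifted_fields(approval, current))


# Invented values for illustration only.
approved = OperationalContext("model-a", frozenset({"shell", "search"}), "staging")
approval = Approval("AGENTS.md", approved)
live = OperationalContext("model-a", frozenset({"shell", "search", "browser"}), "staging")

print(drifted_fields(approval, live))  # ['tool_set']
print(epoch_expired(approval, live))   # True
```

Treating any field change as expiry is the strictest possible reading. A real scheme would need to decide which changes are material, which is exactly the kind of rule the paper says is currently missing.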

The absence of both is what the paper calls the structural gap. Today’s AI governance instruments carry no expiry and no required evidence channel, so nothing forces revalidation when the context that justified their approval changes. A system prompt approved against a specific model, tool set, and data environment has no defined point at which that approval lapses, and no defined proof it must surface to confirm it still applies. The paper maps this one-to-one onto the aviation lineage: DO-178C’s traceability architecture becomes the structural-completeness requirement on governance documents; DO-330’s requalification triggers become epoch limits; DO-178C’s objective evidence requirements become proof surfaces. Three aviation requirements, three AI-governance findings, lined up in that order.

How often do today’s AI governance artifacts fall short?

An empirical companion study (arXiv:2604.21090, “Structural Quality Gaps in Practitioner AI Governance Prompts”) gives the structural-quality argument a concrete number: 37% of evaluated AI governance file-model pairs fell below the structural quality threshold. That figure comes from a separate empirical paper, not from the main paper’s own measurements, so attributing it to Zietsman directly would conflate two studies. Read alongside the main paper, it suggests that if you applied aviation’s structural bar to today’s system prompts, AGENTS.md files, and policy documents, a substantial share would fail on day one, before any context drift sets in.

The 37% is a threshold-crossing rate, not a quality score, and the companion paper’s methodology and sample are where its limits live. Neither the sample size nor the threshold definition is reported here, so the figure is best cited as the companion’s headline result rather than as a precise population estimate. What it does for the main paper is turn a conceptual argument into an existence proof: the gap is not hypothetical, because a measurable fraction of real artifacts already sits below the bar the paper says they should clear.
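The difference between a threshold-crossing rate and a quality score is easy to show with arithmetic. The scores and the threshold of 60 below are invented purely for illustration and have nothing to do with the companion study's data. They show two samples with the same crossing rate but very different average quality.

```python
from statistics import mean


def crossing_rate(scores: list[float], threshold: float) -> float:
    """Fraction of items scoring strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


# Invented scores on a 0-100 scale; the threshold is also invented.
THRESHOLD = 60
near_miss = [58, 59, 75, 80, 85, 90, 70, 72]  # failures sit just under the bar
far_miss = [10, 15, 75, 80, 85, 90, 70, 72]   # failures sit far under the bar

print(crossing_rate(near_miss, THRESHOLD), mean(near_miss))  # 0.25 73.625
print(crossing_rate(far_miss, THRESHOLD), mean(far_miss))    # 0.25 62.125
```

A crossing rate says how many items fail, not how badly, which is why the headline figure alone cannot say how much work the failing artifacts would need.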

Is there a framework that already operationalizes this?

The paper cites PromptQ’s seven-principle framework as operationalizing the three requirements at the governance document layer. One disclosure is worth stating plainly: PromptQ is the author’s own framework, so the citation functions as a pointer to his prior work as much as an independent reference. Support for this claim is only moderate, so the safest reading is that PromptQ is one proposed instantiation of the document-layer argument, not a validated standard. Treat the framing as the author’s.

The broader neighborhood points the same way. A community paper-check listing situates arXiv:2606.25120 alongside related work including a TÜV AUSTRIA Trusted AI framework that translates EU AI Act obligations into testable criteria, and a separate paper arguing that watermarking without standards is symbolic rather than effective compliance. The common thread is that several groups are independently converging on the same diagnosis: current AI governance instruments are underspecified at the level where an auditor would actually test them.

What does this change for an AI auditor?

The takeaway for auditors and ML governance leads is the invalidation of the one-time stamp. Under a DO-330-style regime, context drift is a requalification trigger and proof surfaces are how revalidation is demonstrated, so the burden shifts from approving a system once at deployment to continuously re-certifying the governance artifacts as the operational context moves. A safety case signed off at deployment does not stay valid by default; it stays valid only for as long as the context that justified it holds, and only if evidence is surfaced to confirm it.

The 37% companion figure implies that more than a third of the evaluated artifacts would not clear the bar at initial approval, before revalidation is even in scope. The contested status of the analogy, with EASA and FAA conceding their regimes fracture for non-deterministic systems, means the paper’s defensible claim is the document-layer one: you can impose aviation’s discipline on the artifact even where you cannot impose it on the model. That is a narrower, more honest result than “certify AI like aircraft,” and it is the one an auditor can actually act on.

The wider coverage of AI certification sharpens why this matters. Mainstream coverage centers on the EU AI Act, ISO/IEC 42001, and NIST AI RMF, regulatory and management-system frameworks that operate above the artifact. Few of them treat a system prompt or an AGENTS.md file as a thing that should inherit aviation’s traceability and requalification discipline. The paper’s contribution is to pull that lens from DO-178C and DO-330 and apply it where the auditor’s hands actually rest: the static document that governs a non-deterministic system, and the only surface in the stack that can be inspected, bound to a context, and given an expiry.

Frequently Asked Questions

How does this differ from ISO/IEC 42001’s treatment of governance documents?

ISO/IEC 42001, published in December 2023 as the first certifiable AI management-system standard, certifies the organizational process that produces AI rather than any single artifact governing a model. NIST AI RMF’s Govern, Map, Measure, and Manage functions sit similarly above the document layer. Neither imposes per-artifact traceability, validity bounds, or evidence channels on system prompts or AGENTS.md files, which is the specific gap the paper names.

How does DO-330 requalification differ from MLOps retraining triggers?

DO-330 requalification fires when a qualified tool’s operational context changes, and the burden is on demonstrating that prior approval still holds. MLOps retraining triggers typically fire on performance-metric decay, which is downstream of context drift. Under the paper’s epoch-limit framing, revalidation would be required on any context change such as a model swap, tool-set change, or data-environment shift, regardless of whether evaluation metrics have moved.
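The contrast between the two kinds of trigger can be sketched in a few lines. Both functions are hypothetical simplifications written for this comparison, not DO-330 procedures or any MLOps tool's API, and the tolerance, scores, and context keys are invented.

```python
def metric_trigger(baseline_score: float, current_score: float,
                   tolerance: float = 0.02) -> bool:
    """MLOps-style trigger: fire only when the metric decays past a tolerance."""
    return (baseline_score - current_score) > tolerance


def context_trigger(approved: dict[str, str], current: dict[str, str]) -> bool:
    """Epoch-limit-style trigger: fire on any change to the pinned context."""
    return approved != current


# Invented context pins: the model is swapped, everything else is unchanged.
approved = {"model": "model-a", "tools": "shell,search", "data": "staging"}
after_swap = {"model": "model-b", "tools": "shell,search", "data": "staging"}

# The evaluation score has not moved after the swap.
print(metric_trigger(0.91, 0.91))             # False
print(context_trigger(approved, after_swap))  # True
```

In this scenario the metric-based trigger stays silent while the context-based trigger fires, which is the case the epoch-limit framing is meant to catch.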

What would count as a proof surface for an AGENTS.md file in practice?

Concretely, a proof surface would pin the model version, dependency versions, and data sources the document was approved against, then capture an evaluation trace showing it held for those inputs. A changelog linking each document revision to an explicit revalidation decision, plus retained eval-suite logs keyed to those version pins, would approximate what DO-178C calls objective evidence at the document layer.
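A minimal sketch of such a record, assuming a changelog keyed by the document's content hash, might look like the following. The `ProofSurfaceEntry` schema, the field names, the file path, and all values are hypothetical. Only `hashlib.sha256` is a real library call.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofSurfaceEntry:
    """One changelog row tying a document revision to a revalidation decision."""
    document_sha256: str
    model_version: str
    dependency_versions: dict[str, str]
    data_sources: list[str]
    eval_log_path: str  # where the retained eval-suite log lives (hypothetical)
    decision: str       # "revalidated" or "rejected"


def fingerprint(document_text: str) -> str:
    """Content hash identifying one exact revision of the document."""
    return hashlib.sha256(document_text.encode("utf-8")).hexdigest()


def is_covered(document_text: str, changelog: list[ProofSurfaceEntry]) -> bool:
    """True if this exact revision has a recorded 'revalidated' decision."""
    digest = fingerprint(document_text)
    return any(
        e.document_sha256 == digest and e.decision == "revalidated"
        for e in changelog
    )


# Invented example: one approved revision of an AGENTS.md file.
agents_md = "Never run destructive commands without confirmation.\n"
changelog = [
    ProofSurfaceEntry(
        document_sha256=fingerprint(agents_md),
        model_version="model-a",
        dependency_versions={"toolkit": "1.4.0"},
        data_sources=["internal-docs"],
        eval_log_path="evals/run-001.jsonl",
        decision="revalidated",
    )
]

print(is_covered(agents_md, changelog))                          # True
print(is_covered(agents_md + "Allow rm -rf in /tmp.\n", changelog))  # False
```

Keying on a content hash means any edit to the document, however small, falls outside the recorded evidence until a new entry is added.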

Where does the document-layer argument break down?

The document-layer transfer preserves aviation’s structural form without carrying its safety guarantee. A governance prompt can satisfy traceability, context bounds, and evidence requirements while the model it constrains still misbehaves, because a static artifact cannot encode the stochastic system’s full behavioral envelope. The discipline audits the document; it does not bind the model’s output.

What is the near-term adoption risk under the EU AI Act?

The Act’s high-risk-system obligations apply from 2 August 2026, roughly five weeks out, yet neither the Act nor its current harmonized standards mandate per-document epoch limits or proof surfaces. If harmonized standards eventually adopt those document-layer requirements, the 37% structural-quality figure from the companion study suggests a substantial share of existing high-risk governance artifacts would need revision before conformity assessment rather than after.
