GovernSpec Contractual Skills Make Agent Governance Auditable Before Runtime

Can a GovernSpec DSL replace bolt-on agent governance? The short answer from the paper itself: no, and the authors do not claim it can. What arXiv:2605.22634 actually proposes is a contract-first design framework that moves governance declarations into the skill definition before an agent ever runs. Runtime enforcement remains mandatory.

What contractual skills actually are

The GovernSpec framework, introduced by Ting Liu, reorganizes the skill definition file (a SKILL.md paired with YAML) into a structured contract. Each contract exposes nine elements at definition time: goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules. Every side effect, input constraint, and audit hook is declared before execution, not discovered after a runtime violation.
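
As a rough illustration of what definition-time declaration makes checkable, the Python sketch below tests a parsed contract for the nine elements. The key names (`input_boundaries`, `handoff_rules`, and so on) and the `refund_skill` example are assumptions made for illustration; the paper's actual YAML schema may name or nest them differently.

```python
# Hypothetical key names mirroring the nine elements listed above.
REQUIRED_ELEMENTS = (
    "goals",
    "input_boundaries",
    "permissions",
    "evidence_requirements",
    "output_contracts",
    "quality_criteria",
    "verification_steps",
    "human_approval_points",
    "handoff_rules",
)


def missing_elements(contract: dict) -> list:
    """Return the contract elements that are absent or empty."""
    return [name for name in REQUIRED_ELEMENTS if not contract.get(name)]


# An invented contract, as it might look after parsing the YAML.
refund_skill = {
    "goals": ["Resolve refund requests under policy"],
    "input_boundaries": {"max_amount": 500},
    "permissions": ["read_order", "issue_refund"],
    "evidence_requirements": ["order_id", "policy_clause"],
    "output_contracts": {"format": "json"},
    "quality_criteria": ["cites policy clause"],
    "verification_steps": ["amount <= max_amount"],
    "human_approval_points": [],  # empty, so it is reported as missing
    # "handoff_rules" is omitted on purpose
}

print(missing_elements(refund_skill))
# prints ['human_approval_points', 'handoff_rules']
```

Treating an empty value as missing is a design choice in this sketch. A team might instead require an explicit marker so that "no approval needed" is itself a declaration.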

This is not a new domain-specific language. The paper frames GovernSpec as a design convention layered on top of existing YAML and markdown artifacts. Any team writing agent skills in the current ecosystem could adopt the structure without switching tooling. Describing it as a “DSL” overstates what the framework is.

The experiment and the uncomfortable result

The evaluation covered three enterprise skills, 15 synthetic tasks, four instruction conditions, and eight generation models, producing 960 outputs and 1,680 cross-judge score records. Contractual skills beat no-skill and minimal-skill baselines across all eight models.

The uncomfortable finding is what happened when the authors compared contractual skills against plain expanded skills (definitions that include detailed instructions but lack the formal contract structure). There, the gains were small and mixed. Some models showed marginal improvement; others showed negligible difference. The paper’s own conclusion frames the result honestly, describing contractual skills as “a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.”

A separate tool-calling challenge reinforced the point. Across eight models and 192 simulated tool-call records, contractual skills usually reduced high-risk tool attempts. But model-to-model variance persisted. Some models complied with the contract; others ignored it at rates that would matter in production.

The formal-verification companion

A companion paper (arXiv:2605.23951) ships three composable formal-verification methods under the name “enclawed.” The three layers are static capability-containment analysis, refinement-type rejection of undeclared tool calls, and SMT-bounded model checking. The implementation is zero-dependency JavaScript with 53 unit tests.

The formal stack proves sound coverage with one acknowledged residual: it cannot force the LLM to act. The model can always refuse or produce no output, a correctness gap no type system can close. This is a useful result for teams building auditable agent pipelines because it draws a precise boundary around what formal methods can guarantee and what they cannot.
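
The rejection of undeclared tool calls, and the residual it leaves open, can be illustrated with a small Python sketch. This is not the enclawed API (which is JavaScript) and it does not use real refinement types; it is a hypothetical check, with invented names `ToolCall`, `UndeclaredToolCall`, and `check_calls`, that refuses any call to a tool the contract did not declare.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


class UndeclaredToolCall(Exception):
    """A call named a tool outside the contract's declared permissions."""


def check_calls(declared: set, calls: list) -> list:
    """Return the calls unchanged if all are declared; raise otherwise."""
    for call in calls:
        if call.tool not in declared:
            raise UndeclaredToolCall(call.tool)
    return calls


declared = {"read_order", "issue_refund"}  # assumed contract permissions

# A declared call passes the check.
check_calls(declared, [ToolCall("read_order", {"order_id": "A1"})])

# An undeclared call is rejected.
try:
    check_calls(declared, [ToolCall("delete_account")])
except UndeclaredToolCall as err:
    print("rejected:", err)

# The residual: a model that emits no calls at all also passes,
# because there is nothing to reject.
print(check_calls(declared, []))
```

The last line is the point of the sketch: an empty list of calls satisfies the check, so a check of this kind can say nothing about whether the skill did its job.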

Why runtime guardrails still matter

Evidence from a third paper, Trajel (arXiv:2605.24219), underscores the limits of definition-time governance. Trajel found that nearly half of hallucinated multi-agent trajectories involve multiple failure types simultaneously, and automated detectors with high binary accuracy still miss the subtlest categories.

This is the core tension in contract-first governance: the contract declares what the skill should do, but the agent executing it operates probabilistically. A well-specified SKILL.md does not prevent a model from hallucinating a tool call the contract never authorized. The formal-verification stack can reject the call after the fact, but only if the runtime layer surfaces it for checking. Remove the runtime guard and the contract is a document, not a constraint.
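
One way to picture the runtime layer is as the only path between the agent and its tools. The Python sketch below is a minimal, hypothetical guard (the class `GuardedDispatcher`, the tool names, and the contract values are all invented for illustration, not taken from any of the papers) that checks each call against declared permissions and approval points and records every decision.

```python
class GuardedDispatcher:
    """Hypothetical runtime guard: the only path from the agent to its tools."""

    def __init__(self, tools: dict, permissions: set, approval_required: set):
        self.tools = tools
        self.permissions = permissions
        self.approval_required = approval_required
        self.audit_log = []  # (tool_name, decision) tuples

    def dispatch(self, tool_name: str, approved: bool = False, **kwargs):
        if tool_name not in self.permissions:
            self.audit_log.append((tool_name, "denied: undeclared"))
            raise PermissionError(f"{tool_name} is not declared in the contract")
        if tool_name in self.approval_required and not approved:
            self.audit_log.append((tool_name, "denied: needs approval"))
            raise PermissionError(f"{tool_name} requires human approval")
        self.audit_log.append((tool_name, "allowed"))
        return self.tools[tool_name](**kwargs)


# Invented tools and contract values for illustration only.
tools = {
    "read_order": lambda order_id: {"order_id": order_id, "total": 40},
    "issue_refund": lambda order_id, amount: f"refunded {amount} on {order_id}",
    "delete_account": lambda user_id: f"deleted {user_id}",
}
guard = GuardedDispatcher(
    tools,
    permissions={"read_order", "issue_refund"},
    approval_required={"issue_refund"},
)

guard.dispatch("read_order", order_id="A1")
attempts = [
    ("delete_account", {"user_id": "u9"}),
    ("issue_refund", {"order_id": "A1", "amount": 40}),
]
for name, kwargs in attempts:
    try:
        guard.dispatch(name, **kwargs)
    except PermissionError as err:
        print(err)

print(guard.audit_log)
```

If the agent could reach `tools` directly, none of these checks would run, which is the sense in which the contract without a guard is only a document.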

What this means for audit teams and platform vendors

The practical value of GovernSpec lies in auditability, not safety replacement. A skill that ships with a declared contract gives compliance teams a machine-checkable artifact: here are the declared inputs, the permitted side effects, the required human approval points. Auditors can verify the contract matches the deployment without reading runtime logs retroactively.
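
A check of this kind could look like the Python sketch below, which compares a contract's declarations with a deployment configuration. The deployment keys (`granted_tools`, `approval_gates`) and the contract keys are hypothetical; neither paper specifies such a format.

```python
def audit_deployment(contract: dict, deployment: dict) -> dict:
    """Report where a deployment diverges from the declared contract."""
    declared = set(contract.get("permissions", []))
    granted = set(deployment.get("granted_tools", []))
    declared_gates = set(contract.get("human_approval_points", []))
    enforced_gates = set(deployment.get("approval_gates", []))
    return {
        # Tools the deployment exposes that the contract never declared.
        "granted_but_undeclared": sorted(granted - declared),
        # Tools the contract expects that the deployment does not provide.
        "declared_but_not_granted": sorted(declared - granted),
        # Approval points the contract requires but the deployment skips.
        "approval_gates_not_enforced": sorted(declared_gates - enforced_gates),
    }


# Invented example values.
contract = {
    "permissions": ["read_order", "issue_refund"],
    "human_approval_points": ["issue_refund"],
}
deployment = {
    "granted_tools": ["read_order", "issue_refund", "send_email"],
    "approval_gates": [],
}

for finding, items in audit_deployment(contract, deployment).items():
    print(finding, items)
```

An empty list under every heading would mean the deployment grants exactly what the contract declares; it would not show that the model behaves accordingly.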

For platform vendors building agent orchestration layers, the framework suggests a standard: every published skill in a marketplace or template library should carry a governance contract. Skills without one give auditors no definition-time artifact to check. Whether any specific template library adopts this convention is speculative; the paper names no platforms. But the direction of pressure is clear. Enterprise buyers already ask for SOC 2 and ISO 42001 artifacts. A governance contract per skill is a natural extension of that checklist.

Contractual skills and runtime guardrails are complements, not substitutes. GovernSpec makes the agent’s intent machine-readable. The runtime enforces what the model actually does. Neither is sufficient alone, and the paper’s own evidence says so directly.

Frequently Asked Questions

What gap does contract-first governance fill that runtime guardrails alone do not?

Runtime guardrail middleware intercepts agent actions during execution but has no visibility into whether the skill’s declared intent matches the deployment’s compliance requirements. GovernSpec fills the design-time gap: before an agent runs, the contract exposes whether the skill declares the right input boundaries, side effects, and approval gates. Most current governance tooling focuses on the runtime side, leaving the pre-deployment audit surface unchecked.

What failure modes survive both contractual skills and runtime enforcement?

Trajel’s evidence on compound trajectory failures means that even a combined contract-plus-runtime stack faces cases where multiple simultaneous errors interact in ways no single layer catches. The enclawed formal-verification stack also acknowledges one residual it cannot close: it cannot compel the model to act. Silent refusals and null outputs are neither contract violations nor runtime infractions, yet they represent a real failure mode in production pipelines.

Can the enclawed verification modules be adopted independently of GovernSpec?

The three enclawed layers ship as independent, composable, zero-dependency JavaScript modules. A team could apply just the refinement-type checker to reject undeclared tool calls without adopting the GovernSpec YAML convention. The tradeoff: without a contract definition to validate against, the checker degrades into a generic tool-call allowlist rather than a capability-boundary enforcer.

Do contractual skills address multi-agent handoff failures?

The GovernSpec contract includes handoff rules as one of its nine elements, but the paper’s evaluation tested single-skill generation tasks, not multi-agent handoff chains. Trajel’s findings on compound trajectory failures in multi-agent workflows suggest that handoff boundaries are exactly where definition-time contracts are weakest: each contract governs a single skill’s behavior, not the interaction surface between two independently contracted agents.

What happens to a GovernSpec contract when the underlying model is swapped?

The model-to-model variance in the tool-calling challenge means a contract verified against one model may produce different violation patterns against another. Because the contract declares intent and is not tied to any particular model, teams that swap or upgrade models need to re-run the enclawed verification stack rather than assume compliance transfers. The paper’s 960-output experiment across eight models provides a baseline for how much compliance variance to expect across model changes.
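
A re-check after a model swap could be as simple as recomputing per-model violation rates from logged tool-call attempts, as in the Python sketch below. The record format, the model labels, and the sample records are placeholders invented for illustration; they are not data from the paper.

```python
from collections import defaultdict


def violation_rates(records: list, permissions: set) -> dict:
    """Share of attempted tool calls per model that fall outside the contract."""
    totals = defaultdict(int)
    violations = defaultdict(int)
    for model, tool in records:
        totals[model] += 1
        if tool not in permissions:
            violations[model] += 1
    return {model: violations[model] / totals[model] for model in totals}


# Placeholder records: (model label, attempted tool). Not real results.
records = [
    ("model_a", "read_order"),
    ("model_a", "delete_account"),
    ("model_b", "read_order"),
    ("model_b", "issue_refund"),
]
permissions = {"read_order", "issue_refund"}

print(violation_rates(records, permissions))
# prints {'model_a': 0.5, 'model_b': 0.0}
```

Comparing the rate for the new model against the old one on the same task set gives a concrete signal for whether the swap changed compliance.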