Three independent research threads in early 2026 converged on the same structural problem: agent skill registries run on developer attestation, and a survey of community skills1 documented exploitable vulnerabilities across published skills. Metere’s paper2 is the first to treat this as a formalization problem rather than an auditing one, proposing a four-level trust schema and a biconditional correctness criterion that redefines what “skill is correct” actually means.
The Trust Gap in Current Skill Ecosystems
The standard today is SKILL.md: a portable skill-description format that tells a runtime what a skill does, what parameters it accepts, and what outputs it produces. What it does not provide is any attestation mechanism. A skill author declares behavior; the runtime believes them. Metere2 concludes that SKILL.md lacks any verification or attestation layer, making it sufficient for local tooling but insufficient for human-in-the-loop (HITL) agent systems where skill side-effects may be irreversible.
Four Levels of Trust, Each Requiring Evidence
The schema in Metere2 defines four verification levels:
- Unverified: the skill has been loaded. No behavioral claims have been validated.
- Declared: preconditions and postconditions are stated in structured form. The runtime can read them; it cannot check them.
- Tested: automated tests exercise the skill against its declared specification. This is where most production skills can realistically land.
- Formal: postconditions are proven correct via mathematical verification. The practical ceiling for most community skills is tested; formal verification requires proofs that scale poorly with skill complexity.
The schema gates runtime permissions on verification level, not developer identity. A skill claiming to write files gets file-write permissions only if its verification level meets the runtime’s configured threshold for that permission class.
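The gating rule can be sketched in a few lines. This is a minimal illustration, not the paper's or enclawed's code: the `VerificationLevel` enum mirrors the four levels above, while the `THRESHOLDS` table, the permission-class names, and `is_permitted` are hypothetical names chosen for the example.

```python
from enum import IntEnum


class VerificationLevel(IntEnum):
    """The four levels of the schema, ordered from least to most evidence."""
    UNVERIFIED = 0
    DECLARED = 1
    TESTED = 2
    FORMAL = 3


# Hypothetical runtime policy: minimum level required per permission class.
# Permission classes absent from this table are never granted.
THRESHOLDS = {
    "read_file": VerificationLevel.DECLARED,
    "write_file": VerificationLevel.TESTED,
    "network": VerificationLevel.FORMAL,
}


def is_permitted(level: VerificationLevel, permission: str) -> bool:
    """Grant a permission only if the skill's level meets the configured threshold."""
    required = THRESHOLDS.get(permission)
    if required is None:
        return False  # unknown permission class: deny by default
    return level >= required
```

Note that the function takes no author or publisher argument: the decision depends only on the evidence level and the runtime's configured policy.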
Consider a write_file skill. At declared, its SKILL.md states postcondition: file at path P now contains data D. At tested, an automated suite verifies that claim against a range of sample inputs. At formal, a proof assistant has certified the correspondence holds for all valid inputs. Most open-source skill ecosystems currently operate at unverified or declared, with no enforcement mechanism to distinguish between them.
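What "tested" could look like for this example is sketched below. The `write_file` body is a stand-in, and `postcondition_holds` and `run_postcondition_suite` are hypothetical helpers; the source does not specify a test harness, so treat this as one possible shape under those assumptions.

```python
import tempfile
from pathlib import Path


def write_file(path: Path, data: bytes) -> None:
    """Stand-in for the skill under test (hypothetical implementation)."""
    path.write_bytes(data)


def postcondition_holds(path: Path, data: bytes) -> bool:
    """Declared postcondition: the file at path P now contains data D."""
    return path.is_file() and path.read_bytes() == data


def run_postcondition_suite(samples: list[bytes]) -> bool:
    """Run the skill on each sample input and check the postcondition afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        for i, data in enumerate(samples):
            target = Path(tmp) / f"sample_{i}.bin"
            write_file(target, data)
            if not postcondition_holds(target, data):
                return False
    return True
```

A passing suite is evidence for the sampled inputs only. Covering all valid inputs is what separates the formal level from the tested one.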
The Biconditional Correctness Criterion
Approval without reconciliation is incomplete assurance. Metere’s criterion2 requires that a skill’s observable side-effects match the approved-and-executed set in the audit log exactly, in both directions. The biconditional has two violation modes:
- A side-effect occurs that has no corresponding approved execution in the audit log (the skill did something not sanctioned).
- An approved execution has no corresponding side-effect (the declared action didn’t happen, which matters for auditors relying on postconditions).
This is financial reconciliation applied to agent actions. The audit log is the ledger; the side-effect trace is the bank statement. A correct skill reconciles perfectly. An incorrect one has unexplained debits.
The practical implication: agent runtimes that maintain approval logs but do not independently verify side-effects are satisfying half the criterion. Log the approvals, ignore the outcomes, and you have a record of intent with no confirmation of execution.
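The two-directional check reduces to a pair of set differences. The sketch below assumes side-effects and approved executions can be normalized to comparable, hashable records (here, tuples); that normalization, and the names `Reconciliation` and `reconcile`, are assumptions of the example rather than anything specified in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconciliation:
    unsanctioned: frozenset  # observed effects with no approved execution
    missing: frozenset       # approved executions with no observed effect

    @property
    def correct(self) -> bool:
        """The biconditional holds only when both differences are empty."""
        return not self.unsanctioned and not self.missing


def reconcile(approved_executed, observed) -> Reconciliation:
    """Compare the audit log's approved-and-executed set with the observed trace."""
    approved = frozenset(approved_executed)
    seen = frozenset(observed)
    return Reconciliation(unsanctioned=seen - approved, missing=approved - seen)


# Hypothetical records: one approved write never happened, one delete was never approved.
result = reconcile(
    approved_executed=[("write_file", "/tmp/a"), ("write_file", "/tmp/b")],
    observed=[("write_file", "/tmp/a"), ("delete_file", "/tmp/c")],
)
```

Plain sets ignore how many times an effect occurred. A runtime that needs to catch a duplicated write would compare multisets (for example `collections.Counter`) instead.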
The Empirical Case
According to Metere2, the reference implementation was evaluated against an adversarial ensemble and reported zero false positives and zero false negatives. Those results are self-reported from the author’s own implementation with no third-party reproduction yet. That is not unusual for a May 2026 arXiv submission, but it is relevant context.
The enclawed GitHub repository documents the reference implementation: a hard fork of OpenClaw with Bell-LaPadula classification labels, AES-256-GCM encryption, hash-chained audit logs, and 207 unit and adversarial penetration tests.3 It ships under an MIT license.
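For readers unfamiliar with hash-chained logs, the general technique looks like the sketch below: each entry commits to the hash of the previous one, so editing any earlier entry invalidates the chain. The source does not describe enclawed's entry format or hash function; SHA-256, the JSON encoding, and the names `append` and `verify` are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append an entry that commits to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log: list) -> bool:
    """Recompute every link; any edited payload or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this makes tampering detectable but not impossible: an attacker who can rewrite the whole log can recompute every hash, so the latest hash still has to be anchored somewhere the attacker cannot reach.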
Guidelines for Skill Security
The paper2 produces a set of normative guidelines. The most operationally significant:
- Deny by default. Skills receive no permissions unless explicitly granted at a verified level.
- No bypass switch. The trust schema cannot be overridden at runtime. A skill cannot request elevated permissions exceeding its verification level.
- Skills untrusted by default. A newly loaded skill starts at unverified regardless of its declared claims.
- Ed25519 module signing. Skills must be cryptographically signed; the runtime verifies signatures before execution. This closes the most direct supply-chain vector: a skill file modified after publishing.
The remaining guidelines address audit log integrity, classification propagation, and permission scoping. Together they constitute a security-by-design architecture for HITL runtimes, not a checklist applied after the fact.
Three Papers, One Quarter
Metere2 does not arrive alone. Xu and Yan’s survey1 audited community skill registries in February 2026 and documented vulnerabilities across the ecosystem, proposing its own T1–T4 gate framework. That framework is structurally similar to Metere’s four-level schema but was derived independently; the two should not be treated as interchangeable.
Li et al.4 published a threat taxonomy in April 2026 covering attack categories across scenarios and layers. The structural findings: no enforced data-instruction boundary in current skill systems, a single-approval persistent trust model (approve once, trusted forever), and no mandatory security review at any major skill marketplace.
Three independent signals, same quarter, same gap. The research community has characterized the problem with enough specificity that “we didn’t know” stops being a credible position for framework authors.
What This Means for Framework Authors
Skill registries shipping today as plugin trees with SKILL.md descriptors have no signed-postcondition layer, no capability gate model, and no audit reconciliation. By Metere’s schema, they operate at unverified regardless of how carefully their authors wrote the documentation.
Adding trust infrastructure retroactively is harder than building it in. A signed-postcondition layer requires agreement on a postcondition format, a signing authority or distributed signing scheme, and runtime enforcement that distinguishes verification levels. None of that is drop-in.
The formal verification tier requires mathematical proof of postcondition correctness and is probably out of reach for most community skill authors. The realistic production target is tested: structured postcondition declarations, automated test suites that verify them, and Ed25519 signatures over the complete skill artifact. Enclawed demonstrates the architecture is buildable.
The trust gap won’t close itself. The open question is whether the ecosystem converges on enforced verification levels before a high-profile supply-chain incident forces the issue.
Frequently Asked Questions
What does migrating from a SKILL.md-only registry to the enclawed model actually require?
Beyond Ed25519 signing, enclawed layers in a DLP scanner and Bell-LaPadula mandatory-access-control labels on every invocation — each skill’s output is classified and propagation rules restrict where derived data can flow downstream. Teams must define classification levels for their data domains and instrument side-effect capture at every integration point the runtime touches.
How do Metere’s four levels differ from Xu and Yan’s T1–T4 gates?
Metere’s schema is permission-gating: verification level determines which runtime capabilities a skill can exercise. Xu and Yan’s T1–T4 is a registration gate: trust tiers determine whether a skill gets listed in a marketplace at all. Both converge on deny-by-default but enforce at different points in the skill lifecycle — one at execution time, the other at publish time.
What cross-model generalization rates did the adversarial evaluation actually achieve?
The 27,000-run evaluation (27 cells × 200 seeds across multiple model architectures) reported 76.3% recall at 0.0% false-positive rate in the open-source configuration and 80.2% recall at 0.6% FPR in full deployment. The gap between the headline zero FP/FN and the 76–80% recall across unseen models means that between one in five and one in four adversarial skill behaviors would evade detection outside the evaluation ensemble.
Which attack scenarios does the Li et al. taxonomy flag that the trust schema does not directly mitigate?
Li et al.’s 17 scenarios span three attack layers, including injection through corrupted skill descriptions where no data-instruction boundary is enforced. Ed25519 signing prevents post-publication tampering but cannot stop a legitimately signed skill from embedding malicious payloads — the taxonomy identifies this as a structural gap, not a supply-chain one.
What happens when a verified skill calls an unverified downstream dependency?
Enclawed applies Bell-LaPadula classification propagation: if a tested-level skill invokes an unverified dependency, the result is downgraded to match the lowest trust tier in the call chain. This prevents a verified wrapper from laundering unverified behavior through transitive invocation, though the mechanism has not been independently reproduced outside the author’s own 207-test suite.
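The downgrade rule described in this answer amounts to taking the minimum tier over the call chain. The sketch below illustrates that rule only; `TrustTier` and `effective_tier` are hypothetical names, and it does not reproduce enclawed's actual propagation code.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    UNVERIFIED = 0
    DECLARED = 1
    TESTED = 2
    FORMAL = 3


def effective_tier(call_chain: list) -> TrustTier:
    """A result is only as trusted as the least-verified skill that contributed to it."""
    if not call_chain:
        return TrustTier.UNVERIFIED  # nothing vouches for the result
    return min(call_chain)


# A tested wrapper calling an unverified dependency yields an unverified result.
wrapped = effective_tier([TrustTier.TESTED, TrustTier.UNVERIFIED])
```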