Pod-Level Remote Attestation in Kubernetes: Confidential Workloads on dstack

dstack-capsule binds pod identity into Intel TDX hardware quotes, enabling multi-pod confidential VMs without the per-VM density tax of Confidential Containers.

Confidential computing on Kubernetes has a granularity problem. Current Confidential Containers (CoCo) deployments attest the virtual machine, not the workload inside it, and they incur prohibitive per-VM resource overhead for the privilege. A preprint from researchers at OPPO and Phala proposes a different trade: share a single Intel TDX confidential VM across multiple pods, but bind each pod’s identity directly into hardware-signed attestation quotes.

The Granularity Problem

As of 2026, Confidential Containers, the CNCF-incubating project built on Kata Containers, works by launching each pod inside its own microVM with confidential-computing flags set. The VM’s measured boot chain proves the guest OS booted correctly. What it does not prove is which container image is running, which pod spec was admitted, or what workload identity the pod claims. The attestation boundary stops at the hypervisor-facing VM metadata, which is exactly the boundary a compromised kubelet or scheduler can tamper with.

On the resource side, each Kata microVM reserves approximately 2 GB of dedicated memory per pod, so a 64 GB host tops out at roughly 30 confidential pods. That makes density the binding constraint for multi-tenant clusters running confidential LLM inference or regulated data processing.

As of 2026, managed cloud offerings from Azure, GKE, and AWS have stable interfaces for confidential containers, but they inherit the same one-pod-per-VM model. Standard runc containers remain unprotected even when SEV-SNP or TDX hardware is available on the node.

Two-Layer Attestation in dstack-capsule

The dstack-capsule paper splits attestation into two layers that run on a single shared TDX confidential VM.

Static layer. The platform’s boot measurements are frozen into RTMR[3], one of the TDX runtime measurement registers. This covers the guest kernel, the initrd, and the dm-verity-protected OS image. Once frozen, these values cannot change for the lifetime of the VM.

Dynamic layer. On every attestation request, each pod’s identity is packed into the TDX Quote’s 64-byte report_data field, signed by the CPU. The field contains the pod_uid, a pod_spec_hash, and a workload_id. A relying party can verify not just that a confidential VM is running, but that a specific pod with a specific spec digest is running inside it.
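Three identity fields have to fit into 64 bytes, so some fixed-size encoding is needed. The article does not give the paper's exact layout, so the following is only a minimal sketch under one assumption: the fields are folded into a SHA-512 digest, which happens to be exactly 64 bytes. The function names `pod_spec_hash` and `build_report_data`, the JSON canonicalization, and the length-prefix scheme are all hypothetical, not taken from dstack-capsule.

```python
import hashlib
import json


def pod_spec_hash(pod_spec: dict) -> str:
    """Hash a pod spec deterministically (assumed canonical form: sorted-key JSON)."""
    canonical = json.dumps(pod_spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_report_data(pod_uid: str, spec_hash: str, workload_id: str) -> bytes:
    """Fold the three identity fields into 64 bytes for report_data.

    Assumed layout (not from the paper): SHA-512 over length-prefixed fields.
    The length prefixes keep ("ab", "c") and ("a", "bc") from colliding.
    """
    h = hashlib.sha512()
    for field in (pod_uid, spec_hash, workload_id):
        raw = field.encode("utf-8")
        h.update(len(raw).to_bytes(4, "big"))
        h.update(raw)
    return h.digest()  # a SHA-512 digest is exactly 64 bytes


spec = {"containers": [{"name": "llm", "image": "example/llm:1"}]}
report_data = build_report_data("pod-uid-123", pod_spec_hash(spec), "inference")
print(len(report_data))
```

Hashing rather than concatenating means the verifier must be told the three field values out of band and recompute the digest, which is a common pattern when evidence has to be bound into a small fixed-size field.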

The key difference from CoCo’s approach: because TDX quotes are per-request and per-pod, multiple pods can coexist in the same VM without diluting the attestation signal. Each pod gets its own hardware-backed proof of identity, and a co-resident pod cannot forge that proof because it cannot write to another pod’s report_data field.

The Privilege Fuse

dstack-capsule introduces a mechanism the authors call the privilege fuse: an irreversible, atomic state transition that moves a Kubernetes node from a privileged setup phase to a locked-down runtime phase. The implementation uses a compare-and-swap operation plus a persistent marker file.
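The one-way transition can be sketched in a few lines. This is not the authors' implementation: Python has no hardware compare-and-swap, so a lock stands in for it here, and the marker file is created with `O_CREAT | O_EXCL` so that only one creator can ever succeed. The class name `PrivilegeFuse` and the marker path are hypothetical.

```python
import os
import threading


class PrivilegeFuse:
    """One-way switch from setup phase to locked phase. There is no reset method."""

    def __init__(self, marker_path: str):
        self._marker_path = marker_path
        self._lock = threading.Lock()
        # A marker left by an earlier run means the fuse is already blown.
        self._blown = os.path.exists(marker_path)

    def is_blown(self) -> bool:
        return self._blown

    def blow(self) -> bool:
        """Blow the fuse. Returns True only for the caller that made the transition."""
        with self._lock:  # stand-in for an atomic compare-and-swap on the state
            if self._blown:
                return False
            try:
                # O_EXCL makes creation atomic: exactly one creator succeeds.
                fd = os.open(
                    self._marker_path,
                    os.O_CREAT | os.O_EXCL | os.O_WRONLY,
                    0o600,
                )
            except FileExistsError:
                self._blown = True
                return False
            with os.fdopen(fd, "w") as marker:
                marker.write("blown\n")
                marker.flush()
                os.fsync(marker.fileno())  # persist before reporting success
            self._blown = True
            return True
```

The design choice worth noting is the absence of any method that clears the state: the only way back to the setup phase in this sketch is to discard the storage that holds the marker, which mirrors the reprovisioning requirement described in this section.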

Before the fuse is blown, privileged pods (those requesting hostNetwork, hostPID, hostIPC, hostPath, or the privileged security context flag) can be scheduled. After the fuse is blown, the admission controller rejects all such pods and RTMR[3] is frozen. There is no unfuse operation. That makes the admission controller the enforcement point, and admission policy is notoriously easy to get subtly wrong: syntactically valid manifests that quietly violate cluster intent are a known failure class. Here the cost of a missed rule is a permanent privilege window rather than a misconfigured deployment.
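A minimal sketch of the post-fuse admission rule follows, operating on a pod manifest already parsed into a dict. The field names (`hostNetwork`, `hostPID`, `hostIPC`, `volumes[].hostPath`, `securityContext.privileged`) are standard Kubernetes pod spec fields; the functions `privileged_reasons` and `admit` are hypothetical and are not the paper's admission controller.

```python
def privileged_reasons(pod: dict) -> list:
    """Return the reasons a pod manifest counts as privileged (empty if none)."""
    spec = pod.get("spec") or {}
    reasons = [
        flag for flag in ("hostNetwork", "hostPID", "hostIPC")
        if spec.get(flag) is True
    ]
    if any("hostPath" in volume for volume in (spec.get("volumes") or [])):
        reasons.append("hostPath")
    # Check every container list, not just the main one.
    containers = (
        (spec.get("containers") or [])
        + (spec.get("initContainers") or [])
        + (spec.get("ephemeralContainers") or [])
    )
    for container in containers:
        if (container.get("securityContext") or {}).get("privileged") is True:
            reasons.append("privileged:" + container.get("name", "?"))
    return reasons


def admit(pod: dict, fuse_blown: bool) -> tuple:
    """Allow privileged pods only while the fuse is intact."""
    reasons = privileged_reasons(pod)
    if fuse_blown and reasons:
        return False, "rejected after fuse-blow: " + ", ".join(reasons)
    return True, "allowed"


setup_pod = {"spec": {"hostNetwork": True, "containers": [{"name": "cni"}]}}
print(admit(setup_pod, fuse_blown=False))
print(admit(setup_pod, fuse_blown=True))
```

The loop over `initContainers` and `ephemeralContainers` illustrates the kind of rule that is easy to miss: a check that only inspects `containers` would accept a manifest whose privileged flag sits on an init container.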

This is a design pattern worth naming because it solves a real operational problem. Confidential VMs need a setup phase where the node joins the cluster, pulls images, and configures networking. Without a fuse-like mechanism, that setup phase is a permanent privilege window. With it, the window closes once and cannot be reopened without reprovisioning the entire VM.

The operational cost is real: a misconfigured fuse locks you out of node maintenance until you reprovision. Platform teams running dstack-capsule would need to treat node lifecycle as immutable after fuse-blow, similar to how Flatcar Container Linux handles updates via partition swaps rather than in-place mutation.

The Multi-Layer Sandbox

dstack-capsule does not rely on attestation alone. The implementation, built on Kubernetes 1.32 with Intel TDX and Sysbox, layers several containment mechanisms on top of it.

The choice of Sysbox over Kata is what enables the shared-VM model. Sysbox provides user-namespace-based isolation at container granularity without launching a separate VM per pod, which is why the memory footprint stays closer to a standard Kubernetes deployment than to CoCo’s dedicated VM per pod.

dstack the Framework, dstack-capsule the Extension

It is worth separating two things that share a name, because the June announcement blurs them. dstack is a shipping open-source confidential-computing runtime, a Linux Foundation Confidential Computing Consortium project maintained largely by Phala, with a published third-party security review. Its model is deliberately coarse: a workload runs as a docker-compose.yaml inside a full confidential VM, disk state and network traffic encrypted by default, and the whole VM is the attestation unit. It runs on Intel TDX (4th and 5th-gen Xeon) and AMD SEV-SNP hosts, and chains in NVIDIA Confidential Computing on H100 and later GPUs for accelerated workloads.

dstack-capsule is the research extension described in the OPPO and Phala paper, not a released product. It keeps the full-VM TDX boundary but subdivides it: multiple pods share one confidential VM, and each pod carries its own hardware-signed identity through report_data. The distinction matters for anyone reading the press. The framework you can deploy today attests at VM granularity, the same coarse boundary as CoCo. Pod granularity is the part that is still a paper, and the table below compares that paper, not the shipping runtime.

CoCo vs. dstack-capsule

| Dimension | Confidential Containers (CoCo) | dstack-capsule |
| --- | --- | --- |
| Isolation unit | One pod per Kata microVM | Multiple pods per TDX VM with Sysbox |
| Memory per pod | Dedicated VM per pod with prohibitive overhead | Shared VM pool, no per-pod VM overhead |
| Attestation target | Guest OS boot measurements | Pod UID + spec hash + workload ID in TDX Quote |
| Attestation granularity | VM-level | Pod-level |
| Maturity | CNCF incubating; Azure, GKE, AWS offerings | arXiv preprint, research prototype |
| Hardware | AMD SEV-SNP, Intel TDX | Intel TDX only |

The density advantage is straightforward: dstack-capsule’s shared-VM model should accommodate more pods per node than CoCo because each pod is a Sysbox container, not a full VM. The exact density multiplier depends on workload memory requirements, which the preprint does not benchmark in production-scale deployments.

Confidential LLM Inference Needs the GPU Attested Too

The paper’s recurring example is confidential LLM inference, but the pod-level scheme attests CPU state. report_data lives in a TDX Quote signed by the CPU, and RTMR[3] covers the guest boot chain. Neither says anything about what runs on an attached accelerator. Model weights and activations that touch GPU memory sit outside that boundary unless the GPU is itself a TEE.

That is a solved problem in principle and a moving target in practice. NVIDIA Confidential Computing on H100 and Blackwell encrypts GPU memory and produces its own attestation report, which a relying party has to verify alongside the CPU quote. The production dstack framework already integrates NVIDIA CC; the capsule paper does not extend pod-level identity into the GPU’s attestation report. A deployment that wants per-pod proof and confidential GPU compute therefore has to stitch two attestation chains together and trust the binding between them. For inference on a shared accelerator that binding is exactly where a careful review belongs, because a pod-level CPU proof tells you nothing about which pod’s tensors a co-resident workload can read off the card.

There is also a separate question the hardware quote does not answer: attesting where a model ran is not the same as attesting which model ran. A valid TDX quote proves a specific pod spec executed in a genuine TEE, not that the bytes it served match the model you approved. Bit-exact inference verification attacks the second problem, and a regulated confidential-inference stack arguably needs both.

What This Means for Multi-Tenant Clusters

The second-order effect of pod-level attestation lands on secrets injection and compromised-node containment.

In current CoCo deployments, secrets are typically delivered to the VM during boot, before the workload starts. The attestation verifies the VM booted correctly, then the secrets are released. But if attestation only covers the VM and not the pod, a compromised kubelet could schedule a different pod on the same node and the VM-level attestation would still pass. The secrets would be available to the wrong workload.

dstack-capsule’s model binds the pod spec hash into the CPU-signed quote. A relying party (a KMS running in an independent TEE) can verify that the exact pod spec it approved is the one requesting the secret. A co-resident pod with a different spec hash cannot impersonate the attested pod because it cannot produce a valid TDX quote with the target’s pod_uid and pod_spec_hash.
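The relying-party side of this check can be sketched as a release decision. Everything here is an assumption for illustration: the pod is assumed to present its claimed identity fields alongside the quote, the binding is assumed to be a SHA-512 digest over length-prefixed fields, and quote signature verification is reduced to a boolean input because it depends on Intel's verification tooling, which is not shown. The names `bind_claims` and `decide_release` are hypothetical.

```python
import hashlib
import hmac


def bind_claims(pod_uid: str, spec_hash: str, workload_id: str) -> bytes:
    """Assumed binding: SHA-512 over length-prefixed fields (64 bytes)."""
    h = hashlib.sha512()
    for field in (pod_uid, spec_hash, workload_id):
        raw = field.encode("utf-8")
        h.update(len(raw).to_bytes(4, "big") + raw)
    return h.digest()


def decide_release(claims: dict, quoted_report_data: bytes,
                   quote_signature_ok: bool, approved_spec_hashes: set) -> tuple:
    """Release a secret only if the claims match the quote and the spec is approved."""
    if not quote_signature_ok:
        return False, "quote signature or platform measurements not verified"
    expected = bind_claims(
        claims["pod_uid"], claims["pod_spec_hash"], claims["workload_id"]
    )
    # Constant-time comparison of the recomputed digest against the quote.
    if not hmac.compare_digest(expected, quoted_report_data):
        return False, "claims do not match report_data in the quote"
    if claims["pod_spec_hash"] not in approved_spec_hashes:
        return False, "pod spec hash is not on the approved list"
    return True, "release"


claims = {"pod_uid": "uid-1", "pod_spec_hash": "abc123", "workload_id": "inference"}
quote_rd = bind_claims("uid-1", "abc123", "inference")
print(decide_release(claims, quote_rd, True, {"abc123"}))
```

In this sketch a co-resident pod that copies the approved claims fails the second check, because the quote it can obtain carries a digest of its own identity rather than the target's.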

The threat model assumes simultaneous collusion between the cloud platform operator (who controls the hypervisor, host OS, and Kubernetes control plane) and the pod developer (who may try to extract user data or escalate privileges). Trust is placed in the Intel TDX hardware and microcode, the dm-verity-protected OS image, and the independently-attested KMS. If any of those layers breaks, the model breaks with it.

For platform teams evaluating confidential Kubernetes, the question is not whether dstack-capsule is production-ready today (it is not). The question is whether pod-level attestation becomes a requirement as regulated workloads move onto shared infrastructure. If it does, the two-layer architecture, with static platform measurements and dynamic pod identity, is a plausible design for getting there without the density tax of one VM per pod.

Frequently Asked Questions

Does dstack-capsule work on AMD SEV-SNP or ARM CCA hardware?

No. The implementation relies on Intel TDX-specific primitives, including RTMR registers and the 64-byte report_data field in TDX Quotes, with guest support that merged in Linux 5.19. Porting to AMD SEV-SNP or ARM CCA would require mapping those primitives to each platform’s attestation structures. CoCo’s vendor-neutral Kata abstraction is what gives it multi-platform support across SEV-SNP and TDX, which dstack-capsule trades away for pod-level attestation depth.

How many confidential pods can a single node run under CoCo versus dstack-capsule?

CoCo’s Kata microVMs require approximately 2 GB of dedicated memory per pod, limiting a 64 GB host to roughly 30 confidential pods before memory exhaustion. dstack-capsule’s Sysbox-container model shares a single TDX VM, so per-pod memory overhead is determined by the workload itself rather than a fixed VM allocation. The preprint provides no production-scale density benchmarks, so the actual improvement ratio remains unvalidated.
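The CoCo figure is plain division, shown below as a small sketch. The 64 GB host and 2 GB per-VM figures come from the answer above; the `reserved_gb` parameter (memory held back for the host itself) and the function name are assumptions added to show how "roughly 30" can fall out of the arithmetic.

```python
def max_pods_fixed_vm(host_gb: float, per_vm_gb: float, reserved_gb: float = 0.0) -> int:
    """Pods that fit when every pod needs its own fixed-size VM allocation."""
    return int((host_gb - reserved_gb) // per_vm_gb)


# Upper bound with no memory reserved for the host.
print(max_pods_fixed_vm(64, 2))                 # 32
# Assuming 4 GB is held back for the host (hypothetical figure).
print(max_pods_fixed_vm(64, 2, reserved_gb=4))  # 30
```

No equivalent formula is given here for dstack-capsule, because its per-pod cost depends on the workload and the preprint reports no production-scale density numbers.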

What attack vectors does the dstack-capsule threat model explicitly exclude?

Side-channel attacks (cache timing, power analysis) and Intel microcode vulnerabilities are explicitly out of scope. The trust chain depends on TDX hardware integrity, a dm-verity-protected OS image, and an independently-attested KMS running in a separate TEE. A microcode-level compromise bypasses all three layers. The codebase itself, approximately 7,700 lines of Rust and 660 lines of Go, has no production deployment history.

How does the Springer 2025 pod-integrity approach differ in its trust root from dstack-capsule?

The Springer paper uses TPM hardware roots of trust to verify node-level integrity and detect unauthorized pod modifications. dstack-capsule embeds pod identity directly into CPU-signed TDX Quotes rather than measuring through a TPM. The practical implication: TPM-based approaches run on most server hardware shipped since 2016, while dstack-capsule requires Intel TDX-capable processors (4th-gen Xeon Sapphire Rapids and later), restricting the eligible node pool.

What happens to co-resident pods if a TDX microcode vulnerability is disclosed?

A quote-integrity break would collapse the entire attestation chain, because pod identity, privilege fuse state, and secret-release decisions all depend on TDX Quotes being trustworthy. Both CoCo and dstack-capsule would lose attestation guarantees, but CoCo’s Kata VM boundary provides isolation that does not depend on attestation correctness. dstack-capsule’s shared-VM model puts a weaker isolation boundary between pods (Sysbox containers rather than separate VMs), so a TDX break would expose co-resident workloads to each other more directly than CoCo’s per-pod microVMs would.

Sources (6 cited)

  1. Extending Kubernetes for Pods Integrity Verification. link.springer.com (analysis). Accessed 2026-06-06.
  2. dstack: Open-Source TEE Runtime for Confidential AI. phala.com (vendor). Accessed 2026-06-26.