Can an Attacker Steal Your Model's Last Layer From Its Outputs?

Yes. If the API exposes full logits or high-precision logprobs, an attacker can recover the projection matrix of a deployed transformer up to a rotation, using only query outputs. That result, demonstrated empirically by Carlini et al. (2024) for under $20 on OpenAI’s Ada model, now has a complete geometric proof. A new paper by Snigdha Chandan Khilar, submitted June 5, 2026, characterizes exactly what the attack recovers, why it is closed-form, and where it stops.

Why the Last Layer Is the Easy Target

The transformer’s last layer is a soft target because it is almost entirely linear: a normalization step followed by a linear projection from the hidden dimension up to the vocabulary. Stack enough query responses into a matrix, and the singular value decomposition of that matrix reveals the hidden dimension directly. On a toy model with hidden dimension h=64, Khilar reports that the singular spectrum drops by 14 orders of magnitude at index 64, so the rank can be read off without ambiguity.
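The rank-recovery step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the matrix W, the Gaussian hidden states, and the vocabulary and query counts are all hypothetical stand-ins, and only h=64 comes from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
h, vocab, n_queries = 64, 512, 256  # toy sizes; only h=64 is from the article

# Hypothetical stand-in for the deployed model's last layer.
W = rng.standard_normal((vocab, h))           # projection (unembedding) matrix
hidden = rng.standard_normal((h, n_queries))  # final hidden states, one column per query

# What the attacker observes: one full logit vector per query.
logits = W @ hidden                           # shape (vocab, n_queries)

# Singular values of the stacked outputs, largest first.
s = np.linalg.svd(logits, compute_uv=False)

# The largest ratio between consecutive singular values marks the rank.
ratios = s[:-1] / np.maximum(s[1:], np.finfo(float).tiny)
estimated_h = int(np.argmax(ratios)) + 1
gap_at_h = s[h - 1] / max(s[h], np.finfo(float).tiny)

print("estimated hidden dimension:", estimated_h)
print("spectrum gap at index h:", gap_at_h)
```

In double precision the singular values past index h are pure rounding error, which is why the gap is so large on a noise-free toy model. Real API outputs are rounded and possibly noised, so the gap an attacker sees in practice would be smaller.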

Once the rank is known, the normalization layer constrains the geometry further. Layer normalization forces the logit vectors onto an ellipsoidal surface. Fitting that ellipsoid from query outputs narrows the projection-matrix recovery from “any invertible matrix” to “any rotation.” That residual rotation is genuinely unrecoverable from outputs alone, but it is also the only thing left unknown. The paper frames this using the language of exterior differential systems: the attainable logits are the integral variety of an ideal with one linear generator (the rank constraint) and one quadratic generator (the ellipsoid from normalization).
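The ellipsoid constraint can be checked numerically. The sketch below assumes a LayerNorm with no learned gain or bias and eps = 0, which is a simplification of real models, and it uses the true W only to exhibit the constraint; an attacker would have to fit the quadratic form from outputs instead. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
h, vocab, n = 8, 32, 1000  # toy sizes, chosen only for illustration

def layer_norm(x):
    # LayerNorm over the hidden axis, without learned gain/bias and with
    # eps = 0 (simplifying assumptions, not a real model's configuration).
    centered = x - x.mean(axis=0, keepdims=True)
    return centered / centered.std(axis=0, keepdims=True)

W = rng.standard_normal((vocab, h))
# Pre-normalization states at wildly different scales, one column per query.
raw = rng.standard_normal((h, n)) * rng.uniform(0.1, 10.0, size=n)
g = layer_norm(raw)
logits = W @ g

# Every normalized state has squared norm h, whatever its original scale.
sq_norms = (g ** 2).sum(axis=0)

# Pull the constraint back to logit space: x^T M x = h,
# with M = pinv(W)^T pinv(W). This is the quadratic (ellipsoid) generator.
W_pinv = np.linalg.pinv(W)
M = W_pinv.T @ W_pinv
quad = np.einsum("in,ij,jn->n", logits, M, logits)

print(sq_norms.min(), sq_norms.max())
print(quad.min(), quad.max())
```

Both printed ranges collapse to the single value h, up to rounding: the normalized states sit on a sphere, and the linear map W carries that sphere to an ellipsoid inside the h-dimensional subspace the logits occupy.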

What the Geometric Framing Adds

The Carlini et al. attack worked. What the new paper provides is the proof of why it works and a precise boundary on what it can reach.

By recasting the problem in terms of polar spaces and Frobenius-integrable systems, the paper shows that the extraction is not approximate or heuristic. It is closed-form. The SVD step recovers the column space of the projection matrix. The ellipsoid-fitting step recovers the matrix up to orthogonal gauge freedom. There is no remaining parameter that more queries or better optimization could extract, because the rotation ambiguity is a structural property of the observation map, not a gap in technique.
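A small numerical check makes the two claims concrete: the SVD recovers the column space, and an orthogonal change of basis is invisible from outputs. This is a sketch under simplifying assumptions (normalized states modeled as points on a sphere, hypothetical W and sizes), not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
h, vocab, n = 16, 128, 200  # toy sizes, chosen only for illustration

W = rng.standard_normal((vocab, h))
g = rng.standard_normal((h, n))
g *= np.sqrt(h) / np.linalg.norm(g, axis=0)  # states on the sphere of radius sqrt(h)
logits = W @ g

# Step 1: the top-h left singular vectors of the outputs span col(W).
U, s, _ = np.linalg.svd(logits, full_matrices=False)
U_h = U[:, :h]
residual = np.linalg.norm(W - U_h @ (U_h.T @ W))  # ~0 if col(W) is captured

# Step 2: an orthogonal Q changes W but neither the logits nor the norms.
Q, _ = np.linalg.qr(rng.standard_normal((h, h)))
q_same_logits = np.allclose((W @ Q) @ (Q.T @ g), logits)
q_same_norms = np.allclose(np.linalg.norm(Q.T @ g, axis=0), np.sqrt(h))

# A generic invertible G also reproduces the logits, but the transformed
# states leave the sphere, so the normalization constraint rules it out.
G = rng.standard_normal((h, h))
g_alt = np.linalg.solve(G, g)
g_same_logits = np.allclose((W @ G) @ g_alt, logits)
g_same_norms = np.allclose(np.linalg.norm(g_alt, axis=0), np.sqrt(h))

print(residual, q_same_logits, q_same_norms, g_same_logits, g_same_norms)
```

The rank constraint alone leaves any invertible G as a valid explanation of the outputs. Adding the norm constraint eliminates every G that is not orthogonal, which is the reduction from "any invertible matrix" to "any rotation" described above.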

The Identifiability Wall: Why Deeper Layers Stay Safe

This is the paper’s strongest result. Khilar proves that beneath the last layer, most sublayer parameters live in a non-identifiable fiber: different nonlinear sublayers, and even architectures with different widths, can produce bit-identical outputs. This is not a limitation of the attack technique. It is a property of the observation map itself.
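Two textbook symmetries of a ReLU network give a feel for what a non-identifiable fiber looks like. These are familiar examples chosen for illustration, not the construction used in the paper's proof, and the small MLP here is a hypothetical stand-in for a transformer sublayer.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, width, d_out, n = 10, 32, 5, 100  # toy sizes

def mlp(x, W1, b1, W2):
    # One hidden ReLU layer; x holds one input per column.
    return W2 @ np.maximum(W1 @ x + b1[:, None], 0.0)

W1 = rng.standard_normal((width, d_in))
b1 = rng.standard_normal(width)
W2 = rng.standard_normal((d_out, width))
x = rng.standard_normal((d_in, n))
y = mlp(x, W1, b1, W2)

# Same width, different parameters: permute hidden units and rescale each
# one by a positive factor, undoing the scale in the outgoing weights.
perm = rng.permutation(width)
scale = rng.uniform(0.5, 2.0, size=width)
W1_p = (W1 * scale[:, None])[perm]
b1_p = (b1 * scale)[perm]
W2_p = (W2 / scale)[:, perm]
y_perm = mlp(x, W1_p, b1_p, W2_p)

# Different width: append hidden units whose outgoing weights are zero.
extra = 8
W1_w = np.vstack([W1, rng.standard_normal((extra, d_in))])
b1_w = np.concatenate([b1, rng.standard_normal(extra)])
W2_w = np.hstack([W2, np.zeros((d_out, extra))])
y_wide = mlp(x, W1_w, b1_w, W2_w)

print(np.allclose(y_perm, y), np.allclose(y_wide, y), W1_w.shape[0])
```

The three parameter sets differ, and one of them has a different width, yet no set of queries can tell them apart. The comparison uses np.allclose because floating-point summation order can differ between the variants; the functions themselves are mathematically identical.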

The consequence is specific and narrow. Query-based logit extraction cannot be extended past one layer. The parameters of the transformer’s body, its attention heads, feed-forward blocks, and intermediate representations, are provably out of reach for this class of attack. The paper states this explicitly: the identifiability wall is a consequence of the mathematical structure, not an engineering obstacle that a smarter algorithm could bypass.

This distinguishes last-layer extraction from other threat models. Side-channel attacks, gradient access, or direct weight theft operate under different assumptions and are not constrained by this boundary. The wall applies only to what an adversary can learn from observing outputs through an API.

Cost of Extraction in Practice

The dollar amounts are concrete. Carlini et al. extracted the full projection matrix (hidden dimension 1024) of OpenAI’s Ada model for under $20, and the full projection matrix (hidden dimension 2048) of the Babbage model for a comparable cost. For gpt-3.5-turbo, whose embedding dimension Finlayson et al. estimate at 4096, Carlini et al. estimated that extracting the projection matrix would cost under $2,000 in API queries.

Finlayson et al. also showed that LLM outputs function as a unique model signature, enabling detection of unannounced model updates. They found no obvious defense that would not either alter the architecture or make the API considerably less useful.

Complementary Leaks: Precision Detection From 20 Logprobs

The last layer is not the only thing leaking. An ICLR 2026 blog post demonstrates that just 20 logprobs suffice to infer a model’s internal floating-point precision. The technique distinguishes FP32, BF16, FP16, and FP8, and the authors report evidence that older OpenAI models (GPT-3.5, GPT-4) use FP32 logits while newer models (GPT-4o, GPT-4.1) use BF16.

This is a separate information leak from projection-matrix extraction, but it compounds the problem. An attacker who can determine both the projection matrix and the numerical precision of a proprietary model has extracted proprietary details without ever accessing the weights. The blog post notes that adding random noise to logprobs is a simple defense, though it degrades output quality.

What Defenses Actually Work

The 2023 ACM Computing Surveys systematic review of model-stealing attacks cataloged defense strategies and found that many are rendered less effective by current attack techniques. The defenses that remain relevant for last-layer extraction fall into three categories:

Reduce output precision. Rounding logprobs, truncating responses to the top k, or adding calibrated noise all raise the number of queries required for extraction. The precision-detection result suggests that even coarse rounding to BF16 does not prevent the leak, so the noise floor needs to be higher than the rounding error.

Remove full logit access. Returning only the argmax label, or a small number of top probabilities with reduced precision, removes the rank information that SVD exploits. This is the most effective defense but also the most disruptive to downstream use cases that depend on calibrated probabilities.

Accept the loss and harden what remains. The identifiability-wall result gives this strategy a formal basis. If the last layer is treated as compromised, the security boundary shifts to protecting the transformer body. That means defending against side-channel access, weight exfiltration, and training-data extraction, which are different threat models with different mitigations.
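The effect of the first two defenses on the SVD step can be sketched on a toy model. Everything here is an assumption made for illustration: Gaussian noise stands in for whatever perturbation a provider might add, truncated entries are filled with zeros, and the sizes and noise levels are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
h, vocab, n = 64, 512, 256  # toy sizes, not taken from any real API
logits = rng.standard_normal((vocab, h)) @ rng.standard_normal((h, n))

def spectrum_gap(x, h):
    # Ratio of the h-th to the (h+1)-th singular value.
    s = np.linalg.svd(x, compute_uv=False)
    return s[h - 1] / max(s[h], np.finfo(float).tiny)

def keep_top_k(x, k):
    # Keep the k largest entries in each column, zero-fill the rest.
    idx = np.argpartition(x, -k, axis=0)[-k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=0), axis=0)
    return out

# Defense 1: additive noise of increasing standard deviation.
noise_gaps = {}
for sigma in (0.0, 1e-3, 1e-1, 10.0):
    noisy = logits + sigma * rng.standard_normal(logits.shape)
    noise_gaps[sigma] = spectrum_gap(noisy, h)

# Defense 2: return only the top 20 entries of each response.
topk_gap = spectrum_gap(keep_top_k(logits, 20), h)

for sigma, gap in noise_gaps.items():
    print(f"noise sigma={sigma:g}: gap={gap:.3g}")
print(f"top-20 truncation: gap={topk_gap:.3g}")
```

On this toy model the gap shrinks steadily as the noise grows and disappears once the noise is comparable to the signal, while zero-filled truncation removes the low-rank structure outright. The sketch does not model an attacker who averages repeated queries or who reconstructs missing entries through other API features.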

The geometry paper’s contribution is not a new attack. It is a proof of where the boundary of extractable information sits, and why that boundary is rigid. For API providers, the practical implication is blunt: if you expose full logits or high-precision logprobs, assume the unembedding layer is known to any determined adversary. And for everyone wondering whether the same attack can climb deeper into the model, the answer is provably no, at least through this vector.

Frequently Asked Questions

Does the extraction still work if the API returns only top-k logprobs instead of the full vocabulary?

Not by itself. The SVD step needs complete logit vectors, one entry per vocabulary token, stacked into a matrix. A top-k truncated response leaves most entries unobserved, so the stacked matrix no longer exposes the projection’s column space and the rank cannot be read off. The caveat is that other API features must not undo the truncation: Carlini et al. reconstructed full logit vectors from top-k responses when a logit-bias parameter was also available. The 2023 ACM Computing Surveys categorization separates “labels only” attacks from “full probabilities” attacks; this extraction falls firmly in the latter. APIs that withhold logprobs entirely still remain vulnerable to model distillation, where an adversary trains a functional replica from input-output text pairs without ever recovering exact weights.

How does last-layer extraction differ from model distillation as a theft strategy?

Distillation trains a separate architecture on the target model’s outputs, producing a functional clone whose fidelity depends on query budget and architecture choice. It works even when the API returns only generated text, with no logprobs required. The geometric extraction, by contrast, recovers the deployed model’s exact projection matrix, up to a rotation, but stops cold at the identifiability wall. For an adversary, distillation is broader but approximate; last-layer extraction is narrow but exact.

What happens to extraction accuracy when providers add noise to logprobs?

The SVD rank-recovery step depends on a clean spectrum gap (the 14-order-of-magnitude drop reported for h=64). Noise that exceeds the storage format’s rounding error degrades that gap, forcing the attacker to issue far more queries to separate signal from noise. But the precision-detection result shows that even BF16 rounding does not mask a model’s floating-point signature; 20 logprobs still distinguish FP32 from BF16 from FP8. The defense that raises the noise floor high enough to disrupt SVD is also the one that most degrades calibrated probability outputs for downstream tasks.

Can the identifiability wall be bypassed by combining logit extraction with another attack vector?

The wall is proven for the black-box, outputs-only observation map. If an adversary gains gradient access through a fine-tuning API, the threat model changes entirely and deeper layers become recoverable. Side-channel attacks exploiting timing, memory access patterns, or power consumption can also extract weight information from any layer, unconstrained by the identifiability boundary. Providers that expose both logprobs and fine-tuning endpoints present a wider attack surface than those offering text generation alone.

How many API queries does the SVD step require at minimum?

The decomposition needs enough query response vectors to populate a matrix whose rank equals the hidden dimension h. At minimum that takes more than h responses, at least h of them linearly independent, because the gap only appears between singular values h and h+1. In practice several multiples of h are needed to produce a clean spectrum gap. For h=1024 (OpenAI’s Ada), this means low thousands of queries. For h=4096 (estimated for gpt-3.5-turbo), the query count scales proportionally, consistent with Carlini et al.’s under-$2,000 cost estimate at 2024 API pricing. The query cost is linear in h, not exponential, which is what makes the attack affordable.
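The dependence on query count can be seen on a toy model by varying the number of responses around h. This is a minimal sketch with hypothetical sizes and Gaussian stand-ins for the model, and it assumes each query returns a full, noise-free logit vector.

```python
import numpy as np

rng = np.random.default_rng(5)
h, vocab = 64, 512  # toy sizes; real models are far larger
W = rng.standard_normal((vocab, h))

def largest_gap(n_queries):
    # Return (index of the largest spectrum gap, size of that gap).
    logits = W @ rng.standard_normal((h, n_queries))
    s = np.linalg.svd(logits, compute_uv=False)
    ratios = s[:-1] / np.maximum(s[1:], np.finfo(float).tiny)
    i = int(np.argmax(ratios))
    return i + 1, ratios[i]

results = {n: largest_gap(n) for n in (h // 2, h, h + 1, 2 * h)}
for n, (index, gap) in results.items():
    print(f"{n:4d} queries: largest gap at index {index}, ratio {gap:.3g}")
```

With h or fewer responses the stacked matrix has full rank and no decisive gap appears anywhere. As soon as the count exceeds h, a gap of many orders of magnitude opens at index h. With rounded or noisy outputs the threshold is less sharp, which is why several multiples of h are used in practice.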