Pydantic AI shipped five releases1 between April 15 and 24, 2026 (v1.83 through v1.87), adding inline human-in-the-loop tool-call approval, production observability via OpenTelemetry, and server-managed long-context state compression. These three capabilities had been the most common reasons to reach for LangGraph instead. That framing is an analytical assertion, not a cited survey, but it maps closely to what practitioners actually complain about when Pydantic AI comes up in framework comparisons.
What Pydantic AI Shipped in Ten Days
Three of the five releases carry meaningful capability changes. [v1.84.0]2 (April 16) added a stateful compaction mode to OpenAICompaction and Claude Opus 4.7 support. [v1.85.0]3 (April 21) added online evaluation via OpenTelemetry events. [v1.87.0]4 (April 24) added HandleDeferredToolCalls and the handle_deferred_tool_calls hook for inline resolution of deferred tool calls during a run without halting it.
The timing is not obviously coordinated. These are pull requests from different contributors merging on different days, but the net effect across the ten-day window is a coherent feature cluster.
HandleDeferredToolCalls: Human-in-the-Loop Without Stopping the World
The standard pattern for human-in-the-loop in agent frameworks requires halting a run when a tool call needs approval, surfacing the pending calls to a UI layer, collecting input, then restarting execution with the result injected. The interruption is architectural: you either build the pause/resume machinery yourself or you use a framework that provides it natively.
[HandleDeferredToolCalls]5 wraps a user-provided handler that can approve, deny, or supply results for pending tool calls inline during a run. Unresolved calls surface as DeferredToolRequests output for UI adapters that need to present them to a human. It is positioned as the recommended primary method for deferred tools.
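The release notes give the class and hook names, but the exact API surface isn’t verified here, so the sketch below is hypothetical: the import path, the handler signature, and the approval return shape are assumptions chosen to illustrate the inline pattern just described.

```python
# Hypothetical sketch of inline deferred-tool-call resolution. The names
# HandleDeferredToolCalls and handle_deferred_tool_calls come from the v1.87.0
# notes; the import path, handler signature, and return shape are assumptions.
from pydantic_ai import Agent, HandleDeferredToolCalls  # assumed import location


def review_tool_call(call):
    """Called inline for each pending deferred tool call while the run continues."""
    if call.tool_name == "delete_customer_records":  # assumed attribute name
        return {"approved": False, "reason": "destructive calls need manual review"}
    return {"approved": True}


agent = Agent(
    "openai:gpt-4o",
    # Assumed wiring; the hook could instead attach per run or per toolset.
    handle_deferred_tool_calls=HandleDeferredToolCalls(review_tool_call),
)

result = agent.run_sync("Archive inactive accounts and remove stale records")
# Calls the handler does not resolve still surface as DeferredToolRequests
# output for a UI adapter to present to a human.
```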
LangGraph’s human-in-the-loop support has historically required adopting its graph-state execution model to get the persistence that makes pause/resume work. Whether HandleDeferredToolCalls is equivalent in durability for long-running agents is not yet established; the feature is four days old.
OpenTelemetry Online Evaluation: Production Observability Built In
[v1.85.0]3 adds an @evaluate decorator, an OnlineEvaluation [capability]6, and gen_ai.evaluation.result OTel events that parent to the referenced span. The design follows OTel semantic conventions: evaluation results emit as events on the trace rather than routing to a sidecar data store or a separate API call.
Where LangSmith routes eval results to LangChain’s hosted tracing platform, Pydantic AI’s approach emits to whatever OTel-compatible backend you’re already running: Honeycomb, Grafana Tempo, Datadog, or a self-hosted Jaeger. For teams already instrumented with OTel, eval results flow into existing dashboards without an additional integration.
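A minimal wiring sketch: the exporter setup below is standard OpenTelemetry SDK code and instrument=True is existing Pydantic AI instrumentation, but the @evaluate import path, its signature, and how an evaluator binds to a run are assumptions based on the release description, not verified API.

```python
# Standard OTel SDK setup pointing at whatever OTLP-compatible backend is
# already running (Tempo, Honeycomb, a Datadog agent, Jaeger, ...).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

from pydantic_ai import Agent
from pydantic_ai import evaluate  # assumed import path for the new decorator

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

agent = Agent("openai:gpt-4o", instrument=True)  # existing OTel instrumentation


@evaluate(name="cites_a_source")  # assumed decorator signature
def cites_a_source(output: str) -> bool:
    # Per the release description, the score is emitted as a
    # gen_ai.evaluation.result event parented to the referenced span.
    return "http" in output


result = agent.run_sync("Summarize the latest release notes and link the source")
```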
The tradeoff is tooling depth. LangSmith is a purpose-built eval and tracing UI; an OTel backend is a general-purpose observability store. Teams that want dataset management, comparison runs, and eval-over-time visualization built in are not getting that from gen_ai.evaluation.result events alone.
Stateful OpenAICompaction: Long-Context State That Actually Works
[v1.84.0]2 makes OpenAICompaction() stateful by default. Previously, bare OpenAICompaction() ran in stateless mode with message_count_threshold=10: client-side, threshold-based, predictable in behavior if not always in output quality. The new default uses OpenAI’s server-side auto-compaction via context_management on /responses requests, with the server managing the threshold.
This is a breaking change. Any code using bare OpenAICompaction() and relying on the stateless behavior needs to opt back in explicitly. The v1.84.0 PR does not document a deprecation warning, so existing users will hit this silently on upgrade.
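For teams that want the old behavior back on upgrade, the opt-out presumably looks something like the sketch below. The parameter names stateless and message_count_threshold are taken from the change as described here; the import path and the way the compactor attaches to an agent are assumptions.

```python
# Hypothetical opt-back-in to the pre-v1.84.0 client-side behavior.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAICompaction  # assumed import location

# Bare OpenAICompaction() now means server-side auto-compaction via
# context_management on /responses; pin the old stateless mode explicitly.
compaction = OpenAICompaction(stateless=True, message_count_threshold=10)

agent = Agent("openai:gpt-4o", history_processors=[compaction])  # assumed wiring
```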
The scope limitation is significant: OpenAICompaction is OpenAI-specific. It delegates state management to OpenAI’s /responses API, which means it provides nothing for Anthropic-hosted models, open-source deployments, or anything outside the OpenAI API surface. Claude Opus 4.7 support arrived in the same release, but the compaction path for Anthropic models is a separate problem v1.84.0 does not address.
What This Changes for the Pydantic AI vs. LangGraph Decision
The case for LangGraph on greenfield Python agent work has rested on three capabilities: human-in-the-loop state management, production tracing, and durable long-context execution. [LangGraph positions itself]7 as a low-level orchestration framework built around those primitives, with enterprise adoption at Klarna, Uber, and J.P. Morgan.
Pydantic AI now has an answer in all three categories. The answers are narrower in some cases (OTel observability vs. LangSmith’s purpose-built eval UI; OpenAI-specific compaction vs. a model-agnostic solution) and less proven across the board. For a team starting fresh in April 2026, the decision no longer turns on a feature checklist. It depends on which tradeoffs matter more: Pydantic AI’s lighter dependency surface and faster iteration cadence, or LangGraph’s deeper ecosystem and longer production track record.
The dependency surface point is structural. LangGraph requires the LangChain stack. Pydantic AI’s core is pydantic and httpx. For teams that care about dependency auditing, that gap is not incidental.
What LangGraph Still Does Better
[LangGraph’s managed deployment infrastructure]7 and LangSmith’s eval tooling have no equivalent in the Pydantic AI ecosystem yet. LangGraph provides checkpointing and persistence at the graph level: durable state that survives process restarts, not just tool-call approval hooks. Whether HandleDeferredToolCalls covers the same durability surface is not something a four-day-old PR can answer.
LangChain’s component library (document loaders, retrievers, vector store integrations) is not replicated elsewhere. Teams already built on LangChain components don’t face a rewrite decision because Pydantic AI added OTel events.
Pydantic AI compressed the capability gap significantly in ten days. It did not close the ecosystem gap, and production maturity takes longer than a PR merge. The frameworks are no longer obviously differentiated on core features. They are now differentiated on ecosystem depth, operational experience, and the specific requirements of the agent being built.
Frequently Asked Questions
What compaction options exist for Anthropic or open-source model users in Pydantic AI?
None shipped in this release cluster. OpenAICompaction delegates to OpenAI’s server-side /responses context_management endpoint, which has no equivalent in Anthropic’s or other providers’ APIs. Users on non-OpenAI models must implement their own client-side compaction — for instance, manually trimming message history above a token threshold before each model call. Claude Opus 4.7 model support arrived in v1.84.0 alongside stateful compaction, but Anthropic models received no compaction path.
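A rough client-side stand-in looks like the sketch below: a history processor that drops the oldest messages once a crude token estimate exceeds a budget. The history_processors hook is existing Pydantic AI surface; the model string and the characters-per-token estimate are placeholders.

```python
# Client-side trimming for models with no server-side compaction path.
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage

TOKEN_BUDGET = 8_000


def estimate_tokens(message: ModelMessage) -> int:
    # Crude approximation: roughly 4 characters per token.
    return len(str(message)) // 4


def trim_history(messages: list[ModelMessage]) -> list[ModelMessage]:
    """Drop the oldest messages until the estimated total fits the budget.

    A production version would pin the system prompt and summarize rather
    than drop turns outright.
    """
    trimmed = list(messages)
    while len(trimmed) > 1 and sum(map(estimate_tokens, trimmed)) > TOKEN_BUDGET:
        trimmed.pop(0)
    return trimmed


# Model string is illustrative; use whichever non-OpenAI model you run.
agent = Agent("anthropic:claude-3-5-sonnet-latest", history_processors=[trim_history])
```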
How does HandleDeferredToolCalls differ architecturally from LangGraph’s interrupt()?
LangGraph’s interrupt() writes paused graph state to a configurable checkpoint store (Postgres, SQLite, in-memory), so the process can crash, restart, or be rescheduled hours later and resume from persisted state. HandleDeferredToolCalls keeps the agent run alive in-process while the handler resolves pending calls. If the Python process dies mid-run — a Kubernetes pod eviction, an OOM kill — that run is lost. The tradeoff is simplicity now versus crash-safe durability for long-running approval flows.
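For contrast, the LangGraph side of that comparison looks roughly like the sketch below, using the in-memory checkpointer to stay self-contained; a Postgres or SQLite saver is what makes the multi-day resume durable.

```python
# Minimal LangGraph checkpointed-interrupt sketch (in-memory saver for brevity).
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.types import Command, interrupt


class State(TypedDict):
    tool_args: dict
    approved: bool


def request_approval(state: State) -> dict:
    # interrupt() persists the paused graph state through the checkpointer;
    # a later invocation resumes from stored state, even after a restart.
    decision = interrupt({"tool_args": state["tool_args"]})
    return {"approved": decision["approved"]}


builder = StateGraph(State)
builder.add_node("request_approval", request_approval)
builder.add_edge(START, "request_approval")
builder.add_edge("request_approval", END)
# Swap MemorySaver for a Postgres/SQLite saver to survive process restarts.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "review-42"}}
graph.invoke({"tool_args": {"record_id": 7}, "approved": False}, config)
# ...hours or days later, potentially from a different process when the
# checkpointer is durable:
graph.invoke(Command(resume={"approved": True}), config)
```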
What does the OpenAICompaction silent breaking change actually look like in production?
No error is thrown. The failure mode is behavioral drift: server-managed compaction may summarize or truncate conversation history using OpenAI’s internal heuristics rather than the predictable client-side cutoff at 10 messages. Agent outputs can shift subtly — differently phrased summaries, dropped context from earlier turns — without triggering test suite failures because the agent still completes successfully. Pinning stateless=True on existing OpenAICompaction() instances preserves the old behavior, but teams need to know to do it.
Can the gen_ai.evaluation.result OTel events support regression tracking across agent versions?
Not directly. The events are structured as OTel span events following semantic conventions, meaning they attach to individual trace spans and are queryable in backends like Honeycomb or Grafana Tempo alongside other telemetry. But they lack first-class identity for cross-run or cross-version comparison — there’s no built-in dataset grouping, no baseline-diff primitive, and no notion of an eval suite that can be re-executed. Building regression detection requires layering custom tooling on top of the raw events.
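One low-cost partial workaround, using only standard OpenTelemetry SDK calls rather than anything in the release: stamp a version resource attribute on the tracer provider so eval events can at least be grouped by agent version in the backend’s query layer.

```python
# Standard OTel resource attributes; nothing Pydantic AI-specific here.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(
    resource=Resource.create(
        {"service.name": "support-agent", "service.version": "2026.04.2"}
    )
)
trace.set_tracer_provider(provider)
# Spans, and the gen_ai.evaluation.result events attached to them, now carry
# these resource attributes, so a backend query can group eval scores by
# service.version. That is still well short of a baseline-diff primitive.
```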
What specific agent scenarios would still force a team toward LangGraph despite these releases?
Multi-day human review cycles are the clearest case. If a legal-compliance agent submits a tool call for human sign-off and the reviewer responds 48 hours later, the originating process is almost certainly gone. LangGraph’s checkpoint store makes that resume trivial. Similarly, agents running on preemptible infrastructure (spot instances, serverless workers with timeouts) need durability that in-process deferred resolution cannot provide. Managed deployment infrastructure — LangGraph’s hosted Cloud/Studio environment — also has no Pydantic AI equivalent.