A paper posted to arXiv on 22 April 2026 builds an LLM agent that iterates over data visualisations and emits a structured JSON rationale at every step, naming exactly which hyperparameter failed, why, and what the fix should be. The schema is explicit enough that a downstream tool could reconstruct the agent’s decision trace. Neither CrewAI 1.14.3 nor AG2 v0.12.1, both released 24 April, stores anything equivalent.
What the Paper Actually Builds
The paper’s v2 revision[1] frames the problem as a loop: generate a chart, score it, explain the failure, adjust parameters, repeat. The interesting part is the schema the agent writes at each iteration. Each step produces a JSON object containing a quality_score, a score_rationale, an overall_assessment block with key_strengths, key_weaknesses, and metric_analysis sub-fields, a dendrogram_comparison, a visual_inspection block with an artifacts list, and a recommendations array where each entry specifies parameter, current_value, suggested_value, rationale, expected_impact, and priority.
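To make that shape concrete, here is a minimal sketch of a single step record. The field names follow the paper’s description above; every value, and the exact nesting of string versus list fields, is an invention for illustration.

```python
# Illustrative reconstruction of one per-step record. Field names come
# from the paper's reported schema; all values here are made up.
step_record = {
    "quality_score": 0.62,
    "score_rationale": "Cluster separation visible, but colour bins overlap.",
    "overall_assessment": {
        "key_strengths": ["legible axis labels"],
        "key_weaknesses": ["over-plotted lower-left quadrant"],
        "metric_analysis": "quality_score flat across the last two steps",
    },
    "dendrogram_comparison": "leaf ordering diverges from the reference layout",
    "visual_inspection": {
        "artifacts": ["banding in the colour map", "clipped legend"],
    },
    "recommendations": [
        {
            "parameter": "color_bins",
            "current_value": 5,
            "suggested_value": 9,
            "rationale": "5 bins collapse adjacent clusters into one hue",
            "expected_impact": "clearer cluster boundaries",
            "priority": 1,
        },
    ],
}
```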
That priority field matters more than it might appear. The agent ranks its own proposed changes before applying them, which gives a downstream consumer a diff-like view of intent: the agent believed this was the most important fix, and predicted this outcome, before it tried it. If the next iteration’s score doesn’t improve, you have a named hypothesis that failed, not a message log you have to re-parse to figure out what the agent was thinking.
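That consumer is straightforward to sketch. The helper below assumes step records shaped like the example above and that priority 1 is the highest rank; check_top_hypothesis is hypothetical, part of neither the paper’s code nor either framework.

```python
def check_top_hypothesis(prev_step: dict, next_step: dict) -> dict:
    # Pick the recommendation the agent itself ranked most important
    # (assumes priority 1 is the highest rank).
    top = min(prev_step["recommendations"], key=lambda r: r["priority"])
    delta = next_step["quality_score"] - prev_step["quality_score"]
    return {
        "parameter": top["parameter"],
        "expected_impact": top["expected_impact"],
        "observed_delta": delta,
        # The named-hypothesis-that-failed case: the agent applied its
        # top-ranked fix and the score did not improve.
        "hypothesis_held": delta > 0,
    }
```

Run over consecutive step records, this yields exactly the hypothesis ledger that a raw message transcript forces you to reconstruct by hand.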
The chain-of-thought constraint is structural: the agent must populate the visual_inspection fields before it generates recommendations. That sequencing enforces observation before action, which is less a novel technique than a prompt-engineering discipline baked into the schema.
Why CrewAI and AutoGen Can’t Ingest the Trace Today
CrewAI’s memory system[2] stores facts using composite scoring (semantic similarity, recency, importance) with adaptive-depth recall. Its documented event types include MemoryQueryStartedEvent and MemorySaveCompletedEvent. CrewAI Flows[3] expose state as an unstructured dict or a Pydantic BaseModel. None of that has a slot for score_rationale, expected_impact, or anything analogous to a per-step intent record.
CrewAI 1.14.3[4], released 24 April, adds checkpoint lineage and lifecycle events; enriched LLM token tracking arrived in 1.14.2a2[5]. The release notes contain no mention of rationale fields, intent tracking, or structured reasoning traces.
AG2 v0.12.1[6], also released 24 April, added step events on top of the Observer API introduced in v0.12.0[7]. The existing OpenTelemetry tracing already wraps spans around team runs and tool calls. The AutoGen tracing documentation[8] describes span-level visibility into execution but nothing equivalent to a per-step rationale block, a structured intent-mismatch delta, or a priority-ranked recommendation list.
Both frameworks can tell you what the agent did and how many tokens it spent. Neither can tell you why the agent changed its mind between step 3 and step 4, in terms a downstream tool can parse without re-reading the full message transcript.
The Token-and-Storage Tax Nobody Measured
The paper reports iteration counts for convergence but does not measure token usage or API cost for emitting the full JSON diagnostic[1] at each step. That’s a notable omission.
The schema is not small. Each iteration emits quality_score, score_rationale, overall_assessment with three sub-fields, dendrogram_comparison, visual_inspection with an artifacts list, and a recommendations array where each entry carries six fields. Serialised, that runs several hundred tokens per iteration. In an agent that runs 10-20 refinement cycles, the rationale payloads alone could equal or exceed the token cost of the chart-generation calls themselves.
Storage compounds the problem at scale. A dashboard pipeline generating hundreds of charts while persisting full rationale JSON per iteration faces a different storage budget than one logging message histories. Neither the paper nor the two framework releases give practitioners a number to plan against.
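In the absence of reported figures, a back-of-envelope estimate is the only planning tool available. Every input below is an assumption, flagged inline; only the arithmetic is certain.

```python
# Rough budget for rationale payloads. All inputs are assumptions,
# since neither the paper nor the framework releases report numbers.
TOKENS_PER_RATIONALE = 400   # assumed: "several hundred tokens" per step
ITERATIONS_PER_CHART = 15    # assumed: midpoint of 10-20 refinement cycles
CHARTS_PER_PIPELINE = 500    # assumed: a dashboard pipeline's chart count
BYTES_PER_TOKEN = 4          # assumed: crude serialisation estimate

tokens_per_chart = TOKENS_PER_RATIONALE * ITERATIONS_PER_CHART
pipeline_tokens = tokens_per_chart * CHARTS_PER_PIPELINE
storage_mb = pipeline_tokens * BYTES_PER_TOKEN / 1_000_000

print(f"{tokens_per_chart} rationale tokens per chart")       # 6000
print(f"{pipeline_tokens:,} rationale tokens per pipeline")   # 3,000,000
print(f"~{storage_mb:.0f} MB of rationale JSON to persist")   # ~12 MB
```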
What Framework Authors Would Have to Add
Porting this pattern to CrewAI or AG2 requires a schema, not just hooks.
For CrewAI, that means a per-step state entry that survives across Flow state transitions and carries typed fields: at minimum step_index, quality_score, rationale, and a recommendations array. Flow state today accepts arbitrary Pydantic models, so the mechanism exists, but there’s no standard schema, which means every team building iterative-refinement agents defines its own, and cross-agent observability remains impossible by default.
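A minimal sketch of what such a typed entry could look like, assuming CrewAI’s documented support for Pydantic models as Flow state. The model names and field choices here are ours, not CrewAI’s, which is exactly the point: absent a published standard, every team writes some variant of this.

```python
from pydantic import BaseModel, Field

class Recommendation(BaseModel):
    parameter: str
    current_value: str
    suggested_value: str
    rationale: str
    expected_impact: str
    priority: int

class StepTrace(BaseModel):
    step_index: int
    quality_score: float
    rationale: str
    recommendations: list[Recommendation] = Field(default_factory=list)

class RefinementState(BaseModel):
    # Flow state accepts arbitrary Pydantic models, so a per-step trace
    # list can survive state transitions; the schema itself is bespoke.
    steps: list[StepTrace] = Field(default_factory=list)
```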
For AG2, the step events added in v0.12.1 and the Observer API introduced in v0.12.0[7] are structurally closer. Step events fire at iteration boundaries. But “step event” and “rationale block” are different things: an event signals when, a rationale block captures why. A team could attach a custom payload to an Observer event today, but they’d be inventing a schema that won’t interoperate with any downstream tool expecting a standard field.
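The equivalent improvisation on the AG2 side might look like the sketch below. The payload dataclass is ours, and the callback signature is hypothetical; AG2’s actual Observer interface may differ.

```python
from dataclasses import dataclass, asdict
from typing import Callable

@dataclass
class RationalePayload:
    step_index: int
    quality_score: float
    rationale: str
    recommendations: list[dict]

def on_step(event: dict, emit: Callable[[str, dict], None]) -> None:
    # Hypothetical observer callback: the event shape and emit hook are
    # stand-ins, not AG2 API. Only the payload schema is the point.
    payload = RationalePayload(
        step_index=event["step_index"],
        quality_score=event["quality_score"],
        rationale=event["rationale"],
        recommendations=event["recommendations"],
    )
    # A bespoke schema: no downstream tool expecting a standard field
    # layout will interoperate with it.
    emit("step.rationale", asdict(payload))
```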
Both frameworks would need a published schema for per-step reasoning traces, stable enough to version against, before this pattern becomes portable.
When This Matters (and When It Doesn’t)
The paper’s context is visualisation hyperparameter tuning: chart type, axis scaling, colour binning, dendrogram layout. The agent iterates over a bounded parameter space where intent is relatively well-defined: the user specified a target, the chart either matched it or didn’t. Intent drift here means hyperparameter mismatch, not an agent wandering from a user’s goal across a long multi-step workflow.
That scope matters for how far to generalise. The logging gap is sharpest in iterative-refinement loops where the agent explicitly optimises toward a stated target across multiple self-evaluating iterations. A CrewAI workflow running a linear sequence of tasks with no per-step self-evaluation has less need for per-step rationale fields. A long-running AutoGen GroupChat with many rounds of agent debate has more need, but the intervention point differs: you’d want rationale attached to proposal events, not chart-generation steps.
The paper’s implication for framework authors is genuinely useful for a specific class of agents. The broader claim that all agent frameworks lack structured reasoning traces is an overstatement the evidence here doesn’t support, and one worth resisting before it calcifies into conventional wisdom.
Frequently Asked Questions
How does this differ from intent-drift detection in coding agents?
The “Beyond the Diff” benchmark quantifies coding-agent intent drift but stops at detection, offering no machine-readable trace schema for downstream tooling. The arXiv paper’s contribution is the structured JSON diagnostic—versioned, queryable, parseable without transcript re-reading—which moves beyond drift detection into structured intent logging. The two share a root problem (agents silently changing direction) but address different layers of it.
Can AG2’s OpenTelemetry integration carry rationale payloads without framework changes?
Technically yes: OTel supports arbitrary span attributes, and AG2 already wraps spans around team runs and tool calls. A team could attach rationale JSON as a span attribute today. The constraint is that trace backends such as Jaeger index flat key-value span attributes, not nested recommendation arrays with priority fields, so cross-run comparison queries would require custom tooling layered on top of the trace backend.
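A minimal sketch of that workaround with the OpenTelemetry Python SDK; the span name and attribute keys are our own choices, not an AG2 or OTel convention.

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("viz-refinement")

def record_step(step_record: dict) -> None:
    with tracer.start_as_current_span("refinement_step") as span:
        # Flat attributes are what trace backends index and query well.
        span.set_attribute("agent.quality_score", step_record["quality_score"])
        # The nested rationale goes in as an opaque JSON string: stored
        # verbatim, but not structurally queryable across runs.
        span.set_attribute("agent.rationale_json", json.dumps(step_record))
```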
Why isn’t CrewAI’s memory scoring a substitute for per-step rationale?
CrewAI’s composite scoring (semantic similarity, recency, importance) ranks which stored facts to retrieve—a retrieval-ranking heuristic, not an intent record. It answers “which memory is most relevant right now,” not “why did the agent change parameter X at step N.” The two mechanisms serve orthogonal purposes: one optimises recall, the other reconstructs decision provenance.
When will the version-specific claims in this analysis go stale?
The framework version numbers (CrewAI 1.14.3, AG2 v0.12.1) and the arXiv revision date (22 April 2026) are time-sensitive anchors that age with each new release. The structural claim—that neither framework defines a first-class per-step rationale schema—is evergreen until one of them ships one, which the 24 April releases confirmed has not yet happened. Practitioners citing this gap should pin their claims to the version numbers and re-verify after each framework release.