Learning to Configure Agentic AI Systems Exposes a Gap in CrewAI and AutoGen Template Libraries

Most agentic AI frameworks ship with template libraries: pre-built crews in CrewAI, conversation presets in AutoGen, example graphs in LangGraph. The implicit assumption is that a human selects or authors the right configuration for a given task. ARC (arXiv:2602.11574)¹, updated to its third revision on May 21, 2026, tests that assumption and finds it wanting. A lightweight learned policy that selects per-query agent configurations beats budget-matched static setups by 31.3% on reasoning accuracy and 13.95% on tool-use accuracy¹, and doubles the τ-Bench Airline Pass rate from 9.0% to 18.0%¹. If those numbers hold under broader replication, template curation stops being a competitive advantage and starts being a liability.

What ARC Actually Does: Per-Query Config as a Learned Policy

ARC reframes agent configuration as a semi-Markov decision process (SMDP)¹. Each incoming query triggers a hierarchical policy that selects a configuration “option”: a temporally extended action specifying the workflow topology, tool set, token budget, and prompt template the agent system will use for that query. The configuration space is combinatorially large. The policy learns to navigate it.
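To make the idea of a configuration option concrete, here is a minimal sketch in Python. It is not ARC’s policy or training procedure: it swaps in a simple epsilon-greedy, contextual-bandit-style learner, and the catalog, the `query_bucket` feature function, and all names are hypothetical. What it does show is the shape of the decision: one option bundles workflow, tools, budget, and prompt template, and the choice is made in two levels (workflow first, then an option within it).

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigOption:
    """One configuration option, held fixed for the whole query."""
    workflow: str
    tools: tuple
    token_budget: int
    prompt_template: str


# Hypothetical catalog; a real configuration space would be far larger.
CATALOG = [
    ConfigOption("single_agent", (), 1_000, "direct"),
    ConfigOption("single_agent", ("search",), 2_000, "react"),
    ConfigOption("planner_executor", ("search", "calculator"), 8_000, "plan_then_act"),
]


def query_bucket(query: str) -> str:
    """Hypothetical stand-in for learned query features."""
    return "long" if len(query.split()) > 12 else "short"


class EpsilonGreedyConfigPolicy:
    """Two-level choice: pick a workflow, then an option inside it."""

    def __init__(self, catalog, epsilon=0.1, seed=0):
        self.catalog = list(catalog)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.total = defaultdict(float)  # (bucket, option) -> summed reward
        self.count = defaultdict(int)    # (bucket, option) -> times tried

    def _value(self, bucket, option):
        n = self.count[(bucket, option)]
        return self.total[(bucket, option)] / n if n else 0.0

    def select(self, query: str) -> ConfigOption:
        bucket = query_bucket(query)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.catalog)  # explore
        # High level: the workflow whose best option has the highest value.
        workflows = sorted({o.workflow for o in self.catalog})
        best_wf = max(workflows, key=lambda w: max(
            self._value(bucket, o) for o in self.catalog if o.workflow == w))
        # Low level: the best option within that workflow.
        candidates = [o for o in self.catalog if o.workflow == best_wf]
        return max(candidates, key=lambda o: self._value(bucket, o))

    def update(self, query: str, option: ConfigOption, reward: float) -> None:
        key = (query_bucket(query), option)
        self.total[key] += reward
        self.count[key] += 1
```

In use, the caller would run the agent system under the selected option and feed the observed outcome back through `update`; with `epsilon=0.0` the policy is purely greedy over what it has seen so far.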

The paper is explicit about what it replaces. Current agent configuration is, in its words, “typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute.” That description maps directly onto how most practitioners use CrewAI (pick a crew template, wire agents into roles), AutoGen (configure a GroupChat with fixed participant roles), or LangGraph (clone an example graph and edit nodes). The human picks once; the system runs that config for every query.

ARC’s move is to make the pick per-query and learned rather than per-deployment and authored.

The Numbers: 31% Reasoning, 14% Tool-Use, 2× τ-Bench¹

The headline results from ARC v3¹:

Metric	Improvement over budget-matched baseline
Average reasoning accuracy	+31.3%
Tool-use accuracy	+13.95%
τ-Bench Airline Pass	9.0% → 18.0% (2×)

These are not marginal gains. A doubling on τ-Bench, in particular, is notable because τ-Bench tests multi-turn agentic tasks with realistic tool-use constraints: the agent must manage a simulated airline booking system with policy-compliant reasoning across conversation turns. Even after doubling, 18%¹ is still a low absolute pass rate, which says more about the difficulty of the benchmark than about ARC’s ceiling.
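The gap between the relative and the absolute reading of the τ-Bench result is easy to check. The helper below is purely illustrative arithmetic on the two reported pass rates; the function name is an assumption of this sketch.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier)."""
    return new_pct - baseline_pct, new_pct / baseline_pct


points, factor = improvement(9.0, 18.0)
remaining_failures = 100.0 - 18.0

print(f"+{points:.1f} points, {factor:.1f}x")          # +9.0 points, 2.0x
print(f"{remaining_failures:.1f}% of tasks still fail")  # 82.0% of tasks still fail
```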

The benchmarks cover reasoning, tool use, and agentic tasks. The paper does not benchmark on real-world multi-agent crew deployments of the kind CrewAI or AutoGen users actually run. Extrapolating from single-agent or simplified multi-agent benchmarks to production multi-agent workflows requires caution.

Why Static Templates Break Down Under Query Diversity

A static template assumes query homogeneity. In practice, an agent system fielding diverse requests encounters easy lookups, multi-step reasoning chains, tool-heavy operations, and conversational state-tracking tasks, often interleaved within the same session. A configuration optimized for one of these is suboptimal for the others.
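A toy calculation illustrates the argument. All numbers below are invented for illustration (they are not from the ARC paper): a table of success probabilities per query type and configuration, plus a traffic mix. The sketch compares the best single static configuration against an oracle that picks the best configuration per query type. The oracle is an upper bound on what any per-query selector could reach, not a prediction of what a learned policy achieves.

```python
# Hypothetical success probabilities: query type -> config -> P(success).
SCORES = {
    "lookup":         {"lean": 0.95, "reasoner": 0.90, "tool_heavy": 0.85},
    "multi_step":     {"lean": 0.30, "reasoner": 0.80, "tool_heavy": 0.55},
    "tool_operation": {"lean": 0.20, "reasoner": 0.45, "tool_heavy": 0.85},
    "state_tracking": {"lean": 0.40, "reasoner": 0.70, "tool_heavy": 0.60},
}
# Hypothetical share of traffic per query type (sums to 1).
MIX = {"lookup": 0.4, "multi_step": 0.2, "tool_operation": 0.2, "state_tracking": 0.2}
CONFIGS = ["lean", "reasoner", "tool_heavy"]


def expected_success(choose):
    """Expected success rate when `choose(query_type)` names the config used."""
    return sum(MIX[q] * SCORES[q][choose(q)] for q in MIX)


# Best single config applied to every query.
best_static = max(CONFIGS, key=lambda c: expected_success(lambda q: c))
static_score = expected_success(lambda q: best_static)

# Oracle: best config for each query type.
per_query_score = expected_success(
    lambda q: max(CONFIGS, key=lambda c: SCORES[q][c]))

print(best_static, f"{static_score:.2f}", f"{per_query_score:.2f}")
# reasoner 0.75 0.85
```

Under these made-up numbers the best static choice is a compromise that is never the top configuration for lookups or tool operations, which is exactly the gap a per-query selector has room to close.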

The Declarative Data Services paper (arXiv:2605.20690)² documents a related failure mode from a different angle: unbounded agentic discovery, where a coding agent iterates on its own failure logs without structured constraints, fails to converge consistently. The fix the paper proposes, structured declarative contracts at successive layers, decomposes the search into bounded sub-problems that do converge.

ARC and DDS arrive at overlapping conclusions from different directions. ARC says: let a learned policy pick the config. DDS says: constrain the search space with declarative contracts. Both reject the premise that a human should hand-author a single static configuration and expect it to generalize.

A separate line of work, Agentic Agile-V (arXiv:2605.20456)³, argues that “the central problem is no longer prompt engineering; it is engineering process control,” proposing a SCOPE-V loop that converts conversational intent into structured engineering artifacts. The common thread: the manual configuration layer in agentic systems is under pressure from multiple research directions simultaneously.

CrewAI vs AutoGen vs LangGraph: Whose Config Surface Is Learnable?

This is where the ARC results become practical. If per-query learned configuration outperforms static templates, the framework that exposes the cleanest learnable configuration surface has a structural advantage. Frameworks whose configuration knobs are buried in Python class hierarchies, rather than represented as serializable manifests, will be harder to plug into learners like ARC.

CrewAI defines agents, tasks, and crews primarily through Python classes with optional YAML serialization. A Crew object composes Agent and Task objects; the YAML export exists but is a serialization of the class graph, not a first-class declarative schema. The configuration surface (role definitions, tool assignments, process types, memory settings) is spread across multiple class constructors.

AutoGen (Microsoft) takes a code-first approach. GroupChat configurations specify participant agents, speaker-selection functions, and max rounds as Python arguments. There is no standard declarative manifest format; the configuration lives in imperative Python. This description is based on pre-2026 documentation and has not been re-verified, so the API may have changed.

LangGraph, built on LangChain, defines agent workflows as stateful graphs. Nodes are functions; edges can be conditional. The graph structure is defined in Python but is conceptually close to a declarative state machine. LangGraph’s graph-definition API represents the workflow topology as a serializable object graph, which is closer to the kind of structured config surface ARC would consume.

SOLAR (arXiv:2605.20189)⁴ adds another data point: parameter-level meta-learning with multi-level RL enables agents to self-improve and adapt to unseen domains without catastrophic forgetting. SOLAR operates at a different layer than ARC (parameter adaptation vs. configuration selection), but both assume the system can improve at runtime rather than requiring a human to re-author its setup.

From Templates to Learners: What Framework Authors Should Do Next

If ARC-style learned configuration becomes standard, framework competition shifts from “who ships more templates” to “whose config surface is machine-readable, serializable, and granular enough for a policy to navigate.” Three concrete implications:

1. Expose configuration as a first-class declarative schema. This means YAML or JSON manifests that fully specify an agent system’s workflow, tools, prompts, and budgets, independent of the host language’s class hierarchy. CrewAI’s YAML serialization is a start but currently mirrors the Python object graph rather than serving as an independent specification.
2. Make the configuration space explicit and bounded. ARC’s SMDP formulation requires a discrete set of configuration options to select among. Frameworks that present configuration as an unbounded set of Python kwargs, where any string-valued argument is valid, offer a larger search space but a harder learning problem. The DDS paper’s² finding that structured contracts improve convergence is relevant here.
3. Instrument the config-to-outcome mapping. To train a learned policy, you need data linking configurations to outcomes per query type. Frameworks that log which configuration was used, what query it served, and what the result was (success, failure, latency, cost) produce the training data ARC-style learners need. Most current frameworks log agent actions but not the configuration that produced them.
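The three implications above can be sketched together in a framework-agnostic way. Nothing here is a CrewAI, AutoGen, or LangGraph API: `AgentManifest`, the enums, the allowed tool and budget sets, and the record fields are all hypothetical names chosen for illustration. The manifest round-trips through JSON, rejects values outside a bounded set, and is embedded in each outcome record so that configuration and result are logged together.

```python
import json
import time
from dataclasses import asdict, dataclass
from enum import Enum


class Workflow(str, Enum):
    SINGLE_AGENT = "single_agent"
    PLANNER_EXECUTOR = "planner_executor"


class PromptTemplate(str, Enum):
    DIRECT = "direct"
    PLAN_THEN_ACT = "plan_then_act"


# Hypothetical bounded choices; anything else is rejected.
ALLOWED_TOOLS = frozenset({"search", "calculator", "booking_api"})
ALLOWED_BUDGETS = (1_000, 4_000, 16_000)


@dataclass(frozen=True)
class AgentManifest:
    workflow: Workflow
    prompt_template: PromptTemplate
    tools: tuple
    token_budget: int

    def __post_init__(self):
        unknown = set(self.tools) - ALLOWED_TOOLS
        if unknown:
            raise ValueError(f"unknown tools: {sorted(unknown)}")
        if self.token_budget not in ALLOWED_BUDGETS:
            raise ValueError(f"token_budget must be one of {ALLOWED_BUDGETS}")

    def to_json(self) -> str:
        # str-based enums serialize as their plain string values.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "AgentManifest":
        raw = json.loads(text)
        return cls(Workflow(raw["workflow"]),
                   PromptTemplate(raw["prompt_template"]),
                   tuple(raw["tools"]),
                   raw["token_budget"])


def outcome_record(manifest, query, success, latency_s, cost_usd) -> str:
    """One JSON line linking the configuration used to what happened."""
    return json.dumps({
        "ts": time.time(),
        "config": json.loads(manifest.to_json()),
        "query": query,
        "success": success,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
    })
```

A learner can then be trained from a file of such records without importing the framework that produced them, which is the practical payoff of keeping the manifest independent of the class hierarchy.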

None of this requires adopting ARC specifically. The broader shift is conceptual: agent configuration is becoming an optimization target, not a design decision. Frameworks that treat configuration as a static authoring problem will find themselves competing on template libraries against systems that learn the right configuration per query. The evidence from ARC v3 suggests the learning approach has legs.

Frequently Asked Questions

Does ARC carry session state across queries from the same user?

ARC’s SMDP formulation treats each query as an independent decision point, selecting a configuration option without forwarding prior choices into the next selection. Production systems that need session-aware routing, e.g., escalating from a lightweight lookup config to a reasoning-heavy one mid-conversation, would need to layer a session tracker on top of ARC’s per-query policy. The paper does not evaluate multi-turn config adaptation within a single conversation thread.
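One way such a session tracker could look is sketched below. This is an assumption-laden illustration, not something the paper describes: the tier names, the `SessionAwareRouter` class, and the escalation rule (raise the session’s minimum tier by one after a reported failure) are all hypothetical. The per-query policy is passed in as a plain function and stays stateless.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical config tiers, cheapest first.
TIERS = ["lookup", "tool_use", "reasoning"]


@dataclass
class SessionAwareRouter:
    """Wraps a stateless per-query policy with a per-session minimum tier."""
    per_query_policy: Callable[[str], str]  # query -> tier name
    floor: Dict[str, int] = field(default_factory=dict)  # session -> min tier index

    def route(self, session_id: str, query: str) -> str:
        proposed = TIERS.index(self.per_query_policy(query))
        # Never go below the tier this session has already escalated to.
        return TIERS[max(proposed, self.floor.get(session_id, 0))]

    def report_failure(self, session_id: str, tier_name: str) -> None:
        """Escalate: later queries in this session start at least one tier higher."""
        nxt = min(TIERS.index(tier_name) + 1, len(TIERS) - 1)
        self.floor[session_id] = max(nxt, self.floor.get(session_id, 0))
```

The design keeps session memory outside the policy, so the same trained per-query selector can be reused with or without session-aware routing.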

What happens to an ARC-trained policy if you swap the underlying LLM provider?

ARC’s learned associations between configurations and outcomes are conditioned on the target model’s behavior. Swapping the underlying LLM (say, from GPT-4 to Claude) changes how a given prompt template, token budget, or tool sequence performs, invalidating the policy’s learned mapping. The policy would require retraining on config-outcome pairs collected under the new model, a cold-start cost the paper does not quantify.
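A simple operational guard follows from this: record which model a policy was trained against, and fall back to a static default when the serving model differs. The sketch below is hypothetical in every name (`PolicyArtifact`, `DEFAULT_CONFIG`, the model identifiers) and uses a lookup table as a stand-in for a learned policy.

```python
from dataclasses import dataclass

# Hypothetical safe fallback used until the policy is retrained.
DEFAULT_CONFIG = "static_default"


@dataclass(frozen=True)
class PolicyArtifact:
    trained_on_model: str  # identifier of the LLM used to collect training data
    table: dict            # query bucket -> config name (stand-in for a learned policy)


def select_config(policy: PolicyArtifact, serving_model: str, bucket: str):
    """Return (config name, reason). Refuses to apply a stale learned mapping."""
    if policy.trained_on_model != serving_model:
        return DEFAULT_CONFIG, "fallback: policy trained on a different model"
    return policy.table.get(bucket, DEFAULT_CONFIG), "learned"
```

Logging the model identifier alongside each config-outcome pair makes the same check possible at training time, so data collected under the old model is not silently mixed with data from the new one.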

How does SOLAR’s self-improvement layer differ from ARC’s approach?

ARC leaves model weights untouched and optimizes only the configuration wrapper: workflow topology, tool set, prompts, and budgets. SOLAR uses parameter-level meta-learning with multi-level reinforcement learning to adapt the model’s own weights for unseen domains while guarding against catastrophic forgetting. They are complementary: ARC selects the right config for a query, SOLAR adapts the model underneath. Stacking both is theoretically possible but no published work has tested the combination.

What’s the overfitting risk for a policy trained on ARC’s benchmark suite?

ARC v3 reports results on reasoning, tool-use, and τ-Bench tasks, all structured evaluative benchmarks with well-defined success criteria. The paper does not test on open-ended creative tasks, domain-specific workflows like legal or medical agent systems, or adversarial inputs. A policy that learns to exploit patterns in these benchmarks may not generalize to query distributions it was never trained on, replicating in learned form the same brittleness static templates exhibit when pushed outside their intended use cases.

How does the DDS paper’s convergence result constrain how large an ARC config space can get?

DDS reports that unbounded agentic search, in which an agent iterates freely over its own failure logs, fails to converge consistently, while decomposing the problem into layered declarative contracts yields bounded sub-problems that do converge. Applied to ARC, this implies the combinatorial configuration space cannot simply grow unchecked as new tools and prompts are added. Practitioners would need to structure the config space hierarchically (e.g., coarse-grained workflow families, then fine-grained tool subsets within each) to keep the SMDP’s learning problem tractable. ARC’s current benchmarks use a finite, curated config catalog; scaling to user-defined plugin ecosystems may require the kind of bounded sub-search decomposition DDS prescribes.
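A small count shows how much a hierarchical structure can shrink the space. The tools, prompts, workflow families, and per-family limits below are invented for illustration and do not come from either paper. The flat space allows every workflow with every prompt and every tool subset; the hierarchical one lets each workflow family declare which tools and prompts it permits and how many tools may be combined.

```python
from itertools import combinations

TOOLS = ["search", "calculator", "booking_api", "code_exec", "email", "calendar"]
PROMPTS = ["direct", "react", "plan_then_act"]
WORKFLOWS = ["single_agent", "planner_executor", "debate"]

# Flat space: every workflow x every prompt x every subset of tools.
flat_size = len(WORKFLOWS) * len(PROMPTS) * 2 ** len(TOOLS)

# Hierarchical space: hypothetical per-family restrictions.
FAMILIES = {
    "single_agent": {
        "tools": ["search", "calculator"],
        "prompts": ["direct", "react"],
        "max_tools": 1,
    },
    "planner_executor": {
        "tools": ["search", "booking_api", "code_exec"],
        "prompts": ["plan_then_act"],
        "max_tools": 2,
    },
    "debate": {"tools": ["search"], "prompts": ["react"], "max_tools": 1},
}


def options_in(family):
    """Enumerate (family, prompt, tool subset) options allowed in one family."""
    spec = FAMILIES[family]
    for k in range(spec["max_tools"] + 1):
        for subset in combinations(spec["tools"], k):
            for prompt in spec["prompts"]:
                yield (family, prompt, subset)


hier_size = sum(len(list(options_in(f))) for f in FAMILIES)

print(flat_size, hier_size)  # 576 15
```

With this structure a policy first chooses among three families and then among at most seven options inside one, instead of searching 576 flat combinations; adding a new tool enlarges only the families that opt into it.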