Cross-Domain RL Training Degrades Capabilities. CARE-RL Reweights to Fix It

CARE-RL argues that pooling math, code, and chat into one RL run causes silent capability erosion across domains, and proposes gradient subspace projection to reweight updates.

The standard recipe for post-training a reasoning model is to pool math, code, and chat into a single reinforcement-learning run and let the reward signal sort everything out. CARE-RL (arXiv:2606.00609), submitted 30 May 2026 by Rui Zhang, argues that this recipe has a hidden cost: gradient updates that improve one domain can silently degrade another, because naive reward pooling produces credit-assignment conflicts that the training loop never detects. The paper proposes a capability-aware reweighting scheme to mitigate the interference. All reported results are the authors’ own figures on two Qwen checkpoints, with no independent replication.

The cross-domain interference problem

Single-domain RL post-training has a straightforward credit-assignment story: the reward signal is homogeneous, the task distribution is narrow, and gradients generally point in the same direction. Multi-domain RL breaks that assumption. When a batch contains math proofs, code generation, and open-ended chat, the reward signals live on different scales and encode different notions of “good.” A gradient step that improves math reasoning can shift the model’s representations in ways that hurt its conversational ability, and the training loop has no mechanism to notice.

CARE-RL frames this as a negative-transfer problem inside the optimization itself. The conflict is not between tasks in a multitask learning sense; it is between the gradient directions that different domains push the model toward. When those directions are misaligned, the dominant domain (typically the one with the highest reward variance or the largest batch share) wins, and the losing domains’ capabilities erode.
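
One way to make "misaligned gradient directions" concrete is to compare per-domain gradients pairwise. The sketch below is illustrative and is not taken from the paper: it assumes a flattened gradient vector can be computed per domain on the same parameters, and the toy vectors and the helper name `cosine` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy per-domain gradients (hypothetical values, not measured on any model).
grads = {
    "math": [1.0, 0.5, 0.0],
    "code": [0.8, 0.6, 0.1],
    "chat": [-0.9, -0.2, 0.3],
}

names = sorted(grads)
pairwise = {
    (a, b): cosine(grads[a], grads[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
# A negative cosine means the two domains pull the parameters in opposing directions.
conflicts = [pair for pair, sim in pairwise.items() if sim < 0.0]
print(conflicts)
```

In this toy setup the chat gradient opposes both math and code, while math and code agree with each other. That is the situation in which a pooled update can help two domains at the expense of the third.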

Direction-Aware Capability Subspace Projection (DACSP)

CARE-RL’s core mitigation is DACSP, which intervenes at the gradient level. It works in three steps:

  1. Extract historical capability directions. From prior RL stages, DACSP computes the principal gradient directions associated with each domain’s capability. These directions encode what the model “learned” for math, for code, for chat, and so on.

  2. Decompose the incoming gradient. For each new update, DACSP projects it onto the historical capability directions and classifies the components: aligned (the update pushes in the same direction as the domain’s prior learning), conflicting (the update pushes against it), and orthogonal (the update is in a new direction that does not interact with existing capabilities).

  3. Reweight by component type. Aligned components get amplified. Conflicting components get suppressed. Orthogonal components pass through unchanged.

This is a gradient-editing strategy, not a reward-shaping one. It does not change what the model is rewarded for; it changes how the reward signal propagates into weight updates. The distinction matters because reward shaping requires per-domain reward engineering, while DACSP operates on the geometry of the gradient space and is, in principle, domain-agnostic.
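
The three steps can be sketched on toy vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one stored unit direction per domain, assumes those directions are orthonormal, and uses hypothetical coefficients `amplify` and `suppress`, since the abstract does not give the actual reweighting rule. The function name `dacsp_like_reweight` is also made up for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dacsp_like_reweight(update, directions, amplify=1.5, suppress=0.2):
    """Reweight an update against stored capability directions.

    Assumes `directions` maps domain name -> orthonormal unit vector.
    `amplify` and `suppress` are hypothetical coefficients.
    """
    residual = list(update)          # ends up as the orthogonal component
    edited = [0.0] * len(update)
    labels = {}
    for name, d in directions.items():
        coeff = dot(update, d)       # signed length of the update along d
        residual = [r - coeff * di for r, di in zip(residual, d)]
        if coeff > 0:
            labels[name], scale = "aligned", amplify
        elif coeff < 0:
            labels[name], scale = "conflicting", suppress
        else:
            labels[name], scale = "orthogonal", 1.0
        edited = [e + scale * coeff * di for e, di in zip(edited, d)]
    # The orthogonal residual passes through unchanged.
    edited = [e + r for e, r in zip(edited, residual)]
    return edited, labels

directions = {"math": [1.0, 0.0, 0.0], "chat": [0.0, 1.0, 0.0]}
update = [2.0, -1.0, 0.5]
edited, labels = dacsp_like_reweight(update, directions)
print(edited, labels)
```

Here the update agrees with the stored math direction, opposes the stored chat direction, and has a third component that touches neither. The sketch scales the first up, the second down, and leaves the third alone.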

Protocol-Aware Generative Reward Model (PA-GRM)

The second component addresses a prerequisite problem: to detect cross-domain interference, you need reward signals that are comparable across domains in the first place. Verifiable tasks (math with a ground-truth answer, code with unit tests) have objective reward signals. Non-verifiable tasks (open-ended chat, instruction following) do not.

PA-GRM builds prompt-level evaluation protocols before generating rewards. Rather than scoring a response directly, the model first constructs a rubric-like evaluation plan tailored to the specific prompt, then produces a trace-conditioned reward against that plan. The intent is to make chat and instruction-following rewards as structured as math rewards, so that cross-domain gradient conflicts are detectable rather than buried in noise.
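
The protocol-then-score flow can be sketched as a two-stage function. Everything here is a stand-in: `build_protocol` and `judge` are hypothetical callables in place of the generative reward model, the keyword checks are toys, and averaging the per-criterion scores is an assumption made for illustration, not the paper's aggregation rule.

```python
from typing import Callable, List

def protocol_then_reward(
    prompt: str,
    response: str,
    build_protocol: Callable[[str], List[str]],
    judge: Callable[[str, str], float],
) -> float:
    """Two-stage scoring: build a per-prompt rubric, then score against it."""
    criteria = build_protocol(prompt)                  # stage 1: evaluation plan
    if not criteria:
        raise ValueError("protocol must contain at least one criterion")
    scores = [judge(c, response) for c in criteria]    # stage 2: per-criterion scores
    return sum(scores) / len(scores)                   # assumed aggregation: mean

# Toy stand-ins for the generative model (hypothetical; keyword checks only).
def toy_protocol(prompt: str) -> List[str]:
    return ["polite", "refund"] if "refund" in prompt else ["polite"]

def toy_judge(criterion: str, response: str) -> float:
    keywords = {"polite": "please", "refund": "refund"}
    return 1.0 if keywords[criterion] in response.lower() else 0.0

reward = protocol_then_reward(
    "Customer asks about a refund.",
    "Please allow five days for processing.",
    toy_protocol,
    toy_judge,
)
print(reward)
```

The point of the structure is that the rubric is fixed before the response is scored, so two responses to the same prompt are judged against the same criteria.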

Whether PA-GRM actually produces rewards that are comparable across domains at the precision DACSP needs is an empirical question, and the paper’s self-reported benchmarks are the only evidence currently available.

Reported results and their caveats

According to the authors’ abstract, CARE-RL achieves Total Avg scores of 47.9 on Qwen2.5-7B and 50.7 on Qwen3-4B across benchmarks spanning math, chat, and instruction following, consistently outperforming standard multi-domain RL baselines. These numbers come from the paper’s own evaluation setup:

  • No independent replication exists as of 2026-06-03.
  • The model coverage is narrow: two Qwen checkpoints (7B and 4B parameter counts), both from the same model family.
  • No GPT, Claude, or Gemini baselines are reported, so the interaction between CARE-RL and proprietary model architectures is unknown.
  • Per-domain breakdowns are not available in the abstract, which means the practitioner cannot assess whether the Total Avg improvement comes from uniform gains across domains or from large gains in one domain masking continued degradation in another.

The last point is the one that matters for anyone considering implementing this. A Total Avg of 50.7 is compatible with math improving by 8 points and chat degrading by 3, or with all domains improving by similar margins. Without per-domain tables, the practitioner must replicate the experiment on their own model and data to learn whether the method helps their specific capability profile.
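
The arithmetic behind that ambiguity is easy to check. The per-domain deltas below are hypothetical and are not taken from the paper, and the snippet assumes the aggregate is an unweighted mean over three domains, which the abstract does not confirm.

```python
from statistics import mean

# Hypothetical per-domain score changes, in points (not from the paper).
uneven = {"math": 8.0, "chat": -3.0, "instruction_following": 1.0}
uniform = {"math": 2.0, "chat": 2.0, "instruction_following": 2.0}

avg_uneven = mean(uneven.values())
avg_uniform = mean(uniform.values())

# Domains that got worse even though the average improved.
regressed = [domain for domain, delta in uneven.items() if delta < 0]
print(avg_uneven, avg_uniform, regressed)
```

Both profiles average to the same two-point gain, but only one of them hides a regression in chat.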

What this means for post-training pipelines

The practical takeaway is not “adopt DACSP” (the evidence is too thin for that). It is that multi-domain RL post-training has a cost that most pipelines currently treat as zero.

If you are running math, code, and chat through a single RL stage with a pooled reward, CARE-RL’s framing suggests you should measure per-domain capability drift during training, not just aggregate reward. The paper’s interference mechanism predicts that domains with lower reward variance or smaller batch shares will silently degrade. A lightweight diagnostic would be to track per-domain held-out accuracy every N steps and alert on divergence, which is cheaper than implementing full subspace projection.
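
A minimal sketch of such a diagnostic follows. It assumes you already have a per-domain held-out evaluation that returns accuracy in percentage points; the class name `DriftMonitor`, the two-point tolerance, and the toy evaluation history are all hypothetical choices for illustration.

```python
class DriftMonitor:
    """Alert when a domain's held-out accuracy falls below its best-so-far."""

    def __init__(self, tolerance=2.0):
        self.tolerance = tolerance   # allowed drop, in percentage points
        self.best = {}               # domain -> best accuracy seen so far

    def update(self, step, accuracies):
        alerts = []
        for domain, acc in accuracies.items():
            best = self.best.get(domain, acc)
            if best - acc > self.tolerance:
                alerts.append((step, domain, round(best - acc, 2)))
            self.best[domain] = max(best, acc)
        return alerts

monitor = DriftMonitor(tolerance=2.0)
# Toy evaluation history: (training step, per-domain held-out accuracy).
history = [
    (0,   {"math": 40.0, "code": 35.0, "chat": 60.0}),
    (100, {"math": 44.0, "code": 36.0, "chat": 59.0}),
    (200, {"math": 47.0, "code": 36.5, "chat": 56.5}),
]
all_alerts = []
for step, accs in history:
    all_alerts.extend(monitor.update(step, accs))
print(all_alerts)
```

In the toy history the aggregate keeps rising while chat slips, and the monitor raises an alert at step 200 only. Comparing against best-so-far rather than the previous checkpoint is a design choice that catches slow drift as well as sudden drops.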

For teams that do want to adopt capability-aware gradient editing, DACSP’s compute overhead is the open question. The historical capability directions require storing and updating principal-component representations for each domain, and the per-step projection adds a matrix operation to every gradient update. The available abstract does not report wall-clock or FLOPs overhead.
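
A back-of-envelope estimate shows why the overhead question matters. The numbers below are assumptions, not figures from the paper: the sketch supposes directions are stored as dense full-parameter vectors in fp32, a nominal 7e9 parameters, three domains, and four directions per domain. Both function names are hypothetical.

```python
def direction_storage_gb(n_params, n_domains, k_per_domain, bytes_per_value=4):
    """Memory for dense full-parameter directions, in gigabytes (1e9 bytes)."""
    return n_params * n_domains * k_per_domain * bytes_per_value / 1e9

def projection_flops(n_params, n_domains, k_per_domain):
    """Approximate FLOPs per step: a dot product (2n) plus a scaled add (2n)
    for every stored direction."""
    return 4 * n_params * n_domains * k_per_domain

# Hypothetical setting: nominal 7e9 parameters, 3 domains, 4 directions each, fp32.
storage = direction_storage_gb(7e9, 3, 4)
flops = projection_flops(7e9, 3, 4)
print(storage, flops)
```

Under these assumptions, dense storage alone comes to 336 GB, far more than the projection arithmetic would suggest is the bottleneck. A practical implementation would presumably need a per-layer or low-rank representation, but the abstract does not say which, if any, the authors use.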

Broader context: reward design and representational transfer

CARE-RL is not the only paper probing the assumptions underneath current RL-for-LLM practice. Two concurrent submissions reinforce different facets of the problem.

InfoMem (arXiv:2606.03329), submitted 2 June 2026, studies reward signal design for chunk-wise memory agents under the same GRPO framework that many reasoning-model pipelines use. Its findings converge on a set of constraints: effective reward signals should operate only on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. The normalization point is directly relevant to CARE-RL’s cross-domain problem; if rewards are not normalized before mixing, the domain with the highest raw reward magnitude dominates the gradient regardless of DACSP’s reweighting.
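
The scale problem is simple to demonstrate. The sketch below applies per-domain z-scoring before pooling, which is one common form of normalization and not necessarily the scheme InfoMem or CARE-RL uses; the reward scales and the function name `normalize_per_domain` are hypothetical.

```python
from statistics import mean, pstdev

def normalize_per_domain(samples, eps=1e-8):
    """Z-score rewards within each domain before pooling them into one batch."""
    by_domain = {}
    for domain, reward in samples:
        by_domain.setdefault(domain, []).append(reward)
    stats = {d: (mean(r), pstdev(r)) for d, r in by_domain.items()}
    return [
        (domain, (reward - stats[domain][0]) / (stats[domain][1] + eps))
        for domain, reward in samples
    ]

# Toy batch: code rewards on a 0-100 scale, chat on a 0-1 scale (hypothetical).
batch = [("code", 90.0), ("code", 10.0), ("chat", 0.9), ("chat", 0.1)]
normalized = normalize_per_domain(batch)
print(normalized)
```

Before normalization the code rewards are two orders of magnitude larger than the chat rewards; afterwards both domains contribute values of about plus and minus one, so neither dominates purely because of its raw scale.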

A concurrent negative-result paper (arXiv:2606.03280), also submitted 2 June 2026, takes on a different transfer question: whether representational alignment between models is sufficient for useful causal communication. The authors inject translated activations from a Pythia-160M sender into a Pythia-410M receiver and find that even with cosine similarity around 0.97 between sender and receiver hidden states, downstream answering does not improve. The result is a reminder that statistical similarity between representations does not guarantee functional compatibility, which is the same underlying caution that motivates DACSP’s gradient-level approach: surface-level alignment (high reward, similar representations) can coexist with hidden interference.

The direction CARE-RL points in, subspace-aware gradient modulation, is plausible and technically grounded. Whether it works outside two Qwen checkpoints on self-reported benchmarks is a question that will be answered by whoever replicates it first. Until then, the actionable finding is the problem statement itself: pooling domains into one RL run is not free, and the cost shows up in the domains you are not watching.

Frequently Asked Questions

Does DACSP require gradient-level access to the model during training?

Yes. DACSP projects each incoming gradient update onto stored per-domain capability directions and reweights the resulting components before the update is applied. Teams using closed-model fine-tuning APIs (OpenAI, Anthropic, Google) that expose only loss curves or checkpoint diffs cannot use DACSP, because it requires modifying the training loop itself. This constraint is consistent with the paper testing exclusively on open-weight Qwen checkpoints, where full gradient access is available.

What happens when a new domain is added to a DACSP-protected run mid-training?

DACSP’s protection depends on historical gradient directions extracted from prior RL stages. When a domain with no prior history (such as retrieval-augmented generation or tool use) enters the mix, all its gradient components default to “orthogonal” and pass through unmoderated. Until enough steps accumulate to establish a stored capability direction, the new domain’s updates can freely conflict with existing capabilities, creating a vulnerability window the method does not address.

Does the Pythia activation-transfer failure affect DACSP reuse across model versions?

The concurrent negative-result study (arXiv:2606.03280) found that translated activations from Pythia-160M to Pythia-410M produced no downstream improvement despite cosine similarity around 0.97 between hidden states. By analogy, capability directions extracted for one checkpoint may not transfer reliably even to a different size within the same family, although the study does not test gradient directions directly. Teams upgrading from Qwen2.5-7B to a newer or larger checkpoint should plan to recompute historical directions from scratch rather than porting stored subspace representations.

How can teams detect cross-domain interference without implementing subspace projection?

Run each domain’s held-out evaluation separately before and after a pooled RL run and compare per-domain deltas against single-domain baselines for the same model. A domain whose accuracy drops more than 2-3 percentage points below its single-domain trajectory is likely experiencing gradient conflict. This requires no training-loop modification, but it detects interference only after the damage is done, not during training where it could be halted or reweighted in real time.
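
That post-hoc comparison can be sketched in a few lines. The scores are hypothetical held-out results in percentage points, the function name `flag_interference` is made up for this example, and the 2.5-point threshold is simply a value inside the 2-3 point range suggested above.

```python
def flag_interference(pooled, single_domain, threshold=2.5):
    """Return domains whose pooled-run score trails the single-domain run
    by more than `threshold` percentage points."""
    flagged = {}
    for domain, pooled_score in pooled.items():
        gap = single_domain[domain] - pooled_score
        if gap > threshold:
            flagged[domain] = round(gap, 2)
    return flagged

# Hypothetical held-out scores after training (percentage points).
pooled_run = {"math": 46.0, "code": 37.0, "chat": 55.0}
single_runs = {"math": 47.0, "code": 37.5, "chat": 60.0}
flagged = flag_interference(pooled_run, single_runs)
print(flagged)
```

Only chat is flagged here: math and code trail their single-domain runs by a point or less, while chat trails by five.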

Sources

  1. CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts. arXiv:2606.00609. Primary source, accessed 2026-06-03.
  2. InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain. arXiv:2606.03329. Analysis, accessed 2026-06-03.
  3. A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting. arXiv:2606.03280. Analysis, accessed 2026-06-03.