Cross-Domain RL Training Degrades Capabilities. CARE-RL Reweights to Fix It

The standard recipe for post-training a reasoning model is to pool math, code, and chat into a single reinforcement-learning run and let the reward signal sort everything out. CARE-RL (arXiv:2606.00609), submitted 30 May 2026 by Rui Zhang, Xinle Wu, and Yao Lu, argues that this recipe has a hidden cost: gradient updates that improve one domain can silently degrade another, because naive reward pooling produces credit-assignment conflicts that the training loop never detects. The paper proposes a capability-aware reweighting scheme to mitigate the interference. All reported results are the authors’ own figures on two Qwen checkpoints, with no independent replication.

The cross-domain interference problem

Single-domain RL post-training has a straightforward credit-assignment story: the reward signal is homogeneous, the task distribution is narrow, and gradients generally point in the same direction. Multi-domain RL breaks that assumption. When a batch contains math proofs, code generation, and open-ended chat, the reward signals live on different scales and encode different notions of “good.” A gradient step that improves math reasoning can shift the model’s representations in ways that hurt its conversational ability, and the training loop has no mechanism to notice.

CARE-RL frames this as a negative-transfer problem inside the optimization itself. The conflict is not between tasks in a multitask learning sense; it is between the gradient directions that different domains push the model toward. When those directions are misaligned, the dominant domain (typically the one with the highest reward variance or the largest batch share) wins, and the losing domains’ capabilities erode.
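One way to make "misaligned gradient directions" concrete is the cosine similarity between per-domain gradients: a negative value means a step that helps one domain pushes against the other. The sketch below is illustrative only. The function name, the flattened-gradient representation, and the toy vectors are assumptions made for this article, not anything taken from the paper.

```python
import numpy as np

def pairwise_gradient_cosine(grads):
    """Cosine similarity between every pair of per-domain gradient vectors.

    grads: dict mapping a domain name to a 1-D array (a flattened gradient).
    Negative values indicate conflicting update directions.
    """
    names = list(grads)
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ga = np.asarray(grads[a], dtype=float)
            gb = np.asarray(grads[b], dtype=float)
            sims[(a, b)] = float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb)))
    return sims

# Toy 2-D "gradients", chosen by hand for illustration.
toy = {"math": [1.0, 0.0], "code": [1.0, 1.0], "chat": [-1.0, 1.0]}
sims = pairwise_gradient_cosine(toy)
# math/code point the same way (positive), math/chat conflict (negative),
# and code/chat are perpendicular (zero).
```

In a real run the vectors would have billions of entries and would be noisy, so a single-step cosine is a weak signal; averaging over many steps is the more plausible use.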

Direction-Aware Capability Subspace Projection (DACSP)

CARE-RL’s core mitigation is DACSP, which intervenes at the gradient level. It works in three steps:

Extract historical capability directions. From prior RL stages, DACSP computes the principal gradient directions associated with each domain’s capability. These directions encode what the model “learned” for math, for code, for chat, and so on.
Decompose the incoming gradient. For each new update, DACSP projects it onto the historical capability directions and classifies the components: aligned (the update pushes in the same direction as the domain’s prior learning), conflicting (the update pushes against it), and orthogonal (the update is in a new direction that does not interact with existing capabilities).
Reweight by component type. Aligned components get amplified. Conflicting components get suppressed. Orthogonal components pass through unchanged.
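The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors’ implementation: the abstract does not specify how directions are extracted or how components are weighted, so the SVD-based extraction, the sign-orientation trick, the function names, and the `amplify` and `suppress` values are all hypothetical choices.

```python
import numpy as np

def capability_directions(history, k):
    """Step 1 (assumed form): top-k principal directions of past updates.

    history: (n_steps, dim) array of one domain's earlier update vectors.
    Returns a (k, dim) array with orthonormal rows.
    """
    history = np.asarray(history, dtype=float)
    _, _, vt = np.linalg.svd(history, full_matrices=False)
    dirs = vt[:k]
    # SVD signs are arbitrary, so orient each direction to agree with the
    # mean past update; otherwise "aligned" vs "conflicting" is undefined.
    signs = np.sign(dirs @ history.mean(axis=0))
    signs[signs == 0] = 1.0
    return dirs * signs[:, None]

def reweight_update(update, dirs, amplify=1.5, suppress=0.1):
    """Steps 2 and 3: decompose an update, then reweight by component type."""
    update = np.asarray(update, dtype=float)
    coeffs = dirs @ update                              # projection coefficients
    aligned = dirs.T @ np.clip(coeffs, 0.0, None)       # same sign as history
    conflicting = dirs.T @ np.clip(coeffs, None, 0.0)   # opposite sign
    orthogonal = update - dirs.T @ coeffs               # outside the subspace
    return amplify * aligned + suppress * conflicting + orthogonal

dirs = capability_directions([[1, 0, 0], [2, 0, 0], [3, 0, 0]], k=1)
helped = reweight_update([2, 3, 0], dirs)   # aligned part amplified
hurt = reweight_update([-2, 3, 0], dirs)    # conflicting part suppressed
```

Setting `amplify=1.0` and `suppress=0.0` reduces this to a pure projection-style protection scheme, which makes the amplification term easy to ablate in isolation.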

This is a gradient-editing strategy, not a reward-shaping one. It does not change what the model is rewarded for; it changes how the reward signal propagates into weight updates. The distinction matters because reward shaping requires per-domain reward engineering, while DACSP operates on the geometry of the gradient space and is, in principle, domain-agnostic.

DACSP in the gradient-surgery lineage

The mechanism is not new in shape, only in setting. Gradient surgery for conflicting objectives has nearly a decade of supervised-learning precedent. PCGrad (Yu et al., 2020) projects a task gradient onto the normal plane of any other task gradient it conflicts with, deleting the negative-cosine component outright. GradNorm (Chen et al., 2018) balances the magnitudes of per-task gradients so that no single loss dominates by sheer scale. The closest relative is Gradient Projection Memory (GPM; Saha et al., ICLR 2021), which runs SVD on the activations of prior tasks to extract a “core” subspace and then projects new updates orthogonal to it so they cannot overwrite what was already learned. DACSP’s “extract principal capability directions, decompose the incoming gradient, suppress the conflicting component” recipe is recognizably GPM moved from continual supervised learning into the RL post-training loop.
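For reference, the pairwise projection at the heart of PCGrad fits in a few lines. The sketch shows only that single step; the published algorithm additionally iterates over the other tasks in random order and sums the projected gradients. The function name and toy vectors are illustrative.

```python
import numpy as np

def pcgrad_project(g_i, g_j):
    """Remove from g_i the component that conflicts with g_j.

    If the two gradients have a non-negative dot product they do not
    conflict and g_i is returned unchanged.
    """
    g_i = np.asarray(g_i, dtype=float)
    g_j = np.asarray(g_j, dtype=float)
    dot = float(g_i @ g_j)
    if dot >= 0.0:
        return g_i.copy()
    # Project g_i onto the plane normal to g_j.
    return g_i - (dot / float(g_j @ g_j)) * g_j

projected = pcgrad_project([1.0, 0.0], [-1.0, 1.0])
```

After projection the result is perpendicular to the gradient it conflicted with, which is the "deletion" the paragraph above describes. DACSP, as summarized here, differs by scaling the conflicting component down rather than removing it and by amplifying the aligned one.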

Two differences are worth naming. First, GPM and Orthogonal Gradient Descent only protect old directions by deleting interference; DACSP additionally amplifies the aligned component, which is a reinforcement move rather than a forgetting-prevention one. Whether amplification helps or just accelerates overfitting to the dominant domain is exactly the kind of thing per-domain tables would reveal and the abstract does not. Second, the supervised methods compute their subspaces from comparatively low-variance gradients. Policy-gradient RL gradients are far noisier, and the principal directions DACSP stores are estimated from that noise. A capability direction recovered from a high-variance, reward-scaled gradient is a much weaker object than one recovered from a clean supervised loss, and the paper offers no evidence about how stable those directions are across training. The idea is sound; the estimation problem in the RL regime is harder than the supervised analogues that inspired it. This is the same tension that shows up in work on which failed reasoning traces RL can actually repair, where the signal you want to act on is buried in the variance of the trajectories you collected.

Protocol-Aware Generative Reward Model (PA-GRM)

The second component addresses a prerequisite problem: to detect cross-domain interference, you need reward signals that are comparable across domains in the first place. Verifiable tasks (math with a ground-truth answer, code with unit tests) have objective reward signals. Non-verifiable tasks (open-ended chat, instruction following) do not.

PA-GRM builds prompt-level evaluation protocols before generating rewards. Rather than scoring a response directly, the model first constructs a rubric-like evaluation plan tailored to the specific prompt, then produces a trace-conditioned reward against that plan. The intent is to make chat and instruction-following rewards as structured as math rewards, so that cross-domain gradient conflicts are detectable rather than buried in noise.
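The ordering described above (protocol first, reward second) can be shown as a small data-flow sketch. Everything named here is hypothetical: `Criterion`, `protocol_reward`, and the two callables stand in for model calls whose real form the abstract does not describe, and the "trace-conditioned" aspect is not modeled at all.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str   # one rubric item derived from the prompt
    weight: float      # relative importance within the protocol

def protocol_reward(
    prompt: str,
    response: str,
    build_protocol: Callable[[str], List[Criterion]],
    judge: Callable[[Criterion, str], float],
) -> float:
    """Score a response against a rubric built from the prompt alone.

    build_protocol sees only the prompt, so the rubric cannot be shaped by
    the response it will later grade. judge returns a value in [0, 1] per
    criterion. The result is a weighted mean in [0, 1], the same range as a
    pass/fail reward from a verifiable task.
    """
    protocol = build_protocol(prompt)
    total_weight = sum(c.weight for c in protocol)
    if total_weight == 0:
        return 0.0
    score = sum(c.weight * judge(c, response) for c in protocol)
    return score / total_weight
```

The design point worth noting is the fixed output range: if chat rewards land in [0, 1] like unit-test pass rates do, cross-domain comparisons at least start from a common scale.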

Whether PA-GRM actually produces rewards that are comparable across domains at the precision DACSP needs is an empirical question, and the paper’s self-reported benchmarks are the only evidence currently available.

Reported results and their caveats

According to the authors’ abstract, CARE-RL achieves Total Avg scores of 47.9 on Qwen2.5-7B and 50.7 on Qwen3-4B across benchmarks spanning math, chat, and instruction following, consistently outperforming standard multi-domain RL baselines. These numbers come from the paper’s own evaluation setup, and several caveats apply:

No independent replication exists as of 2026-06-03.
The model coverage is narrow: two Qwen checkpoints (7B and 4B parameter counts), both from the same model family.
No GPT, Claude, or Gemini baselines are reported, so the interaction between CARE-RL and proprietary model architectures is unknown.
Per-domain breakdowns are not available in the abstract, which means the practitioner cannot assess whether the Total Avg improvement comes from uniform gains across domains or from large gains in one domain masking continued degradation in another.

The last point is the one that matters for anyone considering implementing this. A Total Avg of 50.7 is compatible with math improving by 8 points and chat degrading by 3, or with all domains improving by similar margins. Without per-domain tables, the practitioner must replicate the experiment on their own model and data to learn whether the method helps their specific capability profile.
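The masking effect is simple arithmetic. The sketch below assumes Total Avg is an unweighted mean over domains, which the abstract does not confirm, and the per-domain deltas are invented for illustration rather than taken from the paper.

```python
def total_avg_delta(per_domain_deltas):
    """Unweighted mean of per-domain score changes (assumed aggregation)."""
    return sum(per_domain_deltas.values()) / len(per_domain_deltas)

# Two hypothetical outcomes, in benchmark points relative to a baseline.
skewed = {"math": 8.0, "chat": -3.0, "instruction_following": 1.0}
uniform = {"math": 2.0, "chat": 2.0, "instruction_following": 2.0}

skewed_avg = total_avg_delta(skewed)    # 2.0
uniform_avg = total_avg_delta(uniform)  # 2.0
```

Both profiles report the same aggregate gain, yet the first one ships a model whose chat ability got worse.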

What this means for post-training pipelines

The practical takeaway is not “adopt DACSP” (the evidence is too thin for that). It is that multi-domain RL post-training has a cost that most pipelines currently treat as zero.

If you are running math, code, and chat through a single RL stage with a pooled reward, CARE-RL’s framing suggests you should measure per-domain capability drift during training, not just aggregate reward. The paper’s interference mechanism predicts that domains with lower reward variance or smaller batch shares will silently degrade. A lightweight diagnostic would be to track per-domain held-out accuracy every N steps and alert on divergence, which is cheaper than implementing full subspace projection.
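A minimal version of that diagnostic might look like the following. The function name, the history format, and the one-point tolerance are assumptions for illustration; a real pipeline would pick a tolerance from the noise level of its own evaluations.

```python
def flag_drifting_domains(history, tolerance=1.0):
    """Flag domains whose latest held-out score fell below their earlier best.

    history: dict mapping a domain name to a list of scores, one per
             evaluation checkpoint, oldest first.
    Returns {domain: drop_in_points} for drops larger than tolerance.
    """
    flagged = {}
    for domain, scores in history.items():
        if len(scores) < 2:
            continue  # nothing to compare against yet
        drop = max(scores[:-1]) - scores[-1]
        if drop > tolerance:
            flagged[domain] = drop
    return flagged

# Hypothetical scores at three checkpoints of a pooled run.
history = {"math": [40.0, 44.0, 48.0], "chat": [60.0, 59.5, 56.0]}
alerts = flag_drifting_domains(history)  # {"chat": 4.0}
```

In this toy history the mean across domains rises from 50 to 52 while chat loses four points, which is exactly the case an aggregate-only dashboard would miss.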

It is also worth asking what DACSP is competing against, because gradient surgery is the most invasive option on the menu. The cheapest mitigation is sequencing: run the domains in separate RL stages with a curriculum rather than pooling them, accepting some forgetting between stages in exchange for never mixing conflicting gradients in one batch. A middle option is structural isolation, training a separate LoRA adapter per domain and merging or routing at inference, so the conflicting updates never share the same weights to begin with. A third is the boring one that InfoMem (arXiv:2606.03329, discussed below) points at: normalize and scale the rewards correctly before they are composed, which removes the magnitude-domination failure mode without touching the gradient at all. CARE-RL’s pitch is that none of these recover the cross-domain transfer you actually want (a model trained well on math should also get better at code), and that only same-weight training with conflict-aware reweighting preserves the positive transfer while suppressing the negative. That is a real argument, but the paper does not isolate it against these baselines, so the marginal value of DACSP over a well-normalized GRPO (Group Relative Policy Optimization) run with a curriculum is unmeasured. The reweighting-versus-resampling tradeoff is the same one that surfaces when a single RLHF pass cannot satisfy heterogeneous reward targets: you can either fix the optimizer or fix the data mixture, and the two are rarely compared head to head.
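The normalization option is cheap enough to sketch. The version below standardizes rewards within each domain before they are mixed into one batch; it is one plausible reading of "normalize before composing", not a procedure taken from InfoMem or CARE-RL, and the function name and toy rewards are assumptions.

```python
import numpy as np

def normalize_per_domain(rewards, domains, eps=1e-8):
    """Z-score rewards separately within each domain.

    rewards: sequence of raw scalar rewards for one batch.
    domains: sequence of domain labels, same length as rewards.
    After this step every domain has roughly zero mean and unit variance,
    so no domain dominates the gradient purely through reward scale.
    """
    rewards = np.asarray(rewards, dtype=float)
    domains = np.asarray(domains)
    out = np.empty_like(rewards)
    for d in np.unique(domains):
        mask = domains == d
        r = rewards[mask]
        out[mask] = (r - r.mean()) / (r.std() + eps)
    return out

# Math uses 0/1 rewards, chat uses a hypothetical 0-100 judge score.
raw = [0.0, 1.0, 0.0, 1.0, 10.0, 90.0]
labels = ["math", "math", "math", "math", "chat", "chat"]
scaled = normalize_per_domain(raw, labels)
```

Before normalization the chat rewards are up to 90 times larger than the math ones; afterwards both domains span the same range. This addresses scale domination only and does nothing about directional conflict, which is the gap DACSP claims to fill.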

For teams that do want to adopt capability-aware gradient editing, DACSP’s compute overhead is the open question. The historical capability directions require storing and updating principal-component representations for each domain, and the per-step projection adds a matrix operation to every gradient update. The abstract, the only source available so far, does not report wall-clock or FLOP overhead.
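A back-of-envelope estimate shows why the overhead question matters. The numbers below assume the worst case of dense, full-parameter directions stored in 32-bit floats, a round 7 billion parameters, 8 directions per domain, and 3 domains. All of these are illustrative assumptions; the paper may well use per-layer or low-rank directions that cost far less.

```python
def direction_storage_gb(n_params, k, n_domains, bytes_per_value=4):
    """Memory for k dense directions per domain, in gigabytes (1e9 bytes)."""
    return n_params * k * n_domains * bytes_per_value / 1e9

def projection_flops_per_step(n_params, k, n_domains):
    """Rough FLOPs to project one update onto every stored direction.

    About 2*k*n for the k dot products plus about 2*k*n to rebuild the
    projected component, repeated for each domain.
    """
    return 4 * k * n_params * n_domains

storage = direction_storage_gb(7_000_000_000, k=8, n_domains=3)      # 672.0 GB
flops = projection_flops_per_step(7_000_000_000, k=8, n_domains=3)   # 6.72e11
```

Under these assumptions the projection arithmetic is modest, but the storage is not: hundreds of gigabytes of extra state would rival the optimizer itself, which is why the missing overhead numbers are a practical gap and not a formality.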

Broader context: reward design and representational transfer

CARE-RL is not the only paper probing the assumptions underneath current RL-for-LLM practice. Two concurrent submissions reinforce different facets of the problem.

InfoMem (arXiv:2606.03329), submitted 2 June 2026, studies reward signal design for chunk-wise memory agents under the same GRPO framework that many reasoning-model pipelines use. Its findings converge on a set of constraints: effective reward signals should operate only on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. The normalization point is directly relevant to CARE-RL’s cross-domain problem; if rewards are not normalized before mixing, the domain with the highest raw reward magnitude dominates the gradient regardless of DACSP’s reweighting.

A concurrent negative-result paper (arXiv:2606.03280), also submitted 2 June 2026, takes on a different transfer question: whether representational alignment between models is sufficient for useful causal communication. The authors inject translated activations from a Pythia-160M sender into a Pythia-410M receiver and find that even with cosine similarity around 0.97 between sender and receiver hidden states, downstream answering does not improve. The result is a reminder that statistical similarity between representations does not guarantee functional compatibility, which is the same underlying caution that motivates DACSP’s gradient-level approach: surface-level alignment (high reward, similar representations) can coexist with hidden interference.

The direction CARE-RL points in, subspace-aware gradient modulation, is plausible and technically grounded. Whether it works outside two Qwen checkpoints on self-reported benchmarks is a question that will be answered by whoever replicates it first. Until then, the actionable finding is the problem statement itself: pooling domains into one RL run is not free, and the cost shows up in the domains you are not watching.

Frequently Asked Questions

Does DACSP require gradient-level access to the model during training?

Yes. DACSP decomposes each incoming gradient against stored per-domain capability directions and reweights the resulting components before the update is applied. Teams using closed-model fine-tuning APIs (OpenAI, Anthropic, Google) that do not expose gradients or the training loop have no place to inject DACSP. This constraint is consistent with the paper’s exclusive testing on open-weight Qwen checkpoints, where full gradient access is available.

What happens when a new domain is added to a DACSP-protected run mid-training?

DACSP’s protection depends on historical gradient directions extracted from prior RL stages. When a domain with no prior history (such as retrieval-augmented generation or tool use) enters the mix, it has no stored direction of its own, so as the method is described its gradient components would presumably all be treated as “orthogonal” and pass through unmoderated. Until enough steps accumulate to establish a stored capability direction, the new domain’s updates can freely conflict with existing capabilities, creating a vulnerability window the method does not address.

Does the Pythia activation-transfer failure affect DACSP reuse across model versions?

The concurrent negative-result study (arXiv:2606.03280) showed that translated activations from Pythia-160M to Pythia-410M produced no downstream improvement despite 0.97 cosine similarity between hidden states. Gradient-space capability directions are a different object from activations, but the result suggests by analogy that directions extracted for one checkpoint are unlikely to transfer reliably even to a different size within the same family. Teams upgrading from Qwen2.5-7B to a newer or larger checkpoint would need to recompute historical directions from scratch rather than porting stored subspace representations.

How can teams detect cross-domain interference without implementing subspace projection?

Run each domain’s held-out evaluation separately before and after a pooled RL run and compare per-domain deltas against single-domain baselines for the same model. As a rule of thumb, a domain whose accuracy drops more than 2 to 3 percentage points below its single-domain trajectory is likely experiencing gradient conflict. This requires no training-loop modification, but it detects interference only after the damage is done, not during training, when the run could still be halted or reweighted.
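As a sketch, the comparison is a per-domain subtraction with a threshold. The function name, the score dictionaries, and the default threshold of 2.0 points (the low end of the range above) are illustrative assumptions.

```python
def interference_report(pooled, single_domain, threshold=2.0):
    """Domains where the pooled run trails its single-domain baseline.

    pooled:        {domain: held-out score after the pooled RL run}
    single_domain: {domain: score after RL on that domain alone}
    Returns {domain: gap_in_points} for gaps larger than threshold.
    """
    gaps = {
        d: single_domain[d] - pooled[d]
        for d in pooled
        if d in single_domain
    }
    return {d: gap for d, gap in gaps.items() if gap > threshold}

# Hypothetical held-out scores for the same base model.
pooled_scores = {"math": 48.0, "chat": 55.0}
solo_scores = {"math": 49.0, "chat": 59.0}
suspects = interference_report(pooled_scores, solo_scores)  # {"chat": 4.0}
```

The cost is one extra single-domain RL run per domain, so this is cheaper in engineering effort than subspace projection but not in compute.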