Poisoning Open-Source LLM Merges: One Bad Checkpoint Hijacks the Result

RogueMerge, a framework published June 2 on arXiv, demonstrates that a single adversarial task vector can survive the LLM model-merging process and retain its malicious behavior in the combined model. The work evaluates the attack across six merging algorithms and over 170 merged models, finding that existing defenses do not reliably remove the payload. The implication is direct: anyone assembling a community LLM from public fine-tuned checkpoints is trusting every contributor with write access to the final model’s weights.

What model merging actually does

Model merging is the standard workflow for building open-source LLMs outside the major labs. A practitioner takes a base model (typically Llama or Mistral), collects multiple fine-tuned variants, each specialized for a different task, and combines their parameter deltas into a single checkpoint. The most common tools, mergekit and the Hugging Face model-merging UI, treat every component as a benign task vector. The merge operation, whether simple averaging or a more sophisticated method such as TIES or DARE, weights the contributions and produces a combined model that ideally retains all capabilities.
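As a rough illustration of the workflow described above, here is a minimal sketch of a linear task-vector merge in plain Python. The function names, the dictionary layout, and the toy three-element "layer" are assumptions made for this example. They are not mergekit's API, and TIES and DARE add trimming and rescaling steps that are not shown.

```python
from typing import Dict, List

# Assumed toy layout: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Parameter delta of a fine-tune relative to the shared base model."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge_linear(base: Weights, vectors: List[Weights], coeffs: List[float]) -> Weights:
    """Add a weighted sum of task vectors to the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for vec, c in zip(vectors, coeffs):
        for k in merged:
            merged[k] = [m + c * d for m, d in zip(merged[k], vec[k])]
    return merged


base = {"layer.w": [1.0, 2.0, 3.0]}
ft_a = {"layer.w": [1.5, 2.0, 3.0]}  # fine-tune A moved the first weight
ft_b = {"layer.w": [1.0, 2.0, 2.0]}  # fine-tune B moved the last weight
vectors = [task_vector(ft_a, base), task_vector(ft_b, base)]
merged = merge_linear(base, vectors, [0.5, 0.5])
print(merged)  # {'layer.w': [1.25, 2.0, 2.5]}
```

Nothing in `merge_linear` looks at what a delta encodes. Every vector is added to the same weights with the coefficient it was given.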

The assumption behind this workflow is straightforward: if one fine-tune is slightly off, the averaging process dilutes the problem. If a checkpoint is outright malicious, the other contributors’ weights should overwhelm it. The community has operated on this assumption for over a year, with thousands of merged models published on Hugging Face.

RogueMerge frames model merging as a supply-chain problem. Each task vector is a set of parameter offsets, applied directly to the base model’s weights. The merging step does not inspect what those offsets encode. A malicious actor who publishes a fine-tuned adapter to a public repository has, by design, direct write access to every parameter in the final merged model. There is no sandboxing, no capability boundary, and no provenance check.

This is not a theoretical concern. The open-source LLM ecosystem relies on community-contributed checkpoints from unverified authors. A merged model assembled from five adapters might include four legitimate contributors and one adversary. The merge operation treats all five as equals.

Why prior backdoor work did not transfer to LLMs

Previous research on backdoor attacks against model merging focused on classifiers, not generative language models. That work used static arithmetic heuristics: embed a trigger pattern in the weights, hope it survives averaging. According to the RogueMerge paper, these approaches fail on autoregressive LLMs for three structural reasons.

First, autoregressive decoding compounds small parameter perturbations across generated tokens. A slight drift at token one becomes a significant deviation by token twenty. This means even small adversarial perturbations can propagate into coherent malicious output, but it also means naive backdoor injections that ignore the decoding chain get washed out unpredictably.
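The compounding effect can be made concrete with a deliberately simplified toy model, which is not taken from the paper. Assume each generated token independently has a small probability of departing from the unperturbed continuation. The function name and the 2% figure are illustrative assumptions.

```python
def divergence_probability(per_token_flip: float, num_tokens: int) -> float:
    """Chance that at least one of num_tokens tokens departs from the
    unperturbed continuation, assuming independent per-token flips."""
    return 1.0 - (1.0 - per_token_flip) ** num_tokens


for tokens in (1, 5, 20):
    print(tokens, round(divergence_probability(0.02, tokens), 3))
```

Real decoding is less forgiving than this independence assumption suggests, because every later token is conditioned on the one that diverged.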

Second, the attacker does not know the victim’s merging configuration. The number of other adapters, the merge algorithm used, the weighting scheme, and the base model version are all unknown at attack time. A static injection optimized for one configuration fails when the target uses a different one.

Third, practical attacks must generalize to prompts the attacker has never seen. A backdoor that fires only on exact trigger strings memorized during optimization is useless against real-world usage.

How RogueMerge solves the optimization problem

RogueMerge replaces static arithmetic injection with a joint optimization framework designed to survive the merge step. According to the paper, the approach has three components.

The core is a meta-learning-style simulation. Rather than optimizing a backdoor and hoping it survives merging, RogueMerge simulates the merge during optimization. The attack is trained to succeed after the averaging or interpolation step, not before it. This is formulated as a stochastic min-max problem: the attacker maximizes attack success while the merge operation acts as an adversarial perturbation.
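The idea of optimizing through a simulated merge can be sketched on a toy problem. This is not the paper's algorithm. The vector dimension, the `TARGET` direction, the uniform-averaging merge, and the random stand-ins for benign task vectors are all assumptions chosen to keep the example small. The only point is that the loss is measured after the merge, so the gradient already accounts for dilution.

```python
import random

DIM = 8
TARGET = [1.0] + [0.0] * (DIM - 1)  # hypothetical direction the objective cares about
GOAL = 1.0                           # desired projection of the merged delta on TARGET


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simulate_merge(adv, rng, num_benign):
    """Uniformly average adv with random stand-ins for benign task vectors."""
    benign = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(num_benign)]
    n = num_benign + 1
    merged = [(adv[i] + sum(b[i] for b in benign)) / n for i in range(DIM)]
    return merged, n


def train_merge_aware(steps=2000, lr=0.1, seed=0):
    """SGD on a loss that is measured after a freshly sampled simulated merge."""
    rng = random.Random(seed)
    adv = [0.0] * DIM
    for _ in range(steps):
        merged, n = simulate_merge(adv, rng, num_benign=rng.randint(1, 4))
        err = dot(merged, TARGET) - GOAL  # loss = err ** 2
        # merged = (adv + benign) / n, so d(loss)/d(adv) = 2 * err * TARGET / n
        adv = [a - lr * 2.0 * err * t / n for a, t in zip(adv, TARGET)]
    return adv


def projection_after_merge(adv, num_benign):
    """Projection on TARGET after averaging with num_benign zero vectors."""
    return dot(adv, TARGET) / (num_benign + 1)


naive = [GOAL * t for t in TARGET]  # tuned to hit GOAL before any merge
merge_aware = train_merge_aware()
for k in (1, 2, 3, 4):
    print(k, projection_after_merge(naive, k), projection_after_merge(merge_aware, k))
```

In this toy setting the vector tuned before merging loses most of its projection once it is averaged. The merge-aware vector is trained against sampled merges and retains more of it across the sampled merge sizes.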

To handle the uncertainty around which merge algorithm the victim will use, RogueMerge applies distributionally robust optimization. Instead of optimizing for a single known merge configuration, it optimizes over a distribution of possible configurations, seeking an attack that performs well across the worst-case settings. The paper reports using a tractable first-order Taylor approximation to make this computationally feasible at LLM scale, with a provable error bound on the approximation.
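For readers who want the shape of such an objective, the following is the generic textbook form of a distributionally robust problem, together with the standard first-order Taylor expansion. The notation is assumed for illustration and is not copied from the paper, whose exact formulation and error bound are not reproduced here.

```latex
% Assumed notation (not the paper's):
%   \delta       adversarial task vector
%   \theta_0     base model weights
%   c            merge configuration (algorithm, coefficients, other task vectors)
%   \mathcal{U}  uncertainty set of distributions over configurations
%   M_c          merge operator under configuration c
%   \ell         attack loss evaluated on the merged model
\min_{\delta} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{c \sim Q}
  \left[ \ell\big( M_c(\theta_0, \delta) \big) \right]

% First-order Taylor expansion of the loss around a reference point \theta:
\ell(\theta + \Delta) \approx \ell(\theta) + \nabla \ell(\theta)^{\top} \Delta
```

For a smooth loss, the error of the linear approximation is second order in the size of the perturbation, which is the kind of remainder an error bound would have to control.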

The generalization to unseen prompts is handled implicitly by the distributionally robust formulation. By optimizing over diverse merge configurations, the resulting adversarial task vector learns features that transfer across conditions rather than overfitting to specific trigger-prompt pairs.

What the evaluation covers

The paper evaluates RogueMerge across four threat types (the specific threats are detailed in the full paper), six merging algorithms, and over 170 merged LLMs. The abstract reports that RogueMerge consistently outperforms existing attacks across these settings, remains stable across diverse merging configurations, and resists standard defense mechanisms.

Because the full tables were not available for verification, the specific attack-success rates per algorithm and per threat type cannot be cited here. What the abstract-level evidence does support is that the reported success holds across the merge methods evaluated, not just against a single weak one.

Why defenses fall short

Standard defenses against backdoor attacks in merged models include weight pruning, weight clipping, and anomaly detection on task vectors. The paper reports that RogueMerge survives these countermeasures. A plausible explanation follows from the design: because the adversarial perturbation is optimized to survive the merge operation itself, it is likely to tolerate the kind of parameter changes that pruning or clipping introduces. The attack was built to be robust to parameter modification, so defenses that work by modifying parameters meet it at its strongest point.
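To make the first two defenses concrete, here is a minimal sketch of magnitude pruning and clipping applied to a single task vector. The function names, thresholds, and the six-element vector are assumptions for this example. Real defenses operate per tensor on full checkpoints and may choose thresholds differently.

```python
from typing import List


def magnitude_prune(vector: List[float], keep_fraction: float) -> List[float]:
    """Zero every entry smaller in magnitude than the k-th largest one.
    Entries tied with the threshold are all kept."""
    keep = max(1, round(len(vector) * keep_fraction))
    threshold = sorted((abs(v) for v in vector), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in vector]


def clip(vector: List[float], limit: float) -> List[float]:
    """Clamp every entry to the interval [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in vector]


delta = [0.9, -0.05, 0.3, -0.7, 0.02, 0.1]
print(magnitude_prune(delta, keep_fraction=0.5))  # [0.9, 0.0, 0.3, -0.7, 0.0, 0.0]
print(clip(delta, limit=0.5))                     # [0.5, -0.05, 0.3, -0.5, 0.02, 0.1]
```

Both operations are fixed, predictable transformations of the parameters, which is why an optimization that already tolerates merging may also tolerate them.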

This creates a troubling asymmetry. The attacker optimizes with full knowledge of the defense landscape. The defender, running a merge, has no efficient way to distinguish a benign task vector from an adversarial one without reproducing the attacker’s optimization process, which requires knowing the attack objective in advance.

What this means for the merge pipeline

The practical takeaway is that provenance verification of every component checkpoint is now a hard requirement for anyone shipping a merged model. The community’s trust model, where any public fine-tune can be folded into a merge and the averaging process provides a safety net, does not hold under adversarial conditions.

Per-component attestation is the minimum viable response. Before including a task vector in a merge, the assembler needs to verify who created it, from what training data, under what process. This raises the cost of community merges substantially. It also shifts responsibility onto the merge tooling itself: frameworks like mergekit would need to support or require provenance metadata before accepting a checkpoint.
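The most basic building block of such a check is pinning each component to a known digest before it enters the merge. The sketch below assumes a hypothetical allowlist that maps file names to SHA-256 digests obtained from a trusted channel. The function names and file names are invented for this example and are not features of mergekit.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unverified_components(paths: List[Path], trusted: Dict[str, str]) -> List[Path]:
    """Return checkpoints whose digest is missing from, or disagrees with, the allowlist."""
    return [p for p in paths if trusted.get(p.name) != sha256_of(p)]


# Demo with throwaway files standing in for checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "adapter_a.bin"
    bad = Path(tmp) / "adapter_b.bin"
    good.write_bytes(b"weights-a")
    bad.write_bytes(b"weights-b")
    pinned = {"adapter_a.bin": sha256_of(good), "adapter_b.bin": "0" * 64}
    rejected = [p.name for p in unverified_components([good, bad], pinned)]
print(rejected)  # ['adapter_b.bin']
```

A digest match only shows that the file is the one that was pinned. It says nothing about whether the weights are benign, which is why attestation of who trained them and how still matters.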

The alternative is accepting that merged models assembled from unverified public components are potentially compromised. For a community that has built its distribution model on trust-by-default, that is an uncomfortable position. RogueMerge did not create the vulnerability. Model merging always granted third-party vectors direct write access to weights. The paper simply demonstrates that the vulnerability is exploitable at LLM scale, across merge methods, and resistant to the defenses the community was counting on.

Frequently Asked Questions

Does RogueMerge target LoRA adapter merges or only full-weight task vectors?

The paper evaluates full-weight task vectors where parameter deltas have direct write access to every weight in the base model. LoRA-style low-rank adapter merges operate on decomposed matrices with constrained rank, which changes the attack geometry. RogueMerge’s distributionally robust optimization assumes unconstrained parameter access, so whether the same joint-optimization pipeline transfers to rank-limited adapter merges is outside the paper’s evaluated scope and remains an open research question.
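The rank constraint is easy to see in numbers. The sketch below compares the free parameters of a full weight delta with those of a low-rank update formed as the product of two factors, using plain Python lists. The function names and the 4096-by-4096 layer with rank 16 are illustrative assumptions, not figures from the paper.

```python
def full_delta_params(d_out: int, d_in: int) -> int:
    """Free parameters in an unconstrained weight delta."""
    return d_out * d_in


def lora_delta_params(d_out: int, d_in: int, rank: int) -> int:
    """Free parameters in a low-rank update B @ A with B: d_out x r, A: r x d_in."""
    return rank * (d_out + d_in)


def lora_delta(B, A):
    """Dense update B @ A; its rank is at most the shared inner dimension."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
            for i in range(len(B))]


print(full_delta_params(4096, 4096))          # 16777216
print(lora_delta_params(4096, 4096, 16))      # 131072
print(lora_delta([[1.0], [2.0]], [[3.0, 4.0]]))  # [[3.0, 4.0], [6.0, 8.0]]
```

In the rank-1 example every row of the result is a multiple of the same row vector, so a low-rank adapter cannot place arbitrary values at arbitrary positions the way a full-weight delta can.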

How does a merge-step backdoor differ from poisoned training data or a malicious fine-tune?

Poisoned training data requires the attacker to influence the dataset before training begins, which is costly and subject to data auditing. A malicious fine-tune is a single identifiable compromised artifact that can be isolated. A merge-step attack is harder to attribute because, once merged, the payload is blended into weights that combine contributions from multiple authors. The poisoned task vector is one among several, and nothing in the merged checkpoint identifies which component introduced the behavior. This distinguishes RogueMerge from most prior LLM supply-chain research, which focuses on data poisoning or fine-tune tampering rather than the merge step itself.

What would provenance attestation for merged models require in practice?

Each component checkpoint would need a signed manifest covering training data provenance, fine-tuning code and hyperparameters, and publisher identity. This parallels software supply-chain frameworks like SLSA (Supply-chain Levels for Software Artifacts) but applied to ML weight files. Merge tooling would need to reject any component lacking a verifiable chain. The practical barrier is that most community fine-tunes on Hugging Face currently publish only a free-text model card with no machine-readable provenance metadata, so retrofitting attestation would require an ecosystem-wide format standard.
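A machine-readable manifest of this kind could look roughly like the following sketch. The field names are hypothetical and do not follow any existing standard. The HMAC with a shared key is only a stand-in that keeps the example within the standard library, and a real scheme would use asymmetric signatures tied to a publisher identity.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CheckpointManifest:
    # All field names are hypothetical; no existing standard is implied.
    publisher: str
    weights_sha256: str
    training_data: str
    finetune_code: str
    hyperparameters: str


def canonical_bytes(manifest: CheckpointManifest) -> bytes:
    """Deterministic serialization so signer and verifier hash identical bytes."""
    return json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(manifest: CheckpointManifest, key: bytes) -> str:
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify(manifest: CheckpointManifest, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(manifest, key), signature)


key = b"demo-key-not-for-real-use"
manifest = CheckpointManifest(
    publisher="example-publisher",
    weights_sha256=hashlib.sha256(b"weights").hexdigest(),
    training_data="example-dataset@v1",
    finetune_code="example-repo@abc123",
    hyperparameters="lr=2e-5,epochs=3",
)
signature = sign(manifest, key)
tampered = replace(manifest, training_data="something-else")
print(verify(manifest, signature, key), verify(tampered, signature, key))  # True False
```

Merge tooling that required such a manifest would refuse any component whose signature fails to verify or whose weight digest does not match the file on disk.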

Could merge algorithms be hardened against RogueMerge without provenance checks?

Since RogueMerge trains the adversarial vector to survive parameter perturbation, defenses like pruning and clipping fail by design. A more promising direction is merge algorithms that inject randomness the attacker cannot simulate, such as stochastic layer dropping or randomized weight interpolation that changes per merge run. This would break the meta-learning loop RogueMerge depends on, because the attacker’s training objective could no longer approximate the actual merge operation. The tradeoff is that non-deterministic merges sacrifice the reproducibility practitioners rely on for quality assurance and benchmarking.
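Randomized weight interpolation could be sketched as follows. This is an illustration of the idea only, with assumed function names and an assumed jitter range. Whether randomization of this kind actually defeats RogueMerge has not been tested here.

```python
import random
from typing import List, Tuple


def randomized_coefficients(n: int, rng: random.Random, jitter: float = 0.5) -> List[float]:
    """Draw per-run merge coefficients around uniform weighting, normalized to sum to 1."""
    raw = [1.0 + rng.uniform(-jitter, jitter) for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def randomized_merge(base: List[float], vectors: List[List[float]],
                     rng: random.Random) -> Tuple[List[float], List[float]]:
    """Linear merge whose coefficients are redrawn on every call."""
    coeffs = randomized_coefficients(len(vectors), rng)
    merged = [b + sum(c * v[i] for c, v in zip(coeffs, vectors)) for i, b in enumerate(base)]
    return merged, coeffs


base = [0.0, 0.0]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Seeded generators keep the demo repeatable; an actual deployment would draw from
# random.SystemRandom() so the coefficients cannot be reproduced from a guessed seed.
run_a, coeffs_a = randomized_merge(base, vectors, random.Random(1))
run_b, coeffs_b = randomized_merge(base, vectors, random.Random(2))
print(coeffs_a)
print(coeffs_b)
```

The two runs produce different merged weights from identical inputs, which illustrates the reproducibility cost mentioned above.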