Can You Stitch Two Foundation Models Together Without Retraining?

You can, but not the way most practitioners assume. A CVPR 2026 paper on model stitching across heterogeneous vision foundation models finds that naively splicing layers from independently trained models fails badly, and that making it work requires a specific training strategy, Final Feature Matching, rather than wishful reuse. The result has direct implications for anyone building model-merging pipelines or betting that pretrained encoders are interchangeable parts.

What Model Stitching Is and Why It Matters for Foundation Models

Model stitching is exactly what it sounds like: take model A’s early layers, model B’s late layers, insert a trainable “stitch” module at the join point, and see whether the resulting hybrid functions. If it does, you get the benefit of both models’ learned representations without retraining either from scratch. If it doesn’t, you’ve learned something equally important: that the internal representations of those models are not compatible, regardless of how similar their outputs look on a given task.
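The structure described above can be sketched in a few lines. This is a toy illustration, not the paper's code: the "models" are random frozen matrices with a tanh nonlinearity, and the names `make_layers`, `run_layers`, `stitch_at`, and `stitched_forward` are hypothetical. The only point is to show which pieces are frozen and which single piece is trainable.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layers(depth, width, rng):
    """Return frozen weight matrices standing in for pretrained layers (toy assumption)."""
    return [rng.normal(scale=width ** -0.5, size=(width, width)) for _ in range(depth)]

def run_layers(x, layers):
    """Apply each toy layer (linear map followed by tanh) in order."""
    for w in layers:
        x = np.tanh(x @ w)
    return x

# Hypothetical stand-ins: model A has width 8, model B has width 12, both 6 layers deep.
model_a = make_layers(depth=6, width=8, rng=rng)
model_b = make_layers(depth=6, width=12, rng=rng)

stitch_at = 3  # join point: keep A's layers [0, 3) and B's layers [3, 6)

# The stitch is the only trainable piece; here it is a linear map from A's width to B's.
stitch = rng.normal(scale=8 ** -0.5, size=(8, 12))

def stitched_forward(x):
    h = run_layers(x, model_a[:stitch_at])      # frozen front half from model A
    h = h @ stitch                              # trainable stitch module
    return run_layers(h, model_b[stitch_at:])   # frozen back half from model B

x = rng.normal(size=(4, 8))  # batch of 4 toy inputs
out = stitched_forward(x)
print(out.shape)  # (4, 12)
```

With an untrained stitch, as here, the hybrid's outputs are meaningless; whether training the stitch alone can make them useful is exactly the question stitching experiments ask.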

The technique has been around since at least 2021, when Bansal et al. used it to probe whether neural networks trained on the same dataset converge to similar internal features regardless of their initialization or training objective. Their finding was optimistic: same dataset, different training, still stitchable. But that work was limited to small models on a single dataset. The question of whether the same principle holds for modern, independently trained foundation models with different architectures, objectives, and pretraining corpora has remained open until now.

The new study, led by Zheda Mai (Ohio State University) and collaborators at Amazon and Boston University, extends stitching to a regime that matters in practice: heterogeneous Vision Foundation Models (VFMs) including CLIP, DINOv2, and SigLIP 2, trained on different datasets, with different objectives (contrastive vs. self-supervised), and different modality mixes (vision-language vs. pure vision).

The Old Assumption: Same Dataset = Interchangeable Internals

The Bansal et al. result established a convenient intuition: if two models learn from the same data distribution, their intermediate representations become functionally compatible, even if their architectures and training objectives differ. This is the assumption behind a lot of practical model-reuse workflows. If you’re merging adapters, swapping encoders, or building “frankenmodel” pipelines that mix components from different pretrained checkpoints, you are implicitly betting on some version of this representational compatibility.

That bet is riskier than it looks. The new paper distinguishes between two kinds of similarity: representational similarity, measured by statistical tools such as centered kernel alignment (CKA), which quantifies how aligned two sets of activations are, and functional similarity, meaning whether a stitched hybrid actually performs well on a downstream task. High CKA scores between two models do not reliably predict stitching success (§2.1, authors-reported). Practitioners who use CKA or similar metrics to screen models for compatibility in a merging pipeline are relying on an unreliable signal.
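For readers who have not computed CKA directly, here is a minimal sketch of the linear variant, using the standard formula on column-centered activation matrices. The example data is synthetic, and the rotation experiment is an illustration of a general property of CKA, not an experiment from the paper: linear CKA is invariant to orthogonal transformations, so a rotated copy of a representation scores a perfect 1.0 even though a frozen downstream layer fed those rotated features would see scrambled inputs.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 16))              # synthetic activations
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
acts_rotated = acts_a @ q                        # same geometry, different basis
acts_unrelated = rng.normal(size=(256, 16))      # independent activations

cka_rotated = linear_cka(acts_a, acts_rotated)
cka_unrelated = linear_cka(acts_a, acts_unrelated)
print(cka_rotated)    # 1.0 up to floating-point error
print(cka_unrelated)  # well below 1
```

A score of 1.0 here says the two representations share the same geometry, not that one can be dropped in where the other is expected. That distinction is one intuition for why a similarity score and a stitching result can disagree.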

What Happens When You Stitch Heterogeneous VFMs Naively

When the authors tried stitching their VFMs using established methods from the small-model literature, the results were unambiguous: both Layer Feature Matching (aligning intermediate features at the stitch point) and Task-Loss Training (directly optimizing the downstream cross-entropy) struggled. In some configurations the stitched model’s accuracy was substantially worse than either constituent model on its own (authors-reported, §1 and Figure 1).

The failures were concentrated at shallow stitch positions, where the stitch module has to bridge the largest representational gap: between the early layers of one model and the much deeper layers of another. This regime matters in practice because reusing part of someone else’s pretrained encoder usually means joining well before the final output, not at it.

The Fix: Final Feature Matching

The authors’ main contribution is a training strategy they call “Final Feature Matching” (FFM): training the stitch layer with a feature-matching loss at the target model’s penultimate layer, so the stitched model’s output aligns with the target’s representation space (Figure 1(b)). Unlike layer feature matching, which aligns features at the stitch point itself, or direct task-loss optimization, FFM constrains the stitch to produce outputs the target model’s later layers can actually work with.

With FFM, heterogeneous VFMs become reliably stitchable across vision tasks including classification and semantic segmentation (authors-reported, §1). The key insight is that matching features at the target’s penultimate layer works where matching at the stitch point does not: the loss gives the stitch module a well-defined target in the destination representation space.
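The difference between the two feature-matching losses is easier to see side by side. The sketch below is a toy under stated assumptions, not the authors' implementation: the features are synthetic, the target model's frozen tail is a single tanh layer (`w_tail`), the stitch is a plain linear map trained by hand-written gradient descent, and every name is hypothetical. Layer feature matching compares features at the stitch point; final feature matching pushes the stitched features through the frozen tail and compares the result with the target model's own final features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 8

# Toy stand-ins (assumptions): source-model features at the join point, and the
# target model's own features at the same depth.
h_src = rng.normal(size=(n, d))
h_tgt = h_src @ rng.normal(scale=d ** -0.5, size=(d, d)) + 0.05 * rng.normal(size=(n, d))

# Frozen tail of the target model, from the join point to its penultimate layer.
w_tail = rng.normal(scale=d ** -0.5, size=(d, d))

def tail(h):
    return np.tanh(h @ w_tail)

final_tgt = tail(h_tgt)  # the target's own final features: the FFM regression target

def lfm_loss(stitch):
    """Layer feature matching: compare features at the stitch point itself."""
    return np.mean(np.sum((h_src @ stitch - h_tgt) ** 2, axis=1))

def ffm_loss(stitch):
    """Final feature matching: compare features after the frozen tail."""
    return np.mean(np.sum((tail(h_src @ stitch) - final_tgt) ** 2, axis=1))

def ffm_grad(stitch):
    """Gradient of ffm_loss w.r.t. the stitch, backpropagated through the frozen tail."""
    out = tail(h_src @ stitch)
    d_pre = (2.0 / n) * (out - final_tgt) * (1.0 - out ** 2)  # derivative through tanh
    return h_src.T @ d_pre @ w_tail.T

stitch = np.zeros((d, d))
initial_loss = ffm_loss(stitch)
for _ in range(1000):
    stitch -= 0.02 * ffm_grad(stitch)  # only the stitch is updated; the tail stays frozen
final_loss = ffm_loss(stitch)

print(f"FFM loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"LFM loss of the FFM-trained stitch: {lfm_loss(stitch):.4f}")
```

Note that the FFM objective never asks the stitched features to equal the target's features at the join point. It only asks that the target's later layers produce the right output from them, which is a weaker and more directly useful constraint.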

For deep stitch positions, the stitched model can exceed both constituent models’ accuracy (authors-reported, §1).

The practical payoff of stitching that works is the authors’ VFM Stitch Tree (VST) architecture. The idea: share early layers across multiple VFMs while retaining specialized deep layers for each. This lets you serve multiple foundation models from a single deployment with controlled resource overhead.

The authors demonstrate VST on multimodal LLMs that combine multiple VFMs, showing a controllable accuracy-latency tradeoff: sharing more early layers reduces overhead but captures less of the performance gain, while retaining more specialized layers costs more but recovers more of the gain (authors-reported, §1).
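A rough sketch of the tree structure helps make the tradeoff concrete. Everything here is a simplifying assumption rather than the paper's architecture: layers are toy random matrices, every branch gets its own linear stitch, and `build_tree`, `tree_forward`, and `layer_evals` are hypothetical names. The shared trunk runs once per input, and each branch then runs only its own specialized layers.

```python
import numpy as np

rng = np.random.default_rng(0)
width, total_layers = 8, 6  # toy sizes, not taken from the paper

def make_layers(count):
    return [rng.normal(scale=width ** -0.5, size=(width, width)) for _ in range(count)]

def run_layers(x, layers):
    for w in layers:
        x = np.tanh(x @ w)
    return x

def build_tree(shared_depth, branch_names):
    """Shared trunk of `shared_depth` layers, then a stitch and specialized layers per VFM."""
    return {
        "trunk": make_layers(shared_depth),
        "branches": {
            name: {
                "stitch": rng.normal(scale=width ** -0.5, size=(width, width)),
                "layers": make_layers(total_layers - shared_depth),
            }
            for name in branch_names
        },
    }

def tree_forward(tree, x):
    shared = run_layers(x, tree["trunk"])  # computed once, reused by every branch
    return {
        name: run_layers(shared @ branch["stitch"], branch["layers"])
        for name, branch in tree["branches"].items()
    }

def layer_evals(tree):
    """Count layer evaluations per input, ignoring the (cheap) stitch modules."""
    return len(tree["trunk"]) + sum(len(b["layers"]) for b in tree["branches"].values())

tree = build_tree(shared_depth=4, branch_names=["vfm_a", "vfm_b"])
outs = tree_forward(tree, rng.normal(size=(4, width)))
print(layer_evals(tree), "layer evaluations vs", 2 * total_layers, "for two separate encoders")
# 8 layer evaluations vs 12 for two separate encoders
```

Moving `shared_depth` up or down is the knob the tradeoff turns on: a deeper trunk saves more computation but leaves fewer specialized layers per branch.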

What This Means for Model-Merging and Frankenmodel Pipelines

The practical implications cut in two directions.

First, the cautionary finding: naive stitching between independently trained foundation models fails. If your workflow involves grabbing a pretrained encoder from one model family and bolting it onto a decoder or task head from another, expecting it to work because both models were “trained on internet data,” this paper says it probably won’t, at least not without explicit training at the join point.

Concurrent work on adapter composition reinforces this. HiP-LoRA shows that even low-rank LoRA updates can concentrate perturbations along dominant singular directions of pretrained weights, causing catastrophic forgetting and making multi-adapter merges fragile. The common thread: naive model composition, whether by stitching layers or merging adapters, is riskier than the tooling around it suggests.

Second, the constructive finding: with FFM, stitching can work and even produce models that outperform either parent. The engineering cost is nontrivial: you need to train the stitch layer with FFM at every join point, which is not the same as zero-cost reuse. But it’s far cheaper than retraining a foundation model from scratch.

The gap between “CKA says these models look similar” and “these models are functionally stitchable” is the finding that should change how practitioners evaluate representational compatibility. If you’re building tooling that selects models for merging based on CKA or related metrics, the signal you’re acting on may not predict the outcome you care about.

Caveats: Single Preprint, Vision-Only, No Independent Replication

The standard cautions for a single-preprint study apply in full. Every number in this article comes from the authors’ own experiments, and the paper has not been independently replicated as of June 2026. The CVPR 2026 acceptance is a quality signal, but conference review does not equal replication.

The study covers vision foundation models only. Whether FFM works for stitching layers between LLMs, or between vision and language models in a multimodal architecture, is an open question. The representational dynamics of autoregressive transformers trained on text may differ enough from the vision models studied here that the recipe needs modification or fails entirely.

The VST efficiency claims are specific to a limited set of multimodal LLM configurations. The overhead percentages and gain-capture fractions depend on the number of shared vs. specialized layers and the number of VFMs in the tree. Scaling to more models or different architectures may change the tradeoff curve in ways the current experiments don’t cover.

Frequently Asked Questions

Does stitching outperform just adding capacity to one model?

Yes. The authors ran self-stitch controls, inserting the same trainable module into a single VFM with no cross-model component. Those controls consistently underperformed the cross-VFM stitch, ruling out a mere capacity increase and pointing to genuine complementary knowledge fusion between the two parent models (Figure 5).

What should teams check before trying to stitch two models?

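Run a small-scale FFM probe at the intended join point before committing. The paper shows that CKA, the most common representational-similarity metric, does not predict stitching success, so similarity scores are an unreliable pre-filter. Separately, critics of the Anna Karenina hypothesis from the earlier stitching literature (the idea that well-performing networks end up learning similar internal representations) have argued that stitching success reflects representational clustering rather than true semantic alignment. Both points argue for testing the join empirically rather than inferring compatibility from a metric.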

What resource overhead does the VFM Stitch Tree add in practice?

On a LLaVA model combining two VFMs, sharing 22 early layers with 1 specialized layer per VFM adds roughly 4.3% resource overhead and captures 45% of the full two-VFM performance gain. Sharing 14 layers with 9 specialized layers costs about 40% extra but recovers 84% of the gain. These numbers are specific to the tested LLaVA configuration.
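A back-of-envelope check on these figures, under an assumption that is not stated in the source: that overhead can be approximated as the extra layers the second VFM needs, relative to one full encoder whose depth is the shared plus specialized layer count (23 in both configurations). The function name is hypothetical, and the paper may measure overhead differently, for example in latency or memory.

```python
def extra_layer_fraction(shared, specialized):
    """Extra layers needed for a second encoder, as a fraction of one full encoder."""
    depth = shared + specialized  # assumed encoder depth (23 in both reported setups)
    return specialized / depth

for shared, specialized in [(22, 1), (14, 9)]:
    pct = 100 * extra_layer_fraction(shared, specialized)
    print(f"{shared} shared + {specialized} specialized: {pct:.1f}% extra layers")
# 22 shared + 1 specialized: 4.3% extra layers
# 14 shared + 9 specialized: 39.1% extra layers
```

Under that assumption the simple layer count lands close to the reported figures of roughly 4.3% and about 40%, which suggests the overhead scales approximately with the number of unshared layers.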

Can FFM stitch a vision encoder to a language model decoder?

Not tested. Every experiment stitches layers between vision models (CLIP, DINOv2, DINOv3, SigLIP 2). Cross-modal stitching introduces a representational gap the current recipe may not bridge. Autoregressive text transformers have different internal dynamics than contrastive or self-supervised vision encoders, so the FFM loss target would likely need rethinking for that setting.