Do Foundation Models Actually Learn Relational Structure In-Context?

Relational foundation models promise to learn across joined database tables without per-schema training. OpenRFM, submitted June 3, 2026, provides the first mechanistic account of why that promise collapses in practice and introduces a dual-stage architecture that surpasses the commercial KumoRFMv1 baseline. The answer to whether relational in-context learning transfers is: not with the architectures people have been using, and not without understanding what the pre-training data teaches the model.

What relational in-context learning actually does

The Relational Transformer (RT), the dominant parametric architecture for relational foundation models, does not perform zero-shot inference in the way the term suggests. According to OpenRFM’s mechanistic analysis, RT performs relation-level ICL: it gathers label-carrying cells along a breadth-first search walk across the joined tables, and these labeled cells form the support set of an implicit kernel regression. The model predicts the target cell’s label by computing a similarity-weighted aggregate over that support set.
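The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the RT implementation: it assumes a toy join graph stored as an adjacency dict, hand-written feature vectors, and a Gaussian similarity (a Nadaraya-Watson estimator) standing in for whatever similarity the model's attention actually learns. The names `bfs_labeled_support` and `kernel_predict`, and the toy tables, are hypothetical.

```python
from collections import deque
import math

def bfs_labeled_support(graph, labels, start, max_hops):
    """Collect labeled cells reachable from `start` within `max_hops` joins."""
    seen = {start}
    queue = deque([(start, 0)])
    support = []
    while queue:
        node, depth = queue.popleft()
        if node != start and node in labels:
            support.append(node)
        if depth < max_hops:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return support

def kernel_predict(features, labels, target, support, bandwidth=1.0):
    """Similarity-weighted mean of the support labels (Gaussian kernel assumed)."""
    if not support:
        return None  # no labeled cells reached: the estimate is undefined
    weights = []
    for cell in support:
        sq_dist = sum((a - b) ** 2 for a, b in zip(features[target], features[cell]))
        weights.append(math.exp(-sq_dist / (2 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        return None  # every support cell is too dissimilar to contribute
    return sum(w * labels[c] for w, c in zip(weights, support)) / total

# Hypothetical toy schema: three orders joined to one customer.
graph = {
    "order_1": ["customer_A"],
    "order_2": ["customer_A"],
    "order_3": ["customer_A"],
    "customer_A": ["order_1", "order_2", "order_3"],
}
features = {
    "order_1": [1.0, 0.0],
    "order_2": [1.0, 0.0],
    "order_3": [0.0, 5.0],
    "customer_A": [0.0, 0.0],
}
labels = {"order_2": 1.0, "order_3": 0.0}  # order_1 is the unlabeled target

support = bfs_labeled_support(graph, labels, "order_1", max_hops=2)
prediction = kernel_predict(features, labels, "order_1", support)

# With a one-hop walk, no labeled cell is reached and the estimate is undefined.
sparse_support = bfs_labeled_support(graph, labels, "order_1", max_hops=1)
sparse_prediction = kernel_predict(features, labels, "order_1", sparse_support)
```

In this toy case the two-hop walk reaches both labeled orders, and the prediction is dominated by `order_2`, whose features match the target. The one-hop walk reaches only the unlabeled customer row, so the support set is empty.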

This works when label coverage is dense. When the BFS walk encounters few labeled cells across the relevant joins, the regression is underdetermined: with too few support points, the similarity weights are poorly conditioned and the prediction collapses. The authors identify this collapse as the primary failure mode for RT-based models on real-world tasks, where sparse label coverage across foreign-key joins is the norm (OpenRFM, §3).

Synthetic pre-training induces a lazy-kernel regime

Whether the pre-training distribution matters turns out to be the central question. When OpenRFM’s authors pre-train the RT backbone exclusively on a synthetic relational dataset, the model enters what they call a lazy-kernel regime: its representations become nearly fixed, and it converges to a static kernel method that cannot adapt its features to downstream task structure (OpenRFM ablation section).

Pre-training on in-distribution real databases triggers a feature-learning regime instead, where the model adjusts its internal representations based on relational structure. The differentiating factor is whether the label-generation process exhibits what the authors formalize as relational homophily: cells connected via the same relational path tend to share latent features the model can exploit. Synthetic data that lacks this property produces a model that memorizes surface-level patterns from the pre-training distribution without learning transferable relational reasoning.
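One rough way to check whether a dataset has this property is to measure how often cells linked by the same relational path agree. The sketch below is an assumption-laden proxy, not the paper's formalization: it borrows the edge-homophily ratio from graph learning and uses observed labels as a stand-in for latent features. The function name `path_homophily` and the toy pairs are hypothetical.

```python
def path_homophily(pairs, labels):
    """Fraction of path-connected, fully labeled pairs whose labels agree.

    `pairs` lists (cell_a, cell_b) tuples linked by the same relational path.
    Returns None when no pair has both labels, since the ratio is undefined.
    """
    labeled = [(a, b) for a, b in pairs if a in labels and b in labels]
    if not labeled:
        return None
    agree = sum(1 for a, b in labeled if labels[a] == labels[b])
    return agree / len(labeled)

# Hypothetical example: orders sharing a customer are treated as path-connected.
pairs = [("o1", "o2"), ("o2", "o3"), ("o1", "o3"), ("o4", "o5")]
labels = {"o1": 1, "o2": 1, "o3": 0, "o4": 0, "o5": 0}

ratio = path_homophily(pairs, labels)
```

A ratio near the chance level for the label distribution would suggest the relational path carries little exploitable signal, which is the situation the authors associate with synthetic data that lacks homophily.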

Dual-stage ICL: why one channel is not enough

OpenRFM’s architectural contribution is a second ICL channel. The RT backbone handles relation-level ICL, walking across tables. A batch-level ICL layer, initialized from a pre-trained tabular foundation model, handles instance-level ICL within a single table. The two stages compose: the relation-level stage produces representations that feed into the batch-level stage, giving the model both cross-table and within-table context.

When the relation-level walk fails due to sparse label coverage, the batch-level layer falls back on within-table patterns. When the relation-level walk succeeds, the batch-level layer has richer input to work with. Combined with the pre-training changes below, this dual-stage design improves average task performance by approximately 30% over the RT backbone alone, according to the OpenRFM abstract.
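The composition and the fallback behavior can be illustrated with a toy two-stage predictor. This is a sketch under strong assumptions, not OpenRFM's architecture: the relation-level stage is reduced to a mean of the labels gathered across joins, the batch-level stage is a Gaussian kernel vote over rows of the same table, and the rule for ignoring a missing relation-level summary is hand-written rather than learned. All names and numbers are hypothetical.

```python
import math

def relation_stage(support_labels):
    """Stage 1 (toy): summarize labels gathered across joins, or None if empty."""
    if not support_labels:
        return None
    return sum(support_labels) / len(support_labels)

def batch_stage(target, context, bandwidth=1.0):
    """Stage 2 (toy): kernel-weighted vote over rows of the same table.

    `target` is (features, relation_summary or None).
    `context` is a list of (features, relation_summary or None, label).
    The relation summary enters the distance only when both rows have one,
    so a failed cross-table walk leaves the within-table signal intact.
    """
    if not context:
        return None
    target_feats, target_rel = target
    weights, labels = [], []
    for feats, rel, label in context:
        sq_dist = sum((a - b) ** 2 for a, b in zip(target_feats, feats))
        if target_rel is not None and rel is not None:
            sq_dist += (target_rel - rel) ** 2
        weights.append(math.exp(-sq_dist / (2 * bandwidth ** 2)))
        labels.append(label)
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, labels)) / total

# Two labeled rows in the target table, each with a relation-level summary.
context = [
    ([1.0], relation_stage([1.0, 1.0]), 1.0),
    ([0.0], relation_stage([0.0]), 0.0),
]

# Sparse case: the cross-table walk found no labels for the target row.
sparse_pred = batch_stage(([1.0], relation_stage([])), context)

# Dense case: the walk found a label that agrees with the first context row.
dense_pred = batch_stage(([1.0], relation_stage([1.0])), context)
```

In the sparse case the prediction still leans toward the row with matching within-table features. In the dense case the relation-level summary sharpens that same prediction, which mirrors the two behaviors described above.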

This is the kind of fix that sounds obvious in retrospect but requires the mechanistic diagnosis to motivate. Without understanding that RT’s failure mode is an underdetermined kernel regression, adding a batch-level layer would look like an ad hoc ensemble rather than a targeted correction.

OpenRFM’s pre-training recipe

The pre-training changes are twofold. First, OpenRFM replaces synthetic-only pre-training with a synthetic-plus-continual-real-data mixture. The synthetic component provides volume and structural diversity; the real-data component provides the relational homophily signal that prevents the lazy-kernel collapse.

Second, the authors introduce prototype-based regularization during pre-training, which encourages the model to learn cluster-aware representations aligned with the relational structure of the training databases (OpenRFM full paper) rather than collapsing all variation into a single fixed kernel.
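The exact loss is not given here, so the following is only a guess at the general shape of such a term: a k-means-style penalty that measures how far each embedding sits from its nearest prototype. The function name `prototype_penalty`, the squared Euclidean distance, and the toy vectors are all assumptions.

```python
def prototype_penalty(embeddings, prototypes):
    """Mean squared distance from each embedding to its nearest prototype.

    Assumed form only: a k-means-style term, not the loss used in the paper.
    """
    total = 0.0
    for emb in embeddings:
        total += min(
            sum((a - b) ** 2 for a, b in zip(emb, proto)) for proto in prototypes
        )
    return total / len(embeddings)

prototypes = [[0.0, 0.0], [4.0, 4.0]]

# Embeddings that sit close to distinct prototypes incur a small penalty.
clustered = [[0.0, 1.0], [4.0, 3.0]]
# Embeddings collapsed to a single point between prototypes incur a larger one.
collapsed = [[2.0, 2.0], [2.0, 2.0]]

clustered_penalty = prototype_penalty(clustered, prototypes)
collapsed_penalty = prototype_penalty(collapsed, prototypes)
```

Added to the pre-training objective with some weight, a term like this would reward representations that separate into clusters instead of collapsing to a single point.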

The combination allows OpenRFM to surpass KumoRFMv1, a commercial relational foundation model, on a large set of evaluation tasks. The paper does not specify the exact number of tasks, the margin over KumoRFMv1, or the full evaluation protocol, so the comparison should be read as directional rather than definitive.

Three paths, one bottleneck

OpenRFM is not the only recent attempt to solve the relational foundation model problem. RDB-PFN (arXiv 2603.03805), an ICML submission, takes a different route: it linearizes the relational database via depth-first search, then feeds the linearized representation into a vanilla Transformer trained on over 2 million synthetic tasks using a Relational Prior Generator. RDB-PFN outperforms GBDT baselines and single-table foundation models on 19 real-world tasks while using a lightweight architecture and fast inference (RDB-PFN abstract).
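To make the linearization step concrete, here is a generic depth-first flattening of a small parent-child schema into a token sequence. It is a sketch only: RDB-PFN's actual tokenization and ordering rules are not described above, and the dict-based database layout, the `children` mapping, the tag-style tokens, and the table names are all hypothetical.

```python
def dfs_linearize(db, children, table, row_id, max_depth=2):
    """Flatten one row and its foreign-key descendants into a token list.

    `db` maps table -> {row_id -> {column -> value}}.
    `children` maps a parent table -> [(child_table, fk_column)], meaning
    rows of child_table point at the parent through fk_column.
    """
    tokens = []

    def visit(tbl, rid, depth):
        tokens.append(f"<{tbl}>")
        for col, val in db[tbl][rid].items():
            tokens.append(f"{col}={val}")
        if depth < max_depth:
            for child_tbl, fk_col in children.get(tbl, []):
                for child_id, child_row in db[child_tbl].items():
                    if child_row.get(fk_col) == rid:
                        visit(child_tbl, child_id, depth + 1)
        tokens.append(f"</{tbl}>")

    visit(table, row_id, 0)
    return tokens

# Hypothetical two-table database.
db = {
    "customers": {1: {"region": "EU"}},
    "orders": {
        10: {"customer_id": 1, "total": 30},
        11: {"customer_id": 1, "total": 5},
        12: {"customer_id": 2, "total": 9},
    },
}
children = {"customers": [("orders", "customer_id")]}

sequence = dfs_linearize(db, children, "customers", 1)
```

The resulting sequence nests each order inside its customer, so the join structure survives only as token order and bracketing, which is the kind of input a vanilla Transformer can consume without relational attention.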

The tension between these two approaches is real. OpenRFM argues that a dual-stage architecture, paired with homophily-aware pre-training, is necessary to overcome the lazy-kernel regime. RDB-PFN achieves competitive results with a simpler architecture by investing heavily in the synthetic-data prior: if the prior is well-designed, a vanilla Transformer can perform relational ICL without explicit relational attention. Taken together, the two results suggest the bottleneck is more data-centric than architectural, since both papers, despite their different architectures, tie success to what the pre-training data encodes.

The broader landscape includes Griffin, which unifies the data encoder and task decoder and pre-trains on both single-table and relational datasets (ICML 2025), and the baseline RT, which relies on limited open-source real data and does not achieve universal generalization without fine-tuning (RDB-PFN, §1). Each represents a different tradeoff between generality and per-task adaptation cost.

Model	Architecture	Pre-training data	Per-task adaptation
OpenRFM	Dual-stage ICL (RT + tabular FM)	Synthetic + continual real	None reported
RDB-PFN	DFS-linearized + vanilla Transformer	Synthetic (2M+ tasks)	None
KumoRFMv1	Proprietary	Proprietary	Unknown
Griffin	Unified encoder-decoder	Single-table + RDB datasets	Fine-tuning may be needed
RT (baseline)	Relational Transformer	Real or synthetic	Fine-tuning for generalization

When to use an RFM

The practitioner takeaway depends on the shape of your data. If your relational schema has dense label coverage across foreign-key joins, a single-channel RT backbone may suffice; the implicit kernel regression has enough support points to be well-conditioned. If label coverage is sparse, which is the common case in production databases, you need either a dual-stage architecture like OpenRFM’s or a batch-level fallback.

The build-vs-buy calculus has shifted as of June 2026. OpenRFM reports surpassing KumoRFMv1 on a large set of evaluation tasks, which, if the result holds under a fully specified protocol, narrows the gap between open and commercial RFMs. For teams that cannot share database schemas with a third-party API, an open model pre-trained on internal data is now a viable path.

The core bottleneck remains data, not architecture. High-quality relational databases are private, structurally heterogeneous, and scarce. No amount of architectural innovation compensates for a pre-training distribution that lacks the relational structure the model needs to learn. Both OpenRFM and RDB-PFN acknowledge this constraint and invest heavily in synthetic data generation. Whether that synthetic data encodes relational homophily appears to matter more than the choice of attention mechanism.

For teams currently using GBDTs or GNNs on tabular data, the relevant question is whether the upfront cost of relational foundation model setup (schema prompting, pre-training data curation, architectural selection) is lower than the ongoing cost of feature engineering and per-task model training. That depends on how many distinct relational prediction tasks run on the same schema, and how sparse labels are across tables. One task per schema: stick with a GBDT. A dozen tasks with shared relational structure: the RFM math starts to work.
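The trade-off reduces to simple break-even arithmetic once the costs are estimated. The sketch below assumes a linear cost model (a one-time setup cost plus a fixed cost per task), and every number in it is a placeholder in engineer-days, not a measurement from either paper.

```python
import math

def total_cost(setup, per_task, n_tasks):
    """Linear cost model: one-time setup plus a fixed cost per task."""
    return setup + per_task * n_tasks

def break_even_tasks(rfm_setup, rfm_per_task, gbdt_setup, gbdt_per_task):
    """Smallest task count at which the RFM's total cost is no higher.

    Returns None if the RFM never catches up under this model.
    """
    saving_per_task = gbdt_per_task - rfm_per_task
    if saving_per_task <= 0:
        return None
    extra_setup = rfm_setup - gbdt_setup
    if extra_setup <= 0:
        return 1
    return math.ceil(extra_setup / saving_per_task)

# Placeholder costs in engineer-days (hypothetical, for illustration only).
n_star = break_even_tasks(rfm_setup=40, rfm_per_task=1, gbdt_setup=0, gbdt_per_task=5)
rfm_at_n = total_cost(40, 1, n_star)
gbdt_at_n = total_cost(0, 5, n_star)
```

With these placeholder figures the break-even lands at ten tasks on the same schema. Different cost estimates will move that threshold, so the useful output is the formula, not the number.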

Frequently Asked Questions

Does the lazy-kernel diagnosis apply to GNN-based relational models?

The kernel regression framing OpenRFM uses to diagnose lazy-versus-feature-learning regimes is specific to the Relational Transformer’s cell-level attention aggregation. GNN-based models like Griffin, which unify the encoder and decoder across graph propagation steps, may exhibit different failure modes entirely. The paper does not test whether a graph-convolutional architecture enters a similar lazy-kernel regime under synthetic-only pre-training, so the diagnosis should not be assumed to transfer.

What does RDB-PFN sacrifice by linearizing the relational schema?

RDB-PFN’s depth-first search linearization flattens a multi-table schema into a single sequence, so the vanilla Transformer cannot selectively attend to cells based on their position in the relational graph or weight join paths differently. That it still beats GBDT baselines on 19 tasks with a lightweight architecture suggests the structural prior encoded in the DFS ordering compensates for the lost relational attention. The question the paper leaves open is whether that compensation holds on schemas with deep join chains or heterogeneous key types.

What per-task adaptation does the baseline RT require that OpenRFM removes?

The baseline Relational Transformer needs per-task checkpoint selection even after end-to-end pre-training on real databases: the training run produces multiple checkpoints, and the practitioner must identify which one works best for a given prediction task. OpenRFM’s dual-stage architecture with homophily-aware pre-training is designed to produce a single checkpoint that generalizes across tasks without that manual selection step.

Which benchmark should teams use when comparing RFMs?

OpenRFM and RDB-PFN both evaluate on subsets of RelBench-v1, the standard benchmark suite for relational deep learning, but they draw different task subsets and report different task counts (OpenRFM on an unspecified number, RDB-PFN on 19). Direct numerical comparison between the two is unreliable without re-running both on the same RelBench-v1 split. Teams evaluating RFMs should pin their comparison to a fixed split and report task-level variance, not just averages.
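A small reporting helper shows what that looks like in practice. It is a sketch with made-up inputs: the task names, model names, and scores are hypothetical, and it assumes each model was run with several seeds on each task of a shared, fixed split.

```python
from statistics import mean, stdev

def summarize(results):
    """Map task -> (mean, sample stdev) over per-seed scores."""
    return {
        task: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for task, scores in results.items()
    }

def shared_tasks(results_a, results_b):
    """Tasks evaluated for both models; only these are comparable."""
    return sorted(set(results_a) & set(results_b))

# Hypothetical per-seed scores on a pinned split.
model_a = {"task_x": [0.5, 0.75, 1.0], "task_y": [0.25, 0.25, 0.25]}
model_b = {"task_x": [0.75, 0.75, 0.75], "task_z": [0.5, 0.5]}

common = shared_tasks(model_a, model_b)
summary_a = summarize({t: model_a[t] for t in common})
summary_b = summarize({t: model_b[t] for t in common})
```

Here both models average the same score on the one shared task, but one of them varies widely across seeds. An average over each model's own task list would hide both the mismatch in tasks and the difference in stability.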