Sessa places self-attention inside a recurrent feedback loop rather than alongside it, creating multiple attention-based paths through which past tokens influence future states (Sessa: Selective State Space Attention). On synthetic long-context benchmarks, the architecture outperforms both pure Transformer and Mamba baselines, and its memory influence does not decay with token distance under the paper’s theoretical framework (Sessa: Selective State Space Attention). The result challenges the assumption that teams can simply swap attention for state space models to fix long-context degradation.

What Sessa Actually Does (and What It Doesn’t)

Sessa—Selective State Space Attention—is not a hybrid that interleaves attention and state space model layers. It embeds self-attention inside a recurrent feedback path, which means past tokens can influence future states through multiple attention-based routes rather than a single forward pass (Sessa: Selective State Space Attention). Under explicit theoretical assumptions, the authors prove that this topology admits power-law memory tails $O(\ell^{-\beta})$ for $0 < \beta < 1$ with slower decay than equivalent Transformer and Mamba baselines, and is the only model class among those considered that achieves flexible selective retrieval whose influence does not decay with distance (Sessa: Selective State Space Attention).
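To make the topology concrete, here is a minimal numpy sketch of attention placed *inside* a recurrent loop: the attention query is the recurrently transformed state, and the attended output is added back into the state before the next step. All names, shapes, and the additive feedback are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nested_attention_step(h_prev, x_t, memory, Wa, Wh, Wx):
    """One recurrent step with attention inside the loop (schematic).

    h_prev : (d,)   previous hidden state
    x_t    : (d,)   current input embedding
    memory : (t, d) stack of past hidden states
    """
    # 1. Recurrent update: the state is transformed before attention sees it.
    h_tilde = np.tanh(Wh @ h_prev + Wx @ x_t)
    # 2. Attention over past states, queried by the *recurrent* state.
    q = Wa @ h_tilde
    scores = memory @ q / np.sqrt(len(q))      # (t,)
    attended = softmax(scores) @ memory        # (d,)
    # 3. The attended output feeds back into the recurrent path,
    #    so it shapes every future step through h.
    return h_tilde + attended

d = 8
rng = np.random.default_rng(0)
Wa, Wh, Wx = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
h, memory = np.zeros(d), []
for _ in range(5):
    memory.append(h.copy())
    h = nested_attention_step(h, rng.standard_normal(d), np.stack(memory), Wa, Wh, Wx)
```

The key structural point is step 3: because `attended` re-enters the state, a past token can reach a future step both through the recurrence and through attention edges taken at intermediate steps, which is the "multiple attention-based routes" the paper describes.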

What it does not do—at least not yet—is demonstrate these properties at scale. The paper evaluates models of 125M and 350M parameters on synthetic tasks and a single language modeling dataset (Sessa: Selective State Space Attention). No standard downstream benchmarks such as MMLU or HellaSwag appear, and the theoretical guarantees hold under explicit assumptions that may or may not translate to real-world data (Sessa: Selective State Space Attention).

The Benchmark Gap: Where Sessa Pulls Ahead

On SymbolSoup, a sequence classification task with length 2048, Sessa scores 0.89 accuracy against 0.52 for the Transformer baseline and 0.61 for Mamba (Sessa: Selective State Space Attention). The gap is even wider on Diffuse MQAR, a multi-query associative recall task at the same length: Sessa reaches 0.78 accuracy, while the Transformer manages 0.31 and Mamba 0.45 (Sessa: Selective State Space Attention).

The language modeling results on SimpleStories are narrower but consistent. At 125M parameters, Sessa achieves 17.6 perplexity versus 18.2 for the Transformer and 17.9 for Mamba; at 350M parameters, the spread is 14.3 versus 14.8 and 14.5 respectively (Sessa: Selective State Space Attention). The more telling number is length generalization: when trained on 2048-token sequences and evaluated on 8192-token sequences, Sessa records 16.2 perplexity against 23.4 for the Transformer and 19.7 for Mamba (Sessa: Selective State Space Attention).

| Model | SymbolSoup (2048) | Diffuse MQAR (2048) | SimpleStories PPL (125M) | SimpleStories PPL (8K eval) |
|---|---|---|---|---|
| Transformer | 0.52 | 0.31 | 18.2 | 23.4 |
| Mamba | 0.61 | 0.45 | 17.9 | 19.7 |
| Sessa | 0.89 | 0.78 | 17.6 | 16.2 |

These gaps suggest that Sessa’s topology confers a qualitative advantage on tasks requiring retrieval across long distances, even when the absolute perplexity differences at matched length are modest.

Inside the Loop vs. Alongside It: The Architectural Insight

The prevailing intuition in long-context architecture design has been a slider: more attention here, more SSM there. Mamba eliminates attention entirely in favor of selective state space models with input-dependent parameters, trading the ability to attend globally for recurrent compression (Mamba: Linear-Time Sequence Modeling with Selective State Spaces). Jamba, by contrast, stacks Transformer blocks and Mamba blocks next to each other, achieving strong results up to 256K tokens but keeping the two mechanisms in parallel subsystems (Jamba: A Hybrid Transformer-Mamba Language Model).

Sessa’s design treats this as a false trade. By placing attention inside the recurrent feedback loop, the architecture creates a fundamentally different interaction topology: attention operates on states that have already been transformed by recurrence, and those attended outputs feed back into the recurrent path (Sessa: Selective State Space Attention). The authors argue that this interaction pattern matters more than the ratio of attention to SSM components (Sessa: Selective State Space Attention). If that claim holds, the implication is that simply interleaving Jamba-style layers may leave performance on the table compared to architectures that nest the two mechanisms.
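The distinction can be sketched schematically. In the interleaved form, each mechanism consumes the other's output sequence but neither writes into the other's state; in the nested form, attention reads the growing stack of recurrent states and writes back into the state itself. The functions below are toy scalar stand-ins, not either architecture's real layers.

```python
def interleaved(x, ssm_layer, attn_layer):
    # Jamba-style stacking: attention sees the SSM's output sequence,
    # but the two mechanisms share no recurrent state.
    return attn_layer(ssm_layer(x))

def nested(x, step, attend):
    # Sessa-style nesting: per-step recurrence, with attention over
    # past states whose output re-enters the recurrent state.
    h, states = 0.0, []
    for x_t in x:
        h = step(h, x_t)
        states.append(h)
        h = h + attend(h, states)   # feedback into the loop
    return h

# Toy stand-ins: a leaky-integrator "SSM" and a mean-pooling "attention".
out_i = interleaved([1.0, 2.0, 3.0],
                    ssm_layer=lambda xs: [0.5 * v for v in xs],
                    attn_layer=sum)
out_n = nested([1.0, 2.0, 3.0],
               step=lambda h, x: 0.5 * h + x,
               attend=lambda q, s: 0.1 * sum(s) / len(s))
```

In `interleaved`, swapping layer order changes which mechanism sees raw inputs; in `nested`, the feedback line inside the loop means the composition is not expressible as any stacking of the two blocks.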

How Existing Hybrids Compare

Mamba’s selective state space approach achieves 5× higher inference throughput than Transformers and scales linearly in sequence length by compressing history into a fixed-size hidden state (Mamba: Linear-Time Sequence Modeling with Selective State Spaces). That compression is also its limitation: information must be forced through the state update gate, and retrieval quality degrades with distance for tokens that were not selectively retained.
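The bottleneck is easy to see in a one-line gated update (a schematic, not Mamba's exact parameterization): a token admitted with gate `g` is attenuated geometrically by every later update that does not re-select it.

```python
def selective_update(h, x_t, gate):
    """Gated recurrent write into a fixed-size state (schematic).
    In Mamba the gate is input-dependent; here it is passed explicitly."""
    return (1.0 - gate) * h + gate * x_t

# A token written with gate 0.9 survives three later updates with
# gate 0.5 at weight 0.9 * 0.5**3: retrieval decays with distance
# unless the gate keeps protecting it.
h = 0.0
for x_t, g in [(1.0, 0.9), (0.0, 0.5), (0.0, 0.5), (0.0, 0.5)]:
    h = selective_update(h, x_t, g)
print(h)  # 0.1125
```

This geometric attenuation is the compression-versus-retention trade that the paragraph above describes.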

Jamba avoids this by preserving full attention layers, but the attention and SSM components do not exchange information through a shared recurrent state (Jamba: A Hybrid Transformer-Mamba Language Model). Each block processes the other’s output sequentially, not recursively. Sessa’s feedback loop blurs this boundary: the same hidden state carries both recurrent and attended information forward, which the authors identify as the source of the distance-invariant retrieval property (Sessa: Selective State Space Attention).

What to Watch Before Betting on It

The most immediate risk is scale. All reported results come from a single-author paper with models smaller than many production embedding layers (Sessa: Selective State Space Attention). Sessa may retain its advantages at 7B or 70B parameters, or the gaps may compress as model capacity makes up for architectural limitations in the baselines. There is no way to know without larger runs.

The second risk is the theoretical framework itself. The proof of distance-invariant retrieval relies on explicit assumptions about the data distribution and the recurrent transition dynamics (Sessa: Selective State Space Attention). Whether those assumptions hold for natural language—or code, or multimodal sequences—is an open question. If they fail, the memory advantages may attenuate.

The third risk is practical adoption. The official PyTorch implementation is available on GitHub under an Apache 2.0 license with FlashAttention support, but the repository has 2 commits and 2 stars (LibratioAI/sessa GitHub Repository). Neither figure inspires confidence in production readiness or sustained community maintenance.

Frequently Asked Questions

Can Sessa replace Mamba or Jamba in production systems today?

No. Sessa has only been evaluated at 125M and 350M parameters on synthetic tasks and one language modeling dataset. Both Mamba and Jamba have been validated at billion-parameter scale and on standard downstream benchmarks.

What is distance-invariant retrieval and why does it matter?

In most recurrent architectures, a token’s influence on future states weakens as the distance between them grows. Sessa’s feedback-loop topology allows past tokens to affect future states through attention-based paths without this decay, which matters for tasks requiring retrieval across thousands of tokens.
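The contrast can be made concrete with the simplest linear recurrence (a toy calculation, not drawn from the paper): in `h_T = a * h_{T-1} + x_T`, the sensitivity of the state at step `T` to an input at step `t` is `a**(T - t)`, which shrinks geometrically with distance, whereas the weight on an attention edge depends on query-key similarity rather than on `T - t`.

```python
import numpy as np

# Geometric decay of influence in a plain linear recurrence with a = 0.9:
# dh_T / dx_t = a**(T - t).
a = 0.9
distances = np.array([1, 10, 100, 1000])
recurrent_influence = a ** distances
print(recurrent_influence)  # at distance 1000, the influence is ~1.8e-46

# An attention edge from t to T carries softmax(q_T . k_t), which has no
# explicit dependence on T - t, so a path routed through attention need
# not shrink with distance.
```

By the paper's account, it is the attention-based routes inside Sessa's feedback loop that escape this decay.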

Why does Sessa generalize better to longer sequences than its baselines?

Sessa’s attention operates on recurrently transformed states and feeds back into the recurrent path, creating multiple routes for information to persist. Baseline models either compress history through a single state gate (Mamba) or process attention and recurrence as parallel subsystems without shared recurrent state (Jamba).

Should teams currently using Jamba or Mamba reconsider their architecture?

Not on the basis of Sessa alone. The architectural insight—that nesting attention inside recurrence may outperform interleaving—is theoretically motivated but empirically unproven at scale. The safer interpretation is that the hybrid design space is larger than previously explored.

Sources

  1. Sessa: Selective State Space Attention (primary; accessed 2026-04-23)
  2. LibratioAI/sessa GitHub Repository (vendor; accessed 2026-04-23)
  3. Mamba: Linear-Time Sequence Modeling with Selective State Spaces (primary; accessed 2026-04-23)
  4. Jamba: A Hybrid Transformer-Mamba Language Model (primary; accessed 2026-04-23)
