Topic
#long-context
3 articles exploring long-context. Expert insights and analysis from our editorial team.
Articles
Infrastructure & Runtime
K-Token Merging Compresses Sequences in Latent Space — Lowering the KV Cache Floor for Long-Context Serving on 24GB and 48GB Cards
K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache by 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.
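The core idea behind merging tokens before attention can be sketched in a few lines. This is a toy illustration, not the article's actual method: it assumes simple mean-pooling of every k adjacent hidden states, which is the crudest form of latent-space compression. With k=4 the sequence, and hence the KV cache built from it, shrinks by 75%.

```python
import numpy as np

def merge_k_tokens(hidden, k=4):
    """Toy sketch: mean-pool every k adjacent hidden states, shrinking the
    sequence (and the KV cache computed from it) by a factor of k."""
    seq_len, d = hidden.shape
    pad = (-seq_len) % k                      # pad so seq_len divides by k
    if pad:
        hidden = np.vstack([hidden, np.zeros((pad, d))])
    return hidden.reshape(-1, k, d).mean(axis=1)

h = np.random.randn(1000, 64)                 # 1000 tokens, hidden dim 64
m = merge_k_tokens(h, k=4)
print(h.shape, "->", m.shape)                 # (1000, 64) -> (250, 64)
```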
Models & Research
Sessa Breaks the Mamba-or-Transformer Binary: Distance-Invariant Retrieval Forces a Rethink of Long-Context Architecture Choices
Sessa embeds attention inside a recurrent loop, outperforming Transformer and Mamba on long-context tasks. The interaction topology matters more than the attention-SSM ratio.
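"Attention inside a recurrent loop" can be illustrated with a minimal sketch. The function below is a hypothetical toy, not Sessa's architecture: it assumes chunked processing where each chunk attends over itself plus a single recurrent summary vector of the past, updated with an exponential moving average.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recurrent_attention(chunks, d, decay=0.9):
    """Toy sketch: attention applied per chunk, with a recurrent summary
    state prepended so each chunk can attend to compressed history."""
    state = np.zeros((1, d))                  # recurrent summary of the past
    outputs = []
    for x in chunks:                          # x: (chunk_len, d)
        ctx = np.vstack([state, x])           # history summary + current chunk
        attn = softmax(x @ ctx.T / np.sqrt(d)) @ ctx
        outputs.append(attn)
        # fold this chunk's output into the running summary
        state = decay * state + (1 - decay) * attn.mean(0, keepdims=True)
    return np.vstack(outputs)

chunks = [np.random.randn(10, 16) for _ in range(3)]
y = recurrent_attention(chunks, d=16)
print(y.shape)                                # (30, 16)
```

The point of the sketch is the interaction topology: attention runs inside the recurrence rather than alongside it, so distant context reaches each chunk only through the recurrent state.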
Infrastructure & Runtime
vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe
vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but require experimental flags and carry unresolved caveats.
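Why the bottleneck moves to PCIe is back-of-envelope arithmetic: once KV blocks are swapped to CPU memory, the interconnect sets the pace. The numbers below are illustrative assumptions (a generic 32-layer model with grouped-query attention, fp16 KV, ~25 GB/s effective PCIe 4.0 x16), not vLLM measurements.

```python
# Back-of-envelope: time to move a long-context KV cache over PCIe.
# All model and bandwidth numbers are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128       # hypothetical GQA model
bytes_per_elem = 2                            # fp16
tokens = 128_000                              # long-context prompt

# 2x for the K and V tensors per layer
kv_bytes = tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem
pcie_bps = 25e9                               # ~PCIe 4.0 x16 effective
print(f"KV cache: {kv_bytes / 1e9:.1f} GB, "
      f"swap time: {kv_bytes / pcie_bps * 1e3:.0f} ms")
# KV cache: 16.8 GB, swap time: 671 ms
```

At these assumed sizes a single swap of the full cache costs hundreds of milliseconds, which is why CPU-side caching and block-granular preemption matter more than raw GPU capacity.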