#long-context

3 articles exploring long-context. Expert insights and analysis from our editorial team.

Articles

Infrastructure & Runtime

K-Token Merging Compresses Sequences in Latent Space — Lowering the KV Cache Floor for Long-Context Serving on 24GB and 48GB Cards

K-Token Merging compresses prompts in latent space before attention, cutting the prefill KV cache by 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.

Models & Research

Sessa Breaks the Mamba-or-Transformer Binary: Distance-Invariant Retrieval Forces a Rethink of Long-Context Architecture Choices

Sessa embeds attention inside a recurrent loop, outperforming Transformer and Mamba on long-context tasks. The interaction topology matters more than the attention-SSM ratio.

Infrastructure & Runtime

vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe

vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but both require experimental flags and carry unresolved caveats.