Topic
#long-context
3 articles exploring long-context. Expert insights and analysis from our editorial team.
Articles
Infrastructure & Runtime
K-Token Merging Compresses Sequences in Latent Space — Lowering the KV Cache Floor for Long-Context Serving on 24GB and 48GB Cards
K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache by 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.
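The core idea behind merging tokens before attention can be sketched in a few lines. This is a toy illustration, not the article's actual method: it assumes simple mean-pooling of every k adjacent hidden states, which is the crudest form of latent-space compression. With k=4 the sequence, and hence the KV cache built from it, shrinks by 75%.

```python
import numpy as np

def merge_k_tokens(hidden, k=4):
    """Toy sketch: mean-pool every k adjacent hidden states, shrinking the
    sequence (and the KV cache computed from it) by a factor of k."""
    seq_len, d = hidden.shape
    pad = (-seq_len) % k                      # pad so seq_len divides by k
    if pad:
        hidden = np.vstack([hidden, np.zeros((pad, d))])
    return hidden.reshape(-1, k, d).mean(axis=1)

h = np.random.randn(1000, 64)                 # 1000 tokens, hidden dim 64
m = merge_k_tokens(h, k=4)
print(h.shape, "->", m.shape)                 # (1000, 64) -> (250, 64)
```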
Models & Research
Sessa Breaks the Mamba-or-Transformer Binary: Distance-Invariant Retrieval Forces a Rethink of Long-Context Architecture Choices
Sessa embeds attention inside a recurrent loop, outperforming Transformer and Mamba on long-context tasks. The interaction topology matters more than the attention-SSM ratio.
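"Attention inside a recurrent loop" can be illustrated with a minimal sketch. The function below is a hypothetical toy, not Sessa's architecture: it assumes chunked processing where each chunk attends over itself plus a single recurrent summary vector of the past, updated with an exponential moving average.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recurrent_attention(chunks, d, decay=0.9):
    """Toy sketch: attention applied per chunk, with a recurrent summary
    state prepended so each chunk can attend to compressed history."""
    state = np.zeros((1, d))                  # recurrent summary of the past
    outputs = []
    for x in chunks:                          # x: (chunk_len, d)
        ctx = np.vstack([state, x])           # history summary + current chunk
        attn = softmax(x @ ctx.T / np.sqrt(d)) @ ctx
        outputs.append(attn)
        # fold this chunk's output into the running summary
        state = decay * state + (1 - decay) * attn.mean(0, keepdims=True)
    return np.vstack(outputs)

chunks = [np.random.randn(10, 16) for _ in range(3)]
y = recurrent_attention(chunks, d=16)
print(y.shape)                                # (30, 16)
```

The point of the sketch is the interaction topology: attention runs inside the recurrence rather than alongside it, so distant context reaches each chunk only through the recurrent state.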
Infrastructure & Runtime
vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe
vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but require experimental flags and carry unresolved caveats.
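Why the bottleneck moves to PCIe is back-of-envelope arithmetic: once KV blocks are swapped to CPU memory, the interconnect sets the pace. The numbers below are illustrative assumptions (a generic 32-layer model with grouped-query attention, fp16 KV, ~25 GB/s effective PCIe 4.0 x16), not vLLM measurements.

```python
# Back-of-envelope: time to move a long-context KV cache over PCIe.
# All model and bandwidth numbers are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128       # hypothetical GQA model
bytes_per_elem = 2                            # fp16
tokens = 128_000                              # long-context prompt

# 2x for the K and V tensors per layer
kv_bytes = tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem
pcie_bps = 25e9                               # ~PCIe 4.0 x16 effective
print(f"KV cache: {kv_bytes / 1e9:.1f} GB, "
      f"swap time: {kv_bytes / pcie_bps * 1e3:.0f} ms")
# KV cache: 16.8 GB, swap time: 671 ms
```

At these assumed sizes a single swap of the full cache costs hundreds of milliseconds, which is why CPU-side caching and block-granular preemption matter more than raw GPU capacity.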