#inference
9 articles exploring inference. Expert insights and analysis from our editorial team.
Articles
K-Token Merging Compresses Sequences in Latent Space, Lowering KV Cache Floors for 24GB and 48GB Cards
K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache by 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.
DuQuant++ Brings Fine-Grained Rotation to FP4: What Microscaling Quantization Means for Running Larger Models on the Same GPU
DuQuant++ adapts outlier-aware rotation to MXFP4, halving online rotation cost on LLaMA 3 and shifting the FP4 deployment bottleneck from memory to calibration engineering.
Google's TPU 8i Targets Agentic Workloads. What CrewAI, LangGraph, and AutoGen Must Measure
Google's TPU 8i adds SRAM and a collectives engine for agentic workloads, yet CrewAI, LangGraph, and AutoGen lack the per-step latency and branch-utilization metrics needed to exploit it.
LACE Forces vLLM and SGLang to Rethink How Parallel Reasoning Threads Run
LACE lets parallel reasoning threads share state mid-inference, yielding 3-7 point accuracy gains but forcing vLLM and SGLang to abandon independent-sequence batching.
Qwen3.6-27B's Dense Architecture Challenges the MoE-Only Playbook for Flagship-Class Coding Models
Alibaba's dense Qwen3.6-27B outperforms its MoE sibling on coding benchmarks, offering predictable inference latency in exchange for a larger memory footprint than sparse alternatives.
vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe
vLLM's v0.19 block-level preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but both require experimental flags and leave open questions unresolved.
Self-Correction Comes to Diffusion Models: What SOAR Means for Iterative Image Generation Pipelines
Tencent's SOAR replaces SFT post-training in diffusion models, yielding an 11% GenEval lift on SD3.5-M — no reward model, no preference labels required.
IonRouter (YC W26): The Custom NVIDIA GH200 Runtime Targeting the LLM Inference Cost Crisis
IonRouter (YC W26) built IonAttention, a custom GH200 inference runtime claiming 50% cost cuts and 2x VLM throughput. Here's what the technology actually does.
Executing Programs Inside Transformers: The Inference Breakthrough Nobody Expected
A new architecture from Percepta embeds a program interpreter directly into transformer weights, achieving logarithmic-time execution lookups that could reshape how AI agents handle deterministic computation—if the early claims survive scrutiny.