Topic

#inference

9 articles exploring inference. Expert insights and analysis from our editorial team.


Articles

Infrastructure & Runtime

K-Token Merging Compresses Sequences in Latent Space, Lowering KV Cache Floors for 24GB and 48GB Cards

K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.

Models & Research

DuQuant++ Brings Fine-Grained Rotation to FP4: What Microscaling Quantization Means for Running Larger Models on the Same GPU

DuQuant++ adapts outlier-aware rotation to MXFP4, halving online rotation cost on LLaMA 3 and shifting the FP4 deployment bottleneck from memory to calibration engineering.

Agents & Frameworks

Google's TPU 8i Targets Agentic Workloads. What CrewAI, LangGraph, and AutoGen Must Measure

Google's TPU 8i adds SRAM and a collectives engine for agentic workloads, yet CrewAI, LangGraph, and AutoGen lack the per-step latency and branch-utilization metrics needed to exploit it.

Developer Tools

LACE Forces vLLM and SGLang to Rethink How Parallel Reasoning Threads Run

LACE lets parallel reasoning threads share state mid-inference, yielding 3-7 point accuracy gains but forcing vLLM and SGLang to abandon independent-sequence batching.

Models & Research

Qwen3.6-27B's Dense Architecture Challenges the MoE-Only Playbook for Flagship-Class Coding Models

Alibaba's dense Qwen3.6-27B outperforms its MoE sibling on coding benchmarks, trading predictable inference latency for a larger memory footprint than sparse alternatives.

Infrastructure & Runtime

vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe

vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but both require experimental flags and carry unresolved caveats.

Models & Research

Self-Correction Comes to Diffusion Models: What SOAR Means for Iterative Image Generation Pipelines

Tencent's SOAR replaces SFT post-training in diffusion models, yielding an 11% GenEval lift on SD3.5-M with no reward model and no preference labels required.

6 min read
Infrastructure & Runtime

IonRouter (YC W26): The Custom NVIDIA GH200 Runtime Targeting the LLM Inference Cost Crisis

IonRouter (YC W26) built IonAttention, a custom GH200 inference runtime claiming 50% cost cuts and 2x VLM throughput. Here's what the technology actually does.

8 min read
Models & Research

Executing Programs Inside Transformers: The Inference Breakthrough Nobody Expected

A new architecture from Percepta embeds a program interpreter directly into transformer weights, achieving logarithmic-time execution lookups that could reshape how AI agents handle deterministic computation—if the early claims survive scrutiny.

8 min read