Category

Infrastructure & Runtime

37 articles exploring Infrastructure & Runtime. Expert analysis and insights from our editorial team.

Where AI models run determines everything about latency, cost, privacy, and operational risk. This cluster covers the runtime and serving layer: inference optimization, hardware tradeoffs, RAG architectures, vector search at scale, and the growing ecosystem of edge and on-device deployment.

Serving-side architecture has undergone a genuine paradigm shift with prefill-decode disaggregation. The insight is that prefill is compute-bound while decode is memory-bandwidth-bound, so routing the two phases to separate hardware pools eliminates the phase interference that inflates P99 latency. That idea is now being productized by NVIDIA Dynamo, vLLM’s disaggregated serving, and Mooncake, the serving platform behind Moonshot AI’s Kimi. If you’re running inference at volume, this architectural decision is no longer academic.
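A rough way to see why the two phases want different hardware is a back-of-the-envelope roofline check. The sketch below is illustrative only: the `Accelerator` numbers are made-up round figures rather than any real product, and the cost model (about 2 FLOPs per parameter per token, weights read once per forward pass, KV cache traffic ignored) is a simplifying assumption, not how any of the systems named above make routing decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical hardware profile; the numbers used below are illustrative."""
    peak_flops: float      # sustained FLOP/s
    mem_bandwidth: float   # bytes/s of device memory bandwidth

    @property
    def ridge_point(self) -> float:
        # FLOPs per byte above which work is limited by compute, not memory.
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    # Assumed model: ~2 FLOPs per parameter per token, and every weight is
    # read once per forward pass. KV cache and activation traffic are ignored.
    return 2.0 * tokens_per_pass / bytes_per_param


def limiting_resource(tokens_per_pass: int, hw: Accelerator) -> str:
    # Compare FLOPs-per-byte of the pass against the hardware ridge point.
    if arithmetic_intensity(tokens_per_pass) >= hw.ridge_point:
        return "compute"
    return "memory-bandwidth"


# Assumed round numbers: 1 PFLOP/s and 3 TB/s.
hw = Accelerator(peak_flops=1.0e15, mem_bandwidth=3.0e12)

prefill = limiting_resource(tokens_per_pass=2048, hw=hw)  # whole prompt in one pass
decode = limiting_resource(tokens_per_pass=1, hw=hw)      # one new token per pass
print(prefill, decode)
```

Under these assumptions a 2,048-token prefill lands on the compute side of the ridge point, while single-token decode sits far on the memory-bandwidth side and stays there until a few hundred tokens are batched into each pass. Real decode also reads the KV cache on every step, which this sketch leaves out.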

The local and edge stack has consolidated around a shorter list of serious runtimes. MLX performs significantly better than llama.cpp on Apple Silicon for sub-14B models, while llama.cpp remains the right call for cross-platform deployments and long contexts. Google’s LiteRT (successor to TensorFlow Lite) anchors the Android/embedded side. The tradeoffs are measurable; Groundy publishes benchmark results rather than vendor summaries.

RAG production architecture is where theory consistently meets deployment reality. Most RAG projects stall in the gap between a notebook demo and a system that handles document poisoning, retrieval precision degradation under index growth, and embedding drift over time. Groundy covers the failure modes, not just the happy-path architecture diagrams.

Hardware selection is increasingly a first-class decision. Microsoft’s BitNet 1-bit quantization, NVIDIA’s open-source quantum-calibration models, and Alibaba’s ZVEC vector database each represent architectural bets on where the cost curves are heading. This cluster tracks those bets.

Serving infrastructure also intersects directly with security posture. Container deployments inherit every existing container vulnerability alongside a new class of AI-specific threats: model-weight theft, prompt injection through sidecar services, and supply-chain attacks targeting the Python dependencies that wrap inference engines. Infrastructure coverage at Groundy treats the operational and security dimensions as a single problem, not separate lanes.

Latest in Infrastructure & Runtime

01

DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar

DMax reformulates diffusion LLM decoding as embedding refinement, achieving 1,338 tok/s on 2× H200 and challenging ParallelBench's parallel-decoding quality trade-off finding.

02

KV Cache Offloading Breaks on Text2JSON: Why Llama 3 and Qwen 3 Lose Accuracy on Context-Intensive Prompts

Four KV cache offloading methods show accuracy drops on Llama 3 and Qwen 3 in Text2JSON's multi-needle extraction tasks, a gap that TTFT-only benchmark suites don't detect.

03

Kioxia and Dell's 10 PB in 2RU: What Storage Density Means for Cluster Power and Rebuild Windows

Kioxia and Dell packed 9.8 PB into a 2RU server. At 245 TB per drive, rebuilds take 14 to 27 hours, forcing teams to retune erasure coding for production clusters.

04

KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode

ShadowKV-style KV cache offloading methods pass NIAH and RULER but collapse on synthesis tasks. Text2JSON quantifies the gap; YAKV's per-key selection fixes it.

05

Crawshaw's 'I Am Building a Cloud': What a Tailscale Co-Founder's Solo Stack Implies for Platform Teams

David Crawshaw's exe.dev launched with $35M, giving platform teams a concrete alternative to the Kubernetes default, one that forces them to justify the TCO of cloud-native overhead.

06

Azure NAT Gateway Blocks Tailscale Direct Connect; v1.96.2 Fixes Container Relay Scaling for AKS

Azure NAT Gateway's Hard NAT forces Tailscale onto DERP; a public-subnet Peer Relay bypasses it. v1.96.2 fixes container GOMAXPROCS socket scaling for AKS relay instances.

07

K-Token Merging Compresses Sequences in Latent Space, Lowering KV Cache Floors for 24GB and 48GB Cards

K-Token Merging compresses prompts in latent space before attention, cutting prefill KV cache 75% on 0.5B models and extending feasible context on 24GB and 48GB consumer GPUs.

08

KServe + llm-d Claims 57× P90 TTFT. RC1 Ships with a Routing Deadlock and No Migration Guide

Red Hat's KServe + llm-d integration claims 57× P90 TTFT gains against an unoptimized vLLM baseline, but RC1 ships with a known routing deadlock, a prematurely merged WIP.

09

UCCL-Zip Adds Lossless Compression to NCCL Collectives: 47.5% Faster RL Weight Sync, No API Changes

UCCL-Zip fuses lossless compression into NCCL collectives at the kernel level, cutting cross-node wire bytes without accuracy tradeoffs or application changes.

10

UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency

UCCL-Zip fuses lossless compression into NCCL and GPU P2P transfers, cutting RL weight sync by 47.5% and vLLM latency by 10% with no API changes and bit-identical outputs.

11

CoCoDiff Exposes the All-to-All Bottleneck That Caps Distributed Diffusion Transformer Inference Well Below Theoretical GPU Count

Ulysses parallelism caps distributed DiT inference scaling on heterogeneous interconnects. CoCoDiff delivers 3.6x average speedups on Aurora via topology-aware scheduling.

12

Ingress-Nginx Is Dead, Not Deprecated: The Final CVE Patches Shipped, But Platform Teams Still Need a Migration Plan

ingress-nginx was retired March 24, 2026. CVE-2026-4342 patches shipped March 19, but no future fixes are coming. How platform teams should pick a migration path.

13

Tailscale Peer Relays Behind Azure NAT Gateway: Why the DERP Fallback Hides a Throughput Cliff

Azure NAT Gateway silently forces Tailscale into DERP relay fallback, capping throughput. A Peer Relay in a public subnet with a static UDP endpoint restores the direct path.

14

vLLM Block-Level Preemption and FlexKV Shift the Long-Context Bottleneck From GPU Memory to PCIe

vLLM v0.19 block preemption and v0.18 FlexKV shift the long-context bottleneck from GPU memory to PCIe and CPU cache, but require experimental flags and carry unresolved issues.

15

KV Cache Is Becoming a Distributed Infrastructure Layer: What KV Packet and llm-d Mean for Self-Hosted LLM Teams

KV Packet eliminates cross-request recomputation; llm-d brings cache-aware routing to Kubernetes. Here's what both mean for vLLM capacity planning.
