infrastructure & runtime
Top in infrastructure & runtime
Running Long-Context Agents on a 4-Bit KV Cache: Where Accuracy Breaks
UltraQuant cuts agent time-to-first-token 3.47x with 4-bit KV caching on AMD CDNA4, but its June 2026 preprint omits the accuracy numbers operators need to ship it.
When LLM-Generated CUDA Kernels Pass Tests but Get the Math Wrong
LLM-written CUDA kernels compile, run, and pass smoke tests while returning wrong numerics, so crash-free execution is not enough to trust AI-generated GPU code.
Running GLM-5.2 at Home: SGLang, vLLM, Transformers, and KTransformers Setup Guide
GLM-5.2 weights are live on HuggingFace under MIT license: 753B MoE, 1M-token context, FP8 and BF16 variants. How to pick a deployment framework and model the hardware cost.
AWS Bedrock Now Requires Data Sharing for Mythos: The Self-Hosting Calculus
AWS Bedrock's provider_data_share gate for Mythos-class models removes the in-AWS data boundary regulated teams bought it for, pushing them toward self-hosted serving.
vLLM Cold Start Latency: Why Scale-to-Zero LLM Serving Stalls
A June 2026 MLSys paper breaks vLLM cold start into six CPU-bound boot phases, showing why scale-to-zero serving forces operators back into warm GPU pools.
The Vercel-AWS Deal Reveals Where AI Inference Runs
Vercel's May 2026 AWS databases integration clarifies where its AI workloads actually run: inference stays behind external APIs while the stateful tier moves to AWS regions.
Running RAG on a Snapdragon NPU: The On-Device Retrieval Tradeoff
End-to-end RAG on the Snapdragon X Elite Hexagon NPU delivers 4x lower latency and 4x less energy than CPU with no quality loss, but soldered memory caps your index size.
GraphRAG vs VectorRAG: Does the Graph Index Earn Its Cost?
A Samsung preprint finds vector retrieval matches GraphRAG on QA tasks at a fraction of the indexing cost, shifting the burden of proof to teams building graph pipelines.
- jun 09 infra MiniMax M3 Ships 1M Context and Desktop Control as Open Weights
- jun 09 infra DeepSeek-V4 FlashMemory: Sparse Attention for Million-Token Context
- jun 08 infra Is Cloudflare's Bot Traffic Surge Real? The Measurement Dispute
- jun 08 infra Huawei's KVarN Puts KV-Cache Quantization Inside vLLM's Backend
- jun 06 infra Indexing Images for RAG: kapa.ai's Approach to Multimodal Retrieval
- jun 05 infra The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It
- jun 05 infra Reading Vercel's Fluid Compute vs Cloudflare Workers Benchmark
- jun 05 infra Does CUDA Tile Match Hand-Tuned Kernels on Hopper and Blackwell?
- jun 05 infra Pod-Level Remote Attestation in Kubernetes: Confidential Workloads on dstack
- jun 04 infra Generating GPU Kernels for Moore Threads Silicon: Can LLMs Break CUDA Lock-In?
- jun 04 infra Microsoft's Azure Linux Goes General-Purpose: The Container Base-Image Play
- jun 04 infra Cloudflare Acquires VoidZero, the Company Behind Vite's Rust Toolchain
- jun 04 infra Putting a Datacenter V100 in a Gaming PC: The Local LLM Math
- jun 03 infra Cost-Aware RAG Routing: When Deeper Retrieval Stops Paying Off
- jun 02 infra Using Your Nvidia GPU's VRAM as Linux Swap: Where the NBD Hack Breaks Down
- may 30 infra Cloudflare Turnstile Now Fingerprints WebGL: The Privacy CAPTCHA Tradeoff
- may 28 infra The Viral AWS Support Post Is a Warning About Cloud Escalation Paths
- may 26 infra Why LLMs Still Botch Kubernetes Manifests: The Training-Data Gap
- may 26 infra Cloudflare Flagship Is a Feature Flag Service That Deepens Platform Gravity
- may 26 infra Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point
- may 25 infra ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits
- may 25 infra Vercel's CDN Origin Timeout Jumps to 2 Minutes: A Concession to LLM Streaming Workloads
- may 25 infra Fluid Compute vs PgBouncer: Vercel's Undocumented Bet on Connection Reuse
- may 25 infra Railway's GCP Suspension Is a Reseller PaaS Problem, Not a Google One
- may 24 infra Vercel Fluid Pools Database Connections Across Invocations, Bypassing External Poolers
- may 24 infra Vercel CDN Request Collapsing: One Origin Fetch Per ISR Cache Miss
- may 24 infra CISA Admin Leaked AWS GovCloud Keys on GitHub: What Federal Secret Scanning Missed
- may 23 infra What Cloudflare's Q1 2026 Outage Data Says About Designing for State-Level Shutdowns
- may 22 infra Railway's May 19 GCP Suspension Exposes the Single-Account Risk Underneath Every Reseller PaaS
- may 22 infra vLLM 0.21 Makes Prefill-Decode Disaggregation Actually Practical
- may 18 infra DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar
- may 17 infra Kioxia and Dell's 10 PB in 2RU: What Storage Density Means for Cluster Power and Rebuild Windows
- may 17 infra KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode
- apr 28 infra Crawshaw's 'I Am Building a Cloud': What a Tailscale Co-Founder's Solo Stack Implies for Platform Teams
- apr 23 infra UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency
- apr 22 infra Ingress-Nginx Is Dead, Not Deprecated: Final CVE Patches Shipped, But Platform Teams Need a Migration Plan
- mar 26 infra OpenRAG: The Open-Source RAG Platform Challenging Pinecone
- mar 23 infra MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference
- mar 23 infra Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving
- mar 14 infra Google LiteRT: Running LLMs on Your Phone Without the Cloud
- mar 12 infra Microsoft's BitNet: How 1-Bit LLMs Could Make GPU Farms Obsolete
- feb 27 infra WebAssembly AI: Running Models in the Browser
- feb 18 infra Tailscale Peer Relays: The Missing Piece for True P2P Networking
- feb 18 infra DNS-Persist-01 Validation: Let's Encrypt's Model for Permanent ACME Certificate Authorization
- feb 11 infra The Complete Guide to Local LLMs
Production AI runs on infrastructure that was never designed for it. Inference serving is a moving target as prefill and decode pull apart onto different hardware, KV caches spill into tiered storage, and collective communication libraries get rewritten to claw back bandwidth. Every benchmark win on synthetic workloads has to survive long-context synthesis, multi-tenant interference, and the unglamorous math of tokens-per-dollar before it counts.
The fabric underneath is just as contested. Vector databases are converging with the OLTP stack, serverless runtimes are quietly absorbing what connection poolers used to own, and overlay networks keep colliding with cloud-provider NAT and egress policy in ways that turn architecture diagrams into invoices. Storage density is outrunning rebuild windows, forcing erasure-coding choices that used to be theoretical. Cheaper-inference research keeps threatening the assumption that scale must mean GPU farms, while denser GPU farms keep proving it.
This beat covers that tension on the merits. We track serving architectures, networking and peering economics, retrieval and caching layers, GPU and storage hardware, and the cloud-account dependencies that underwrite the whole stack. We compare vendor claims against published numbers, flag when a throughput headline hides a quality regression, and pay attention to the boring failure modes that take down platforms more often than the exciting ones do.