groundy

infrastructure & runtime

53 articles

Top in infrastructure & runtime


  1. jun 09 infra MiniMax M3 Ships 1M Context and Desktop Control as Open Weights
  2. jun 09 infra DeepSeek-V4 FlashMemory: Sparse Attention for Million-Token Context
  3. jun 08 infra Is Cloudflare's Bot Traffic Surge Real? The Measurement Dispute
  4. jun 08 infra Huawei's KVarN Puts KV-Cache Quantization Inside vLLM's Backend
  5. jun 06 infra Indexing Images for RAG: kapa.ai's Approach to Multimodal Retrieval
  6. jun 05 infra The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It
  7. jun 05 infra Reading Vercel's Fluid Compute vs Cloudflare Workers Benchmark
  8. jun 05 infra Does CUDA Tile Match Hand-Tuned Kernels on Hopper and Blackwell?
  9. jun 05 infra Pod-Level Remote Attestation in Kubernetes: Confidential Workloads on dstack
  10. jun 04 infra Generating GPU Kernels for Moore Threads Silicon: Can LLMs Break CUDA Lock-In?
  11. jun 04 infra Microsoft's Azure Linux Goes General-Purpose: The Container Base-Image Play
  12. jun 04 infra Cloudflare Acquires VoidZero, the Company Behind Vite's Rust Toolchain
  13. jun 04 infra Putting a Datacenter V100 in a Gaming PC: The Local LLM Math
  14. jun 03 infra Cost-Aware RAG Routing: When Deeper Retrieval Stops Paying Off
  15. jun 02 infra Using Your Nvidia GPU's VRAM as Linux Swap: Where the NBD Hack Breaks Down
  16. may 30 infra Cloudflare Turnstile Now Fingerprints WebGL: The Privacy CAPTCHA Tradeoff
  17. may 28 infra The Viral AWS Support Post Is a Warning About Cloud Escalation Paths
  18. may 26 infra Why LLMs Still Botch Kubernetes Manifests: The Training-Data Gap
  19. may 26 infra Cloudflare Flagship Is a Feature Flag Service That Deepens Platform Gravity
  20. may 26 infra Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point
  21. may 25 infra ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits
  22. may 25 infra Vercel's CDN Origin Timeout Jumps to 2 Minutes: A Concession to LLM Streaming Workloads
  23. may 25 infra Fluid Compute vs PgBouncer: Vercel's Undocumented Bet on Connection Reuse
  24. may 25 infra Railway's GCP Suspension Is a Reseller PaaS Problem, Not a Google One
  25. may 24 infra Vercel Fluid Pools Database Connections Across Invocations, Bypassing External Poolers
  26. may 24 infra Vercel CDN Request Collapsing: One Origin Fetch Per ISR Cache Miss
  27. may 24 infra CISA Admin Leaked AWS GovCloud Keys on GitHub: What Federal Secret Scanning Missed
  28. may 23 infra What Cloudflare's Q1 2026 Outage Data Says About Designing for State-Level Shutdowns
  29. may 22 infra Railway's May 19 GCP Suspension Exposes the Single-Account Risk Underneath Every Reseller PaaS
  30. may 22 infra vLLM 0.21 Makes Prefill-Decode Disaggregation Actually Practical
  31. may 18 infra DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar
  32. may 17 infra Kioxia and Dell's 10 PB in 2RU: What Storage Density Means for Cluster Power and Rebuild Windows
  33. may 17 infra KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode
  34. apr 28 infra Crawshaw's 'I Am Building a Cloud': What a Tailscale Co-Founder's Solo Stack Implies for Platform Teams
  35. apr 23 infra UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency
  36. apr 22 infra Ingress-Nginx Is Dead, Not Deprecated: Final CVE Patches Shipped, But Platform Teams Need a Migration Plan
  37. mar 26 infra OpenRAG: The Open-Source RAG Platform Challenging Pinecone
  38. mar 23 infra MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference
  39. mar 23 infra Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving
  40. mar 14 infra Google LiteRT: Running LLMs on Your Phone Without the Cloud
  41. mar 12 infra Microsoft's BitNet: How 1-Bit LLMs Could Make GPU Farms Obsolete
  42. feb 27 infra WebAssembly AI: Running Models in the Browser
  43. feb 18 infra Tailscale Peer Relays: The Missing Piece for True P2P Networking
  44. feb 18 infra DNS-Persist-01 Validation: Let's Encrypt's Model for Permanent ACME Certificate Authorization
  45. feb 11 infra The Complete Guide to Local LLMs

Production AI runs on infrastructure that was never designed for it. Inference serving is a moving target: prefill and decode pull apart onto different hardware, KV caches spill into tiered storage, and collective communication libraries get rewritten to claw back bandwidth. Every benchmark win on synthetic workloads has to survive long-context synthesis, multi-tenant interference, and the unglamorous math of tokens per dollar before it counts.

The fabric underneath is just as contested. Vector databases are converging with the OLTP stack, serverless runtimes are quietly absorbing what connection poolers used to own, and overlay networks keep colliding with cloud-provider NAT and egress policy in ways that turn architecture diagrams into invoices. Storage density is outrunning rebuild windows, forcing erasure-coding choices that used to be theoretical. Cheaper-inference research keeps threatening the assumption that scale must mean GPU farms, while denser GPU farms keep proving it.

This beat covers that tension on the merits. We track serving architectures, networking and peering economics, retrieval and caching layers, GPU and storage hardware, and the cloud-account dependencies that underwrite the whole stack. We compare vendor claims against published numbers, flag when a throughput headline hides a quality regression, and pay attention to the boring failure modes that take down platforms more often than the exciting ones do.