Groundy — Infrastructure & Runtime

The serving stack, network fabric, and cloud-account substrate beneath production AI, where every throughput claim collides with rebuild windows, egress invoices, and control-plane risk.
https://groundy.com/en-us

The Viral AWS Support Post Is a Warning About Cloud Escalation Paths
https://groundy.com/articles/the-viral-aws-support-post-is-a-warning-about-cloud-escalation-paths/
A viral AWS support post exposed a structural pattern: hyperscalers thinning human escalation paths, raising risk for single-vendor teams reliant on reaching a person.
Fri, 29 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
Tags: aws-support, cloud-escalation, single-vendor-risk, hyperscaler-automation, incident-response, cloud-infrastructure

Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point
https://groundy.com/articles/gemma-4-31b-on-cloud-tpu-vs-gpu-the-serving-cost-crossover-point/
TPU v6e Flex-start delivers 308M tokens per dollar for Gemma 4 31B prefill, undercutting H100 rates for open-weight serving, but production decode costs remain unquantified.
Wed, 27 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-27T00:00:00.000Z
Tags: inference-cost, gemma-4, tpu, gpu-inference, cloud-tpu, open-weight-models

Cloudflare Flagship Is a Feature Flag Service That Deepens Platform Gravity
https://groundy.com/articles/cloudflare-flagship-is-a-feature-flag-service-that-deepens-platform-gravity/
Cloudflare's Flagship is a feature flag service with a native Workers binding that replaces third-party flag providers, consolidating more of the edge stack under one vendor.
Wed, 27 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-27T00:00:00.000Z
Tags: feature-flags, cloudflare-workers, edge-computing, platform-consolidation, vendor-lock-in, openfeature

Why LLMs Still Botch Kubernetes Manifests: The Training-Data Gap
https://groundy.com/articles/why-llms-still-botch-kubernetes-manifests-the-training-data-gap/
A 1.5B-parameter model hits 91.5% on Kubernetes YAML generation, but the remaining failures are syntactically valid manifests that deploy and quietly violate cluster intent.
Wed, 27 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
Tags: kubernetes, llm-code-generation, yaml, fine-tuning, devops, platform-engineering

ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits
https://groundy.com/articles/objectcache-moves-kv-reuse-to-s3-class-storage-why-layerwise-retrieval-beats/
ObjectCache retrieves KV cache per-layer from S3, adding 5.6% TTFT at 64K context but 56-75 ms at 4K. Long-context deployments where DRAM is the bottleneck benefit most.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: kv-cache, vllm, inference-optimization, object-storage, llm-serving, sglang

Vercel's CDN Origin Timeout Jumps to 2 Minutes: A Concession to LLM Streaming Workloads
https://groundy.com/articles/vercels-cdn-origin-timeout-jumps-to-2-minutes-a-concession-to-llm-streaming/
Vercel raised its CDN origin timeout from 30s to 120s to support LLM streaming, removing a constraint that forced teams to route AI traffic through separate infrastructure.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: cdn, llm-streaming, vercel, serverless, edge-computing, sse

Fluid Compute vs PgBouncer: Vercel's Undocumented Bet on Connection Reuse
https://groundy.com/articles/fluid-compute-vs-pgbouncer-vercels-undocumented-bet-on-connection-reuse/
Vercel's Fluid Compute claims to hold Postgres connections open across requests, potentially eliminating PgBouncer, but the claim lacks published technical specs.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: vercel-fluid-compute, connection-pooling, postgresql, pgbouncer, serverless, infrastructure

Vercel Fluid Pools Database Connections Across Invocations, Bypassing External Poolers
https://groundy.com/articles/vercel-fluid-pools-database-connections-across-invocations-bypassing-external/
Fluid Compute reuses Postgres connections across warm invocations via attachDatabasePool, dropping the pooler for simple apps but not for shared-database architectures.
Mon, 25 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: vercel-fluid, connection-pooling, postgres, serverless, pgbouncer, database-infrastructure

Railway's GCP Suspension Is a Reseller PaaS Problem, Not a Google One
https://groundy.com/articles/railways-gcp-suspension-is-a-reseller-paas-problem-not-a-google-one/
Railway's eight-hour outage shows why every reseller PaaS with a single upstream account is one billing flag away from total blackout, and what teams should audit now.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: reseller-paas, cloud-infrastructure, service-discovery, gcp, platform-reliability, incident-management

Vercel CDN Request Collapsing: One Origin Fetch Per ISR Cache Miss
https://groundy.com/articles/vercel-cdn-request-collapsing-one-origin-fetch-per-isr-cache-miss/
Vercel's CDN collapses concurrent ISR cache misses into one function invocation per region, reducing origin load but leaving dynamic routes and external origins exposed.
Mon, 25 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: isr, vercel, cdn, request-collapsing, nextjs, capacity-planning, edge-caching

CISA Admin Leaked AWS GovCloud Keys on GitHub: What Federal Secret Scanning Missed
https://groundy.com/articles/cisa-admin-leaked-aws-govcloud-keys-on-github-what-federal-secret-scanning/
A CISA contractor left live AWS GovCloud admin keys on a public GitHub repo for six months, exposing gaps in federal credential hygiene and FedRAMP boundary monitoring.
Mon, 25 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: aws-govcloud, cisa, credential-leak, fedramp, secret-scanning, github-security

What Cloudflare's Q1 2026 Outage Data Says About Designing for State-Level Shutdowns
https://groundy.com/articles/what-cloudflares-q1-2026-outage-data-says-about-designing-for-state-level/
Three state-ordered shutdowns, drone strikes on AWS data centers, and grid collapses in Q1 2026 prove that multi-region failover cannot survive country-level failure domains.
Sun, 24 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-24T00:00:00.000Z
Tags: internet-shutdowns, disaster-recovery, cloud-infrastructure, cloudflare, bgp-monitoring, multi-region-failover

Railway's May 19 GCP Suspension Exposes the Single-Account Risk Underneath Every Reseller PaaS
https://groundy.com/articles/railways-may-19-gcp-suspension-exposes-the-single-account-risk-underneath-every/
Railway's eight-hour outage shows that multi-cloud data planes mean nothing when the control plane lives on a single provider account that can be suspended without notice.
Sat, 23 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-23T00:00:00.000Z
Tags: gcp, railway, cloud-outage, control-plane, paas, multi-cloud, infrastructure-resilience

vLLM 0.21 Makes Prefill-Decode Disaggregation Actually Practical
https://groundy.com/articles/vllm-v0-21-adds-bi-directional-kv-cache-transfers-between-prefill-and-decode/
vLLM v0.21 reportedly adds bi-directional KV cache transfers between prefill and decode nodes, making P/D ratios dynamic and requiring new NIXL transfer telemetry.
Sat, 23 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-23T00:00:00.000Z
Tags: vllm, kv-cache, disaggregated-serving, nixl, inference-infrastructure, gpu-scheduling

DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar
https://groundy.com/articles/dmax-hits-1-338-tokens-sec-on-2x-h200-parallel-decoding-pushes-dllm-serving/
DMax reformulates diffusion LLM decoding as embedding refinement, achieving 1,338 tok/s on 2× H200 and challenging ParallelBench's parallel-decoding quality trade-off finding.
Tue, 19 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-19T00:00:00.000Z
Tags: diffusion-llm, parallel-decoding, llm-serving, inference, sglang, throughput, llada

Kioxia and Dell's 10 PB in 2RU: What Storage Density Means for Cluster Power and Rebuild Windows
https://groundy.com/articles/kioxia-and-dells-10-pb-in-2ru-what-storage-density-means-for-cluster-power/
Kioxia and Dell packed 9.8 PB into a 2U server. At 245 TB per drive, rebuilds take 14-27 hours, forcing teams to retune erasure coding for production clusters.
Mon, 18 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-18T00:00:00.000Z
Tags: storage-density, nvme, data-center, erasure-coding, flash-storage, dell, ssd

KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode
https://groundy.com/articles/kv-cache-offloading-breaks-on-context-intensive-tasks-text2json-exposes/
ShadowKV-style KV cache offloading methods pass NIAH and RULER but collapse on synthesis tasks. Text2JSON quantifies the gap; YAKV's per-key selection fixes it.
Mon, 18 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-18T00:00:00.000Z
Tags: kv-cache, inference, long-context, llm-benchmarks, kv-offloading, quantization

Crawshaw's 'I Am Building a Cloud': What a Tailscale Co-Founder's Solo Stack Implies for Platform Teams
https://groundy.com/articles/crawshaws-i-am-building-a-cloud-what-a-tailscale-co-founders-solo-stack-implies/
David Crawshaw's exe.dev launched with $35M, giving platform teams a concrete alternative to the Kubernetes default that forces TCO justification for cloud-native overhead.
Wed, 29 Apr 2026 00:00:00 GMT | Groundy Editorial | 2026-04-29T00:00:00.000Z
Tags: infrastructure, kubernetes, cloud-costs, bare-metal, platform-engineering, tailscale, exe-dev

UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency
https://groundy.com/articles/uccl-zip-brings-lossless-compression-to-nccl-collectives-475-faster-rl-weight/
UCCL-Zip fuses lossless compression into NCCL and GPU P2P transfers, cutting RL weight sync by 47.5% and vLLM latency by 10% with no API changes and bit-identical outputs.
Fri, 24 Apr 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: nccl, lossless-compression, gpu-communication, vllm, rl-training, prefill-decode, distributed-inference

Ingress-Nginx Is Dead, Not Deprecated: Final CVE Patches Shipped, But Platform Teams Need a Migration Plan
https://groundy.com/articles/ingress-nginx-is-dead-not-deprecated-the-final-cve-patches-shipped-but-platform/
ingress-nginx was retired March 24, 2026. CVE-2026-4342 patches shipped March 19, but no future fixes are coming. How platform teams should pick a migration path.
Thu, 23 Apr 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
Tags: kubernetes, networking, ingress, gateway-api, security, migration

MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference
https://groundy.com/articles/mlx-vs-llamacpp-on-apple-silicon-which-runtime-to-use-for-local-llm-inference/
MLX delivers 20-87% faster generation on Apple Silicon for models under 14B parameters. llama.cpp wins for cross-platform use and long contexts.
Tue, 24 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
Tags: MLX, llama.cpp, Apple Silicon, on-device inference, macOS, local LLM

Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving at Scale
https://groundy.com/articles/prefill-decode-disaggregation-the-architecture-shift-redefining-llm-serving-at-scale/
Prefill-decode disaggregation separates compute-bound prefill from memory-bound decode onto dedicated hardware, eliminating phase interference.
Tue, 24 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
Tags: LLM serving, prefill, decode disaggregation, inference infrastructure, Mooncake

Google LiteRT: Running LLMs on Your Phone Without the Cloud
https://groundy.com/articles/google-litert-running-llms-your-phone-without/
Google's LiteRT (formerly TensorFlow Lite) is now the production backbone for on-device GenAI across Android, Chrome, and Pixel devices. Here's what it means for developers building AI apps that run privately, without the cloud.
Sun, 15 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-15T00:00:00.000Z
Tags: ai-infrastructure, mobile, edge-ai

Microsoft's BitNet: How 1-Bit LLMs Could Make GPU Farms Obsolete
https://groundy.com/articles/microsoft-s-bitnet-how-1-bit-llms-could-make-gpu-farms/
Microsoft's BitNet inference framework runs billion-parameter LLMs on ordinary CPUs using ternary weights, delivering up to 6x faster inference and 82% lower energy consumption, potentially upending the assumption that AI inference requires expensive GPU hardware.
Fri, 13 Mar 2026 00:00:00 GMT | Groundy Editorial
Tags: ai-infrastructure, models, hardware

OpenRAG: The Open-Source RAG Platform Challenging Pinecone
https://groundy.com/articles/openrag-open-source-rag-platform-challenging/
OpenRAG combines Langflow, OpenSearch, and Docling into a single deployable RAG platform. Here's how it compares to managed services like Pinecone.
Fri, 27 Mar 2026 00:00:00 GMT | Groundy Editorial
Tags: ai-infrastructure, rag, open-source

WebAssembly AI: Running Models in the Browser
https://groundy.com/articles/webassembly-ai-running-models/
WebAssembly enables production-ready AI inference directly in the browser, no server required. Learn how WASM, WebGPU, and modern frameworks make client-side ML practical, what the performance trade-offs actually look like, and when to use it.
Sat, 28 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-18T00:00:00.000Z
Tags: webassembly, browser-ai

DNS-Persist-01 Validation: Let's Encrypt's Model for Permanent ACME Certificate Authorization
https://groundy.com/articles/letsencrypt-dns-persist-01/
DNS-Persist-01 proposes persistent DNS TXT records for ACME certificate validation, removing per-renewal DNS updates as certificate lifetimes shrink toward 47 days by 2029.
Thu, 19 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-25T00:00:00.000Z
Tags: web-platform, security, infrastructure, ssl, dns

Tailscale Peer Relays: The Missing Piece for True P2P Networking
https://groundy.com/articles/tailscale-peer-relays-missing-piece-true-p2p/
Tailscale Peer Relays became generally available on February 18, 2026, enabling high-throughput peer-to-peer relaying within your own infrastructure. This feature eliminates the performance bottleneck of DERP servers when NAT traversal fails, delivering true mesh networking even in restrictive network environments.
Thu, 19 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-11T00:00:00.000Z
Tags: networking, vpn, infrastructure, p2p

Perplexity API: Adding Real-Time Search to Your Apps in Minutes
https://groundy.com/articles/perplexity-api-adding-real-time-search-your-apps/
A comprehensive guide to implementing Perplexity's Search API, featuring pricing, code examples, use cases, and comparisons with alternatives.
Sun, 15 Feb 2026 00:00:00 GMT | Groundy Editorial
Tags: perplexity, api, ai-infrastructure, search, real-time, llm, web-search

The Complete Guide to Local LLMs in 2026
https://groundy.com/articles/the-complete-guide-to-local-llms/
Why running AI on your own hardware is becoming the default choice for privacy-conscious developers and enterprises alike
Thu, 12 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-14T00:00:00.000Z
Tags: local-ai, ollama, llama.cpp, vllm, privacy, self-hosted-ai