<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Groundy — Infrastructure &amp; Runtime</title><description>The serving stack, network fabric, and cloud-account substrate beneath production AI, where every throughput claim collides with rebuild windows, egress invoices, and control-plane risk.</description><link>https://groundy.com/</link><language>en-us</language><atom:link href="https://groundy.com/category/infrastructure/rss.xml" rel="self" type="application/rss+xml"/><item><title>Running RAG on a Snapdragon NPU: The On-Device Retrieval Tradeoff</title><link>https://groundy.com/articles/running-rag-on-a-snapdragon-npu-the-on-device-retrieval-tradeoff/</link><guid isPermaLink="true">https://groundy.com/articles/running-rag-on-a-snapdragon-npu-the-on-device-retrieval-tradeoff/</guid><description>End-to-end RAG on the Snapdragon X Elite Hexagon NPU delivers 4x lower latency and 4x less energy than CPU with no quality loss, but soldered memory caps your index size.</description><pubDate>Sat, 13 Jun 2026 03:20:52 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>rag</category><category>npu</category><category>snapdragon-x-elite</category><category>on-device-inference</category><category>retrieval</category><category>edge-ai</category><author>Groundy Editorial</author></item><item><title>GraphRAG vs VectorRAG: Does the Graph Index Earn Its Cost?</title><link>https://groundy.com/articles/graphrag-vs-vectorrag-does-the-graph-index-earn-its-cost/</link><guid isPermaLink="true">https://groundy.com/articles/graphrag-vs-vectorrag-does-the-graph-index-earn-its-cost/</guid><description>A Samsung preprint finds vector retrieval matches GraphRAG on QA tasks at a fraction of the indexing cost, shifting the burden of proof to teams building graph pipelines.</description><pubDate>Sat, 13 Jun 2026 03:19:17 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>rag</category><category>graphrag</category><category>vector-search</category><category>knowledge-graphs</category><category>information-retrieval</category><category>llm-evaluation</category><author>Groundy Editorial</author></item><item><title>MiniMax M3 Ships 1M Context and Desktop Control as Open Weights</title><link>https://groundy.com/articles/minimax-m3-ships-1m-context-and-desktop-control-as-open-weights/</link><guid isPermaLink="true">https://groundy.com/articles/minimax-m3-ships-1m-context-and-desktop-control-as-open-weights/</guid><description>MiniMax M3 promises open weights with 1M-token context and frontier coding, but BenchLM ranks it #29 overall and #69 on multimodal. Teams need independent verification.</description><pubDate>Wed, 10 Jun 2026 05:48:38 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>minimax-m3</category><category>long-context</category><category>open-weights</category><category>sparse-attention</category><category>code-generation</category><category>model-benchmarks</category><author>Groundy Editorial</author></item><item><title>DeepSeek-V4 FlashMemory: Sparse Attention for Million-Token Context</title><link>https://groundy.com/articles/deepseek-v4-flashmemory-sparse-attention-for-million-token-context/</link><guid isPermaLink="true">https://groundy.com/articles/deepseek-v4-flashmemory-sparse-attention-for-million-token-context/</guid><description>FlashMemory&apos;s learned index compresses DeepSeek-V4&apos;s KV cache to 13.5% of baseline at parity accuracy. The project is suspended; per-suite recall breakdowns are not published.</description><pubDate>Wed, 10 Jun 2026 01:52:16 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>sparse-attention</category><category>kv-cache</category><category>deepseek</category><category>inference-optimization</category><category>long-context</category><category>llm-serving</category><author>Groundy Editorial</author></item><item><title>Is Cloudflare&apos;s Bot Traffic Surge Real? The Measurement Dispute</title><link>https://groundy.com/articles/is-cloudflares-bot-traffic-surge-real-the-measurement-dispute/</link><guid isPermaLink="true">https://groundy.com/articles/is-cloudflares-bot-traffic-surge-real-the-measurement-dispute/</guid><description>Cloudflare claims a 15x bot surge using a classifier that flags privacy browsers as bots. Audit your own logs before trusting the numbers behind Pay-Per-Crawl.</description><pubDate>Tue, 09 Jun 2026 14:59:26 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>cloudflare</category><category>bot-detection</category><category>ai-crawlers</category><category>web-infrastructure</category><category>pay-per-crawl</category><category>web-security</category><author>Groundy Editorial</author></item><item><title>Huawei&apos;s KVarN Puts KV-Cache Quantization Inside vLLM&apos;s Backend</title><link>https://groundy.com/articles/huaweis-kvarn-puts-kv-cache-quantization-inside-vllms-backend/</link><guid isPermaLink="true">https://groundy.com/articles/huaweis-kvarn-puts-kv-cache-quantization-inside-vllms-backend/</guid><description>Huawei&apos;s KVarN replaces vLLM&apos;s attention backend with a 2.3-bit KV-cache quantizer claiming FP16 accuracy on reasoning. Adopters must run a Huawei-maintained fork.</description><pubDate>Tue, 09 Jun 2026 00:15:21 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>kv-cache</category><category>vllm</category><category>quantization</category><category>inference</category><category>huawei</category><category>gpu-memory</category><author>Groundy Editorial</author></item><item><title>Indexing Images for RAG: kapa.ai&apos;s Approach to Multimodal Retrieval</title><link>https://groundy.com/articles/indexing-images-for-rag-kapa-ais-approach-to-multimodal-retrieval/</link><guid isPermaLink="true">https://groundy.com/articles/indexing-images-for-rag-kapa-ais-approach-to-multimodal-retrieval/</guid><description>kapa.ai&apos;s data shows indexing image captions at ingestion adds 1-6% query overhead versus 27-51% for raw query-time vision, shifting recall risk to caption fidelity.</description><pubDate>Sun, 07 Jun 2026 14:43:45 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-07T00:00:00.000Z</atom:updated><category>rag</category><category>multimodal-retrieval</category><category>image-indexing</category><category>vision-models</category><category>technical-documentation</category><category>retrieval-cost</category><author>Groundy Editorial</author></item><item><title>The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It</title><link>https://groundy.com/articles/the-rtx-spark-bet-on-unified-memory-for-local-llms-where-bandwidth-caps/</link><guid isPermaLink="true">https://groundy.com/articles/the-rtx-spark-bet-on-unified-memory-for-local-llms-where-bandwidth-caps/</guid><description>LLM decode is memory-bandwidth-bound, not capacity-bound. A 70B model on 128-bit LPDDR5X hits roughly 1 tok/s. When evaluating inference hardware, count GB/s, not GB.</description><pubDate>Sat, 06 Jun 2026 21:47:28 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>llm-inference</category><category>memory-bandwidth</category><category>unified-memory</category><category>lpddr5x</category><category>nvidia</category><category>hardware-evaluation</category><author>Groundy Editorial</author></item><item><title>Reading Vercel&apos;s Fluid Compute vs Cloudflare Workers Benchmark</title><link>https://groundy.com/articles/reading-vercels-fluid-compute-vs-cloudflare-workers-benchmark/</link><guid isPermaLink="true">https://groundy.com/articles/reading-vercels-fluid-compute-vs-cloudflare-workers-benchmark/</guid><description>Vercel benchmarks Fluid Compute 2.55x faster than Cloudflare Workers, but asymmetric configs and billing differences (CPU-ms vs GB-hour) make cost the real deciding factor.</description><pubDate>Sat, 06 Jun 2026 20:20:31 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>serverless</category><category>cloudflare-workers</category><category>vercel</category><category>fluid-compute</category><category>benchmarks</category><category>edge-computing</category><category>billing-models</category><author>Groundy Editorial</author></item><item><title>Does CUDA Tile Match Hand-Tuned Kernels on Hopper and Blackwell?</title><link>https://groundy.com/articles/does-cuda-tile-match-hand-tuned-kernels-on-hopper-and-blackwell/</link><guid isPermaLink="true">https://groundy.com/articles/does-cuda-tile-match-hand-tuned-kernels-on-hopper-and-blackwell/</guid><description>CUDA Tile reaches 2.5x FlashAttention-2 on Blackwell B200 but drops to 53% on RTX PRO 6000, while Triton holds 62-101% of cuBLAS across both architectures without tuning.</description><pubDate>Sat, 06 Jun 2026 09:16:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>cuda-tile</category><category>gpu-kernels</category><category>blackwell</category><category>triton</category><category>gpu-programming</category><category>nvidia</category><author>Groundy Editorial</author></item><item><title>Pod-Level Remote Attestation in Kubernetes: Confidential Workloads on dstack</title><link>https://groundy.com/articles/pod-level-remote-attestation-in-kubernetes-confidential-workloads-on-dstack/</link><guid isPermaLink="true">https://groundy.com/articles/pod-level-remote-attestation-in-kubernetes-confidential-workloads-on-dstack/</guid><description>dstack-capsule binds pod identity into Intel TDX hardware quotes, enabling multi-pod confidential VMs without the per-VM density tax of Confidential Containers.</description><pubDate>Sat, 06 Jun 2026 04:16:43 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>confidential-computing</category><category>kubernetes</category><category>remote-attestation</category><category>intel-tdx</category><category>confidential-containers</category><category>pod-security</category><author>Groundy Editorial</author></item><item><title>Generating GPU Kernels for Moore Threads Silicon: Can LLMs Break CUDA Lock-In?</title><link>https://groundy.com/articles/generating-gpu-kernels-for-moore-threads-silicon-can-llms-break-cuda-lock/</link><guid isPermaLink="true">https://groundy.com/articles/generating-gpu-kernels-for-moore-threads-silicon-can-llms-break-cuda-lock/</guid><description>MusaCoder trains a 9B model to emit native GPU kernels for Moore Threads&apos; MUSA architecture, claiming parity with frontier models on vendor-controlled benchmarks.</description><pubDate>Fri, 05 Jun 2026 23:15:44 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>gpu-kernels</category><category>moore-threads</category><category>llm-code-generation</category><category>cuda-alternatives</category><category>inference</category><category>musa</category><author>Groundy Editorial</author></item><item><title>Microsoft&apos;s Azure Linux Goes General-Purpose: The Container Base-Image Play</title><link>https://groundy.com/articles/microsofts-azure-linux-goes-general-purpose-the-container-base-image-play/</link><guid isPermaLink="true">https://groundy.com/articles/microsofts-azure-linux-goes-general-purpose-the-container-base-image-play/</guid><description>Microsoft&apos;s Azure Linux 4.0 extends the internal CBL-Mariner into a Fedora-based server OS for VMs. Preview gaps remain, and AKS teams should test now but wait for GA.</description><pubDate>Fri, 05 Jun 2026 18:27:20 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>azure-linux</category><category>kubernetes</category><category>supply-chain</category><category>containers</category><category>fedora</category><category>cloud-infrastructure</category><category>aks</category><author>Groundy Editorial</author></item><item><title>Cloudflare Acquires VoidZero, the Company Behind Vite&apos;s Rust Toolchain</title><link>https://groundy.com/articles/cloudflare-acquires-voidzero-the-company-behind-vites-rust-toolchain/</link><guid isPermaLink="true">https://groundy.com/articles/cloudflare-acquires-voidzero-the-company-behind-vites-rust-toolchain/</guid><description>Cloudflare acquired VoidZero, putting Vite, Rolldown, and Oxc maintainers on a deploy-target vendor&apos;s payroll. MIT licensing stays. Roadmap neutrality is the open question.</description><pubDate>Fri, 05 Jun 2026 16:08:31 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>vite</category><category>cloudflare</category><category>voidzero</category><category>open-source-governance</category><category>javascript-tooling</category><category>rolldown</category><category>oxc</category><author>Groundy Editorial</author></item><item><title>Putting a Datacenter V100 in a Gaming PC: The Local LLM Math</title><link>https://groundy.com/articles/putting-a-datacenter-v100-in-a-gaming-pc-the-local-llm-math/</link><guid isPermaLink="true">https://groundy.com/articles/putting-a-datacenter-v100-in-a-gaming-pc-the-local-llm-math/</guid><description>A used V100 looks like cheap VRAM for local inference, but no bf16, no FlashAttention, and CUDA 13 deprecation lock buyers into a software stack that is actively contracting.</description><pubDate>Fri, 05 Jun 2026 04:10:09 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>v100</category><category>local-inference</category><category>nvidia</category><category>cuda</category><category>gpu-hardware</category><category>volta</category><category>llm-inference</category><author>Groundy Editorial</author></item><item><title>Cost-Aware RAG Routing: When Deeper Retrieval Stops Paying Off</title><link>https://groundy.com/articles/cost-aware-rag-routing-when-deeper-retrieval-stops-paying-off/</link><guid isPermaLink="true">https://groundy.com/articles/cost-aware-rag-routing-when-deeper-retrieval-stops-paying-off/</guid><description>CA-RAG shows fixed top-k wastes 26% of billed tokens on simple queries with no quality gain. Per-query routing changes RAG unit economics more than any embedding swap.</description><pubDate>Thu, 04 Jun 2026 01:56:08 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-04T00:00:00.000Z</atom:updated><category>rag</category><category>retrieval-augmented-generation</category><category>cost-optimization</category><category>query-routing</category><category>vector-search</category><category>inference-cost</category><category>token-billing</category><author>Groundy Editorial</author></item><item><title>Using Your Nvidia GPU&apos;s VRAM as Linux Swap: Where the NBD Hack Breaks Down</title><link>https://groundy.com/articles/using-your-nvidia-gpus-vram-as-linux-swap-where-the-nbd-hack-breaks-down/</link><guid isPermaLink="true">https://groundy.com/articles/using-your-nvidia-gpus-vram-as-linux-swap-where-the-nbd-hack-breaks-down/</guid><description>NBD-VRAM exposes GeForce VRAM as Linux swap, but PCIe latency and the zero-sum trade with GPU compute workloads make zram the stronger choice on most systems.</description><pubDate>Wed, 03 Jun 2026 23:39:41 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>nbd-vram</category><category>vram-swap</category><category>linux-swap</category><category>zram</category><category>memory-management</category><category>gpu-memory</category><author>Groundy Editorial</author></item><item><title>Cloudflare Turnstile Now Fingerprints WebGL: The Privacy CAPTCHA Tradeoff</title><link>https://groundy.com/articles/cloudflare-turnstile-now-fingerprints-webgl-the-privacy-captcha-tradeoff/</link><guid isPermaLink="true">https://groundy.com/articles/cloudflare-turnstile-now-fingerprints-webgl-the-privacy-captcha-tradeoff/</guid><description>A researcher found Cloudflare Turnstile now demands fingerprintable WebGL to pass challenges, contradicting its privacy policy that lists only IP, TLS, and User-Agent signals.</description><pubDate>Sun, 31 May 2026 11:14:13 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>cloudflare-turnstile</category><category>webgl</category><category>browser-fingerprinting</category><category>captcha</category><category>web-privacy</category><category>bot-detection</category><author>Groundy Editorial</author></item><item><title>The Viral AWS Support Post Is a Warning About Cloud Escalation Paths</title><link>https://groundy.com/articles/the-viral-aws-support-post-is-a-warning-about-cloud-escalation-paths/</link><guid isPermaLink="true">https://groundy.com/articles/the-viral-aws-support-post-is-a-warning-about-cloud-escalation-paths/</guid><description>A viral AWS support post exposed a structural pattern: hyperscalers thinning human escalation paths, raising risk for single-vendor teams reliant on reaching a person.</description><pubDate>Fri, 29 May 2026 13:16:36 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>aws-support</category><category>cloud-escalation</category><category>single-vendor-risk</category><category>hyperscaler-automation</category><category>incident-response</category><category>cloud-infrastructure</category><author>Groundy Editorial</author></item><item><title>Why LLMs Still Botch Kubernetes Manifests: The Training-Data Gap</title><link>https://groundy.com/articles/why-llms-still-botch-kubernetes-manifests-the-training-data-gap/</link><guid isPermaLink="true">https://groundy.com/articles/why-llms-still-botch-kubernetes-manifests-the-training-data-gap/</guid><description>A 1.5B-parameter model hits 91.5% on Kubernetes YAML generation, but the remaining failures are syntactically valid manifests that deploy and quietly violate cluster intent.</description><pubDate>Wed, 27 May 2026 21:20:59 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-28T00:00:00.000Z</atom:updated><category>kubernetes</category><category>llm-code-generation</category><category>yaml</category><category>fine-tuning</category><category>devops</category><category>platform-engineering</category><author>Groundy Editorial</author></item><item><title>Cloudflare Flagship Is a Feature Flag Service That Deepens Platform Gravity</title><link>https://groundy.com/articles/cloudflare-flagship-is-a-feature-flag-service-that-deepens-platform-gravity/</link><guid isPermaLink="true">https://groundy.com/articles/cloudflare-flagship-is-a-feature-flag-service-that-deepens-platform-gravity/</guid><description>Cloudflare&apos;s Flagship is a feature flag service with a native Workers binding that replaces third-party flag providers, consolidating more of the edge stack under one vendor.</description><pubDate>Wed, 27 May 2026 16:40:59 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-27T00:00:00.000Z</atom:updated><category>feature-flags</category><category>cloudflare-workers</category><category>edge-computing</category><category>platform-consolidation</category><category>vendor-lock-in</category><category>openfeature</category><author>Groundy Editorial</author></item><item><title>Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point</title><link>https://groundy.com/articles/gemma-4-31b-on-cloud-tpu-vs-gpu-the-serving-cost-crossover-point/</link><guid isPermaLink="true">https://groundy.com/articles/gemma-4-31b-on-cloud-tpu-vs-gpu-the-serving-cost-crossover-point/</guid><description>TPU v6e Flex-start delivers 308M tokens per dollar for Gemma 4 31B prefill, undercutting H100 rates for open-weight serving, but production decode costs remain unquantified.</description><pubDate>Wed, 27 May 2026 10:20:46 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-27T00:00:00.000Z</atom:updated><category>inference-cost</category><category>gemma-4</category><category>tpu</category><category>gpu-inference</category><category>cloud-tpu</category><category>open-weight-models</category><author>Groundy Editorial</author></item><item><title>ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits</title><link>https://groundy.com/articles/objectcache-moves-kv-reuse-to-s3-class-storage-why-layerwise-retrieval-beats/</link><guid isPermaLink="true">https://groundy.com/articles/objectcache-moves-kv-reuse-to-s3-class-storage-why-layerwise-retrieval-beats/</guid><description>ObjectCache retrieves KV cache per-layer from S3, adding 5.6% TTFT at 64K context but 56-75 ms at 4K. Long-context deployments where DRAM is the bottleneck benefit most.</description><pubDate>Tue, 26 May 2026 19:28:05 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>kv-cache</category><category>vllm</category><category>inference-optimization</category><category>object-storage</category><category>llm-serving</category><category>sglang</category><author>Groundy Editorial</author></item><item><title>Vercel&apos;s CDN Origin Timeout Jumps to 2 Minutes: A Concession to LLM Streaming Workloads</title><link>https://groundy.com/articles/vercels-cdn-origin-timeout-jumps-to-2-minutes-a-concession-to-llm-streaming/</link><guid isPermaLink="true">https://groundy.com/articles/vercels-cdn-origin-timeout-jumps-to-2-minutes-a-concession-to-llm-streaming/</guid><description>Vercel raised its CDN origin timeout from 30s to 120s to support LLM streaming, removing a constraint that forced teams to route AI traffic through separate infrastructure.</description><pubDate>Tue, 26 May 2026 18:08:55 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>cdn</category><category>llm-streaming</category><category>vercel</category><category>serverless</category><category>edge-computing</category><category>sse</category><author>Groundy Editorial</author></item><item><title>Fluid Compute vs PgBouncer: Vercel&apos;s Undocumented Bet on Connection Reuse</title><link>https://groundy.com/articles/fluid-compute-vs-pgbouncer-vercels-undocumented-bet-on-connection-reuse/</link><guid isPermaLink="true">https://groundy.com/articles/fluid-compute-vs-pgbouncer-vercels-undocumented-bet-on-connection-reuse/</guid><description>Vercel&apos;s Fluid Compute claims to hold Postgres connections open across requests, potentially eliminating PgBouncer, but the claim lacks published technical specs.</description><pubDate>Tue, 26 May 2026 15:03:32 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>vercel-fluid-compute</category><category>connection-pooling</category><category>postgresql</category><category>pgbouncer</category><category>serverless</category><category>infrastructure</category><author>Groundy Editorial</author></item><item><title>Railway&apos;s GCP Suspension Is a Reseller PaaS Problem, Not a Google One</title><link>https://groundy.com/articles/railways-gcp-suspension-is-a-reseller-paas-problem-not-a-google-one/</link><guid isPermaLink="true">https://groundy.com/articles/railways-gcp-suspension-is-a-reseller-paas-problem-not-a-google-one/</guid><description>Railway&apos;s eight-hour outage shows why every reseller PaaS with a single upstream account is one billing flag away from total blackout, and what teams should audit now.</description><pubDate>Tue, 26 May 2026 10:49:16 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>reseller-paas</category><category>cloud-infrastructure</category><category>service-discovery</category><category>gcp</category><category>platform-reliability</category><category>incident-management</category><author>Groundy Editorial</author></item><item><title>Vercel Fluid Pools Database Connections Across Invocations, Bypassing External Poolers</title><link>https://groundy.com/articles/vercel-fluid-pools-database-connections-across-invocations-bypassing-external/</link><guid isPermaLink="true">https://groundy.com/articles/vercel-fluid-pools-database-connections-across-invocations-bypassing-external/</guid><description>Fluid Compute reuses Postgres connections across warm invocations via attachDatabasePool, dropping the pooler for simple apps but not for shared-database architectures.</description><pubDate>Mon, 25 May 2026 17:24:24 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>vercel-fluid</category><category>connection-pooling</category><category>postgres</category><category>serverless</category><category>pgbouncer</category><category>database-infrastructure</category><author>Groundy Editorial</author></item><item><title>Vercel CDN Request Collapsing: One Origin Fetch Per ISR Cache Miss</title><link>https://groundy.com/articles/vercel-cdn-request-collapsing-one-origin-fetch-per-isr-cache-miss/</link><guid isPermaLink="true">https://groundy.com/articles/vercel-cdn-request-collapsing-one-origin-fetch-per-isr-cache-miss/</guid><description>Vercel&apos;s CDN collapses concurrent ISR cache misses into one function invocation per region, reducing origin load but leaving dynamic routes and external origins exposed.</description><pubDate>Mon, 25 May 2026 14:35:12 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>isr</category><category>vercel</category><category>cdn</category><category>request-collapsing</category><category>nextjs</category><category>capacity-planning</category><category>edge-caching</category><author>Groundy Editorial</author></item><item><title>CISA Admin Leaked AWS GovCloud Keys on GitHub: What Federal Secret Scanning Missed</title><link>https://groundy.com/articles/cisa-admin-leaked-aws-govcloud-keys-on-github-what-federal-secret-scanning/</link><guid isPermaLink="true">https://groundy.com/articles/cisa-admin-leaked-aws-govcloud-keys-on-github-what-federal-secret-scanning/</guid><description>A CISA contractor left live AWS GovCloud admin keys on a public GitHub repo for six months, exposing gaps in federal credential hygiene and FedRAMP boundary monitoring.</description><pubDate>Mon, 25 May 2026 12:34:45 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>aws-govcloud</category><category>cisa</category><category>credential-leak</category><category>fedramp</category><category>secret-scanning</category><category>github-security</category><author>Groundy Editorial</author></item><item><title>What Cloudflare&apos;s Q1 2026 Outage Data Says About Designing for State-Level Shutdowns</title><link>https://groundy.com/articles/what-cloudflares-q1-2026-outage-data-says-about-designing-for-state-level/</link><guid isPermaLink="true">https://groundy.com/articles/what-cloudflares-q1-2026-outage-data-says-about-designing-for-state-level/</guid><description>Three state-ordered shutdowns, drone strikes on AWS data centers, and grid collapses in Q1 2026 prove that multi-region failover cannot survive country-level failure domains.</description><pubDate>Sun, 24 May 2026 17:21:36 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-24T00:00:00.000Z</atom:updated><category>internet-shutdowns</category><category>disaster-recovery</category><category>cloud-infrastructure</category><category>cloudflare</category><category>bgp-monitoring</category><category>multi-region-failover</category><author>Groundy Editorial</author></item><item><title>Railway&apos;s May 19 GCP Suspension Exposes the Single-Account Risk Underneath Every Reseller PaaS</title><link>https://groundy.com/articles/railways-may-19-gcp-suspension-exposes-the-single-account-risk-underneath-every/</link><guid isPermaLink="true">https://groundy.com/articles/railways-may-19-gcp-suspension-exposes-the-single-account-risk-underneath-every/</guid><description>Railway&apos;s eight-hour outage shows that multi-cloud data planes mean nothing when the control plane lives on a single provider account that can be suspended without notice.</description><pubDate>Sat, 23 May 2026 17:36:07 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>gcp</category><category>railway</category><category>cloud-outage</category><category>control-plane</category><category>paas</category><category>multi-cloud</category><category>infrastructure-resilience</category><author>Groundy Editorial</author></item><item><title>vLLM 0.21 Makes Prefill-Decode Disaggregation Actually Practical</title><link>https://groundy.com/articles/vllm-v0-21-adds-bi-directional-kv-cache-transfers-between-prefill-and-decode/</link><guid isPermaLink="true">https://groundy.com/articles/vllm-v0-21-adds-bi-directional-kv-cache-transfers-between-prefill-and-decode/</guid><description>vLLM v0.21 reportedly adds bi-directional KV cache transfers between prefill and decode nodes, making P/D ratios dynamic and requiring new NIXL transfer telemetry.</description><pubDate>Sat, 23 May 2026 10:25:12 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>vllm</category><category>kv-cache</category><category>disaggregated-serving</category><category>nixl</category><category>inference-infrastructure</category><category>gpu-scheduling</category><author>Groundy Editorial</author></item><item><title>DMax Hits 1,338 Tokens/Sec on 2x H200: Parallel Decoding Pushes dLLM Serving Past the Autoregressive Bar</title><link>https://groundy.com/articles/dmax-hits-1-338-tokens-sec-on-2x-h200-parallel-decoding-pushes-dllm-serving/</link><guid isPermaLink="true">https://groundy.com/articles/dmax-hits-1-338-tokens-sec-on-2x-h200-parallel-decoding-pushes-dllm-serving/</guid><description>DMax reformulates diffusion LLM decoding as embedding refinement, achieving 1,338 tok/s on 2× H200 and challenging ParallelBench&apos;s parallel-decoding quality trade-off finding.</description><pubDate>Tue, 19 May 2026 14:57:29 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-19T00:00:00.000Z</atom:updated><category>diffusion-llm</category><category>parallel-decoding</category><category>llm-serving</category><category>inference</category><category>sglang</category><category>throughput</category><category>llada</category><author>Groundy Editorial</author></item><item><title>Kioxia and Dell&apos;s 10 PB in 2RU: What Storage Density Means for Cluster Power and Rebuild Windows</title><link>https://groundy.com/articles/kioxia-and-dells-10-pb-in-2ru-what-storage-density-means-for-cluster-power/</link><guid isPermaLink="true">https://groundy.com/articles/kioxia-and-dells-10-pb-in-2ru-what-storage-density-means-for-cluster-power/</guid><description>Kioxia and Dell packed 9.8 PB into a 2U server. At 245 TB per drive, rebuilds take 14-27 hours, forcing teams to retune erasure coding for production clusters.</description><pubDate>Mon, 18 May 2026 17:38:28 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>storage-density</category><category>nvme</category><category>data-center</category><category>erasure-coding</category><category>flash-storage</category><category>dell</category><category>ssd</category><author>Groundy Editorial</author></item><item><title>KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode</title><link>https://groundy.com/articles/kv-cache-offloading-breaks-on-context-intensive-tasks-text2json-exposes/</link><guid isPermaLink="true">https://groundy.com/articles/kv-cache-offloading-breaks-on-context-intensive-tasks-text2json-exposes/</guid><description>ShadowKV-style KV cache offloading methods pass NIAH and RULER but collapse on synthesis tasks. Text2JSON quantifies the gap; YAKV&apos;s per-key selection fixes it.</description><pubDate>Mon, 18 May 2026 11:21:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>kv-cache</category><category>inference</category><category>long-context</category><category>llm-benchmarks</category><category>kv-offloading</category><category>quantization</category><author>Groundy Editorial</author></item><item><title>Crawshaw&apos;s &apos;I Am Building a Cloud&apos;: What a Tailscale Co-Founder&apos;s Solo Stack Implies for Platform Teams</title><link>https://groundy.com/articles/crawshaws-i-am-building-a-cloud-what-a-tailscale-co-founders-solo-stack-implies/</link><guid isPermaLink="true">https://groundy.com/articles/crawshaws-i-am-building-a-cloud-what-a-tailscale-co-founders-solo-stack-implies/</guid><description>David Crawshaw&apos;s exe.dev launched with $35M, giving platform teams a concrete alternative to the Kubernetes default that forces TCO justification for cloud-native overhead.</description><pubDate>Wed, 29 Apr 2026 09:30:40 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-04-29T00:00:00.000Z</atom:updated><category>infrastructure</category><category>kubernetes</category><category>cloud-costs</category><category>bare-metal</category><category>platform-engineering</category><category>tailscale</category><category>exe-dev</category><author>Groundy Editorial</author></item><item><title>UCCL-Zip: Lossless Compression for NCCL, 47.5% Faster RL Sync, 10% Lower vLLM Latency</title><link>https://groundy.com/articles/uccl-zip-brings-lossless-compression-to-nccl-collectives-475-faster-rl-weight/</link><guid isPermaLink="true">https://groundy.com/articles/uccl-zip-brings-lossless-compression-to-nccl-collectives-475-faster-rl-weight/</guid><description>UCCL-Zip fuses lossless compression into NCCL and GPU P2P transfers, cutting RL weight sync by 47.5% and vLLM latency by 10% with no API changes and bit-identical outputs.</description><pubDate>Fri, 24 Apr 2026 12:07:10 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>nccl</category><category>lossless-compression</category><category>gpu-communication</category><category>vllm</category><category>rl-training</category><category>prefill-decode</category><category>distributed-inference</category><author>Groundy Editorial</author></item><item><title>Ingress-Nginx Is Dead, Not Deprecated: Final CVE Patches Shipped, But Platform Teams Need a Migration Plan</title><link>https://groundy.com/articles/ingress-nginx-is-dead-not-deprecated-the-final-cve-patches-shipped-but-platform/</link><guid isPermaLink="true">https://groundy.com/articles/ingress-nginx-is-dead-not-deprecated-the-final-cve-patches-shipped-but-platform/</guid><description>ingress-nginx was retired March 24, 2026. CVE-2026-4342 patches shipped March 19, but no future fixes are coming. How platform teams should pick a migration path.</description><pubDate>Thu, 23 Apr 2026 11:41:23 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>kubernetes</category><category>networking</category><category>ingress</category><category>gateway-api</category><category>security</category><category>migration</category><author>Groundy Editorial</author></item><item><title>OpenRAG: The Open-Source RAG Platform Challenging Pinecone</title><link>https://groundy.com/articles/openrag-open-source-rag-platform-challenging/</link><guid isPermaLink="true">https://groundy.com/articles/openrag-open-source-rag-platform-challenging/</guid><description>OpenRAG combines Langflow, OpenSearch, and Docling into a single deployable RAG platform. Here&apos;s how it compares to managed services like Pinecone.</description><pubDate>Fri, 27 Mar 2026 19:24:22 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><category>ai-infrastructure</category><category>rag</category><category>open-source</category><author>Groundy Editorial</author></item><item><title>MLX vs llama.cpp on Apple Silicon: Which Runtime to Use for Local LLM Inference</title><link>https://groundy.com/articles/mlx-vs-llamacpp-on-apple-silicon-which-runtime-to-use-for-local-llm-inference/</link><guid isPermaLink="true">https://groundy.com/articles/mlx-vs-llamacpp-on-apple-silicon-which-runtime-to-use-for-local-llm-inference/</guid><description>MLX delivers 20-87% faster generation on Apple Silicon for models under 14B parameters. llama.cpp wins for cross-platform use and long contexts.</description><pubDate>Tue, 24 Mar 2026 16:48:05 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-04T00:00:00.000Z</atom:updated><category>mlx</category><category>llama-cpp</category><category>apple-silicon</category><category>on-device-inference</category><category>macos</category><category>local-llm</category><author>Groundy Editorial</author></item><item><title>Prefill-Decode Disaggregation: The Architecture Shift Redefining LLM Serving at Scale</title><link>https://groundy.com/articles/prefill-decode-disaggregation-the-architecture-shift-redefining-llm-serving-at-scale/</link><guid isPermaLink="true">https://groundy.com/articles/prefill-decode-disaggregation-the-architecture-shift-redefining-llm-serving-at-scale/</guid><description>Prefill-decode disaggregation separates compute-bound prefill from memory-bound decode onto dedicated hardware, eliminating phase interference.</description><pubDate>Tue, 24 Mar 2026 16:04:14 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>LLM serving</category><category>prefill</category><category>decode disaggregation</category><category>inference infrastructure</category><category>Mooncake</category><author>Groundy Editorial</author></item><item><title>Google LiteRT: Running LLMs on Your Phone Without the Cloud</title><link>https://groundy.com/articles/google-litert-running-llms-your-phone-without/</link><guid isPermaLink="true">https://groundy.com/articles/google-litert-running-llms-your-phone-without/</guid><description>Google&apos;s LiteRT (formerly TensorFlow Lite) is now the production backbone for on-device GenAI across Android, Chrome, and Pixel devices. Here&apos;s what it means for developers building AI apps that run privately, without the cloud.</description><pubDate>Mon, 23 Mar 2026 09:34:21 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-15T00:00:00.000Z</atom:updated><category>ai-infrastructure</category><category>mobile</category><category>edge-ai</category><author>Groundy Editorial</author></item><item><title>Microsoft&apos;s BitNet: How 1-Bit LLMs Could Make GPU Farms Obsolete</title><link>https://groundy.com/articles/microsoft-s-bitnet-how-1-bit-llms-could-make-gpu-farms/</link><guid isPermaLink="true">https://groundy.com/articles/microsoft-s-bitnet-how-1-bit-llms-could-make-gpu-farms/</guid><description>Microsoft&apos;s BitNet inference framework runs billion-parameter LLMs on ordinary CPUs using ternary weights, delivering up to 6x faster inference and 82% lower energy consumption, potentially upending the assumption that AI inference requires expensive GPU hardware.</description><pubDate>Fri, 13 Mar 2026 17:06:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><category>ai-infrastructure</category><category>models</category><category>hardware</category><author>Groundy Editorial</author></item><item><title>WebAssembly AI: Running Models in the Browser</title><link>https://groundy.com/articles/webassembly-ai-running-models/</link><guid isPermaLink="true">https://groundy.com/articles/webassembly-ai-running-models/</guid><description>WebAssembly enables production-ready AI inference directly in the browser, no server required. Learn how WASM, WebGPU, and modern frameworks make client-side ML practical, what the performance trade-offs actually look like, and when to use it.</description><pubDate>Sat, 28 Feb 2026 18:18:13 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>webassembly</category><category>browser-ai</category><author>Groundy Editorial</author></item><item><title>Tailscale Peer Relays: The Missing Piece for True P2P Networking</title><link>https://groundy.com/articles/tailscale-peer-relays-missing-piece-true-p2p/</link><guid isPermaLink="true">https://groundy.com/articles/tailscale-peer-relays-missing-piece-true-p2p/</guid><description>Tailscale Peer Relays became generally available on February 18, 2026, enabling high-throughput peer-to-peer relaying within your own infrastructure. This feature eliminates the performance bottleneck of DERP servers when NAT traversal fails, delivering true mesh networking even in restrictive network environments.</description><pubDate>Thu, 19 Feb 2026 19:49:47 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-11T00:00:00.000Z</atom:updated><category>networking</category><category>vpn</category><category>infrastructure</category><category>p2p</category><author>Groundy Editorial</author></item><item><title>DNS-Persist-01 Validation: Let&apos;s Encrypt&apos;s Model for Permanent ACME Certificate Authorization</title><link>https://groundy.com/articles/letsencrypt-dns-persist-01/</link><guid isPermaLink="true">https://groundy.com/articles/letsencrypt-dns-persist-01/</guid><description>DNS-Persist-01 proposes persistent DNS TXT records for ACME certificate validation, removing per-renewal DNS updates as certificate lifetimes shrink toward 47 days by 2029.</description><pubDate>Thu, 19 Feb 2026 18:04:22 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-25T00:00:00.000Z</atom:updated><category>web-platform</category><category>security</category><category>infrastructure</category><category>ssl</category><category>dns</category><author>Groundy Editorial</author></item><item><title>The Complete Guide to Local LLMs</title><link>https://groundy.com/articles/the-complete-guide-to-local-llms/</link><guid isPermaLink="true">https://groundy.com/articles/the-complete-guide-to-local-llms/</guid><description>Why running AI on your own hardware is becoming the default choice for privacy-conscious developers and enterprises that need data sovereignty, cost control, and low latency.</description><pubDate>Thu, 12 Feb 2026 18:57:59 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-08T00:00:00.000Z</atom:updated><category>local-ai</category><category>ollama</category><category>llamacpp</category><category>vllm</category><category>privacy</category><category>self-hosted-ai</category><author>Groundy Editorial</author></item></channel></rss>