IonRouter (Cumulus Labs, YC W26) is an OpenAI-compatible inference API that routes LLM and multimodal requests through a custom C++ engine called IonAttention, purpose-built for NVIDIA’s GH200 Grace Hopper hardware. The result: approximately 2x the throughput of top-tier alternatives like Together AI on comparable workloads, at roughly half the market rate per token — with no idle fees.
What Problem Does IonRouter Actually Solve?
The LLM inference market has fractured into two uncomfortable extremes. On one side: managed providers like Together AI and Fireworks AI, which offer clean APIs, solid performance, and straightforward pricing — but at premium cost. On the other: self-hosted options like Modal and RunPod that are cheaper per GPU-hour but require substantial infrastructure engineering overhead to operate reliably at scale.
Neither option works well for teams building AI pipelines with real production requirements and budget pressure simultaneously. That gap is where Cumulus Labs positioned IonRouter.
As the founders stated in their Hacker News launch thread: “Inference providers are either fast-but-expensive (like Together and Fireworks) or cheap-but-DIY (like Modal and RunPod).” IonRouter is their attempt at a third option: managed convenience at self-hosted economics.
How IonRouter Works: The IonAttention Engine
The core differentiation isn’t routing in the traditional sense — IonRouter doesn’t broker between third-party providers. Instead, Cumulus Labs built their own inference engine, IonAttention, from scratch in C++, and optimized it specifically for the NVIDIA GH200 Grace Hopper Superchip.
The GH200 is architecturally unusual. It unifies a 72-core ARM CPU and a Hopper GPU via NVLink-C2C, a coherent interconnect running at 900 GB/s, giving both processors access to a shared 452 GB LPDDR5X memory pool. Standard inference stacks built for x86+H100 don’t exploit this architecture; IonAttention was designed from the start to exploit it.[2]
Three specific optimizations drive the performance gains:
1. Coherent Memory for Dynamic CUDA Graphs
Traditional CUDA graph execution is static: once a computation graph is captured, its parameters are fixed. Patching nodes to update variable-length inputs introduces overhead. IonAttention exploits the GH200’s hardware cache coherence: the CPU updates runtime parameters in shared memory, and the GPU picks them up automatically without re-capture. This delivers 10–20% lower decode latency on variable-length workloads.[3]
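A plain-Python analogy can make the coherence trick concrete. This is a conceptual sketch only, not the CUDA Graphs API or IonAttention's actual code: the "captured graph" holds a reference to a shared parameter buffer and reads the current value at replay time, the way the GPU sees CPU writes through the coherent link without the graph being re-captured. All names here are illustrative.

```python
# Conceptual sketch (plain Python, no CUDA): a "captured graph" whose nodes
# read runtime parameters from a shared buffer at replay time, so the CPU
# can change sequence lengths without re-capturing the graph.

class CapturedGraph:
    def __init__(self, shared_params):
        # The graph keeps a reference to coherent shared memory,
        # not a frozen copy of the parameters.
        self.shared_params = shared_params

    def replay(self, tokens):
        # Each replay reads the *current* value of seq_len -- the analogue
        # of the GPU seeing CPU writes arrive through NVLink-C2C.
        seq_len = self.shared_params["seq_len"]
        return tokens[:seq_len]

shared = {"seq_len": 4}          # stands in for the coherent memory pool
graph = CapturedGraph(shared)    # captured once, never re-captured

print(graph.replay(list(range(8))))  # -> [0, 1, 2, 3]
shared["seq_len"] = 6                # CPU patches the parameter in place
print(graph.replay(list(range(8))))  # -> [0, 1, 2, 3, 4, 5]
```

The contrast with a static capture is that nothing is re-recorded between the two replays; only a value in shared memory changes.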
2. Eager KV Cache Writeback
In standard inference runtimes, KV cache eviction (moving cached keys and values out of fast GPU memory to accommodate new requests) is synchronous and blocking — causing visible stalls. IonAttention streams completed KV blocks to the large LPDDR5X pool in the background using separate CUDA streams, exploiting the immutability of filled blocks. The result: blocking eviction latency drops from 10ms+ to under 0.25ms — a 40x reduction. The bidirectional coherent link provides an additional 1.2x speedup on swap operations.
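The mechanism can be sketched in ordinary Python, with a background thread standing in for a dedicated CUDA stream. This is a hedged illustration of the idea, not IonAttention's implementation; the class and tier names are invented, and real HBM/LPDDR5X transfers are replaced by dictionary copies.

```python
# Illustrative sketch: completed, immutable KV blocks are streamed to a
# slow tier in the background, so eviction never blocks the decode path.
# A Python thread stands in for a separate CUDA writeback stream.

import queue
import threading

class EagerWriteback:
    def __init__(self):
        self.fast_tier = {}            # stands in for HBM
        self.slow_tier = {}            # stands in for the LPDDR5X pool
        self.pending = queue.Queue()
        writer = threading.Thread(target=self._drain, daemon=True)
        writer.start()

    def fill_block(self, block_id, kv):
        # A filled block is immutable, so it is safe to copy it out in the
        # background while decoding continues.
        self.fast_tier[block_id] = kv
        self.pending.put(block_id)

    def _drain(self):
        while True:
            block_id = self.pending.get()
            self.slow_tier[block_id] = self.fast_tier[block_id]
            self.pending.task_done()

    def evict(self, block_id):
        # By eviction time the copy usually already exists in the slow
        # tier, so "eviction" reduces to dropping the fast-tier reference.
        self.pending.join()            # sketch: wait out any in-flight copy
        del self.fast_tier[block_id]
        return self.slow_tier[block_id]

cache = EagerWriteback()
cache.fill_block("req0:blk0", [1.0, 2.0, 3.0])
evicted = cache.evict("req0:blk0")
```

In the real engine the equivalent of `pending.join()` is almost never hit, because filled blocks are written back long before memory pressure forces an eviction; that is where the claimed 10ms-to-sub-0.25ms reduction comes from.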
3. Phantom-Tile Attention Scheduling
Small-batch inference underutilizes GPU streaming multiprocessors. IonAttention deliberately over-provisions the GPU compute grid with extra “phantom” tiles that exit early via bounds checks, keeping idle SMs occupied. This reduces attention computation time by over 60% in underutilized scenarios and contributes 10–20% higher throughput at high concurrency.
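A toy version of the scheduling idea, in plain Python rather than a CUDA kernel: the grid is launched with a fixed, over-provisioned tile count, and tiles whose start offset falls past the end of the work exit early on a bounds check. The function and parameter names are made up for illustration; on real hardware the point of the phantom tiles is occupancy, which a sequential loop cannot show.

```python
# Toy sketch of phantom-tile over-provisioning: launch more tiles than the
# work needs, and let out-of-range tiles exit early via a bounds check.
# In a real kernel the extra tiles keep idle SMs occupied; here we count them.

def run_tiles(seq_len, tile_size, grid_tiles):
    """Process a sequence of seq_len tokens with a fixed grid of grid_tiles."""
    real, phantom = 0, 0
    for tile_idx in range(grid_tiles):
        start = tile_idx * tile_size
        if start >= seq_len:
            phantom += 1    # phantom tile: bounds check fails, exits early
            continue
        real += 1           # real tile: attention math would run here
    return real, phantom

# A 100-token sequence with 16-token tiles needs 7 real tiles; launching a
# fixed 12-tile grid adds 5 cheap phantom tiles.
real, phantom = run_tiles(seq_len=100, tile_size=16, grid_tiles=12)
```

Because a phantom tile does nothing but fail one comparison, the cost of over-provisioning is negligible relative to the scheduling flexibility it buys.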
Additional optimizations include GPU-side temperature and top-p sampling (reducing sampling latency from 37–50ms to ~150 microseconds per step), three-stream compute/prefetch/writeback pipelining, and speculative draft models running on the GH200’s ARM cores.
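For readers unfamiliar with the sampling step that IonAttention reportedly moves onto the GPU, here is the underlying math in pure Python: temperature scaling, a stable softmax, and nucleus (top-p) truncation. This is a reference sketch of the standard algorithm, not Cumulus Labs' CUDA code, and the function name is our own.

```python
# Reference sketch of temperature + top-p (nucleus) sampling, the step the
# engine reportedly runs GPU-side. Pure Python, stdlib only.

import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest high-probability set whose cumulative mass
    # reaches top_p (the "nucleus").
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample one token id from the renormalized nucleus.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

token = sample_top_p([2.0, 1.0, 0.2, -1.0], temperature=0.7, top_p=0.9)
```

The sort and cumulative scan are what make this step expensive on the CPU per decode step; fusing them into the GPU pipeline is what the quoted 37–50ms to ~150µs improvement refers to.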
Performance Benchmarks
The headline numbers from Cumulus Labs’ published benchmarks:[3]
| Workload | IonRouter (GH200) | Together AI | Improvement |
|---|---|---|---|
| Qwen3-VL-8B (vision-language) | 588 tok/s | 298 tok/s | ~2x |
| Qwen2.5-7B-Instruct (language) | 7,167 tok/s | ~3,000 tok/s | ~2.4x |
| Concurrent VLM streams | 5 models, 1 GPU | — | — |
| KV cache eviction latency | <0.25ms | 10ms+ | 40x |
Throughput and latency trade off differently depending on batch size and concurrency. IonRouter’s architecture particularly benefits high-concurrency, throughput-bound workloads — AI video pipelines, robotics perception systems, multi-stream analysis — not necessarily chat applications where individual request latency dominates.
Pricing and Market Positioning
IonRouter uses per-million-token pricing with no idle fees:
| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| GPT-OSS-120B | $0.020 | $0.095 |
| Kimi-K2.5 | $0.20 | $1.60 |
| Qwen3.5-122B | $0.20 | $1.60 |
| Wan2.2 (video) | $0.00194/GPU·sec | — |
| Flux Schnell (image) | ~$0.005/image | — |
Custom fine-tuned models and LoRAs run on dedicated GPU streams at approximately $8–10 per GPU-hour, with per-second billing and no cold starts.
For context: Andreessen Horowitz’s LLMflation analysis tracks inference costs decreasing roughly 10x per year since 2021. What cost $60/million tokens in 2021 now costs $0.06/million for equivalent performance.[4] This secular cost decline means the economics of inference are a moving target — but it also means optimization-focused providers can command meaningful margins against commoditized GPU spot pricing.
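To make the per-token pricing concrete, a quick back-of-envelope using the GPT-OSS-120B rates from the table above. The monthly token volumes below are hypothetical, chosen only to show the arithmetic.

```python
# Back-of-envelope cost check using the per-million-token prices listed
# above. Token volumes are hypothetical.

def monthly_cost(input_tokens_m, output_tokens_m, price_in, price_out):
    """Dollar cost for token volumes given in millions of tokens."""
    return input_tokens_m * price_in + output_tokens_m * price_out

# GPT-OSS-120B at the listed $0.020 / $0.095 per M tokens, for a
# hypothetical 500M input / 100M output tokens per month:
cost = monthly_cost(500, 100, 0.020, 0.095)   # -> 19.5 dollars
```

At these rates, even a moderately heavy batch workload lands in the tens of dollars per month, which is why the no-idle-fee framing matters more than the per-token delta for bursty users.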
Who Built This and Why Now?
Cumulus Labs was founded by Veer Shah, who led ML infrastructure and Linux kernel development for Space Force and NASA contracts, and Suryaa, who built GPU orchestration infrastructure at TensorDock and production systems at Palantir. Their background isn’t in AI product development — it’s in infrastructure.
That background is evident in the product. The team’s decision to build a custom inference engine rather than wrapping vLLM or TensorRT-LLM is a significant commitment. Most API-layer inference providers are thin wrappers. Building IonAttention at the C++ level, with hardware-specific optimizations for the GH200’s unique memory architecture, represents a different bet: that hardware heterogeneity in the inference market will persist long enough to justify the investment in purpose-built runtimes.
The timing reflects a market reality: the inference chip market is projected to exceed $50 billion in 2026, and NVIDIA’s GH200/GB200 line is designed specifically for inference workloads where the traditional H100 architecture (optimized for training) isn’t ideal. First-movers on new hardware with optimized software stacks have a window before the standard frameworks catch up.
The Multi-Model Serving Claim
One capability that stands out technically is the “5 VLMs, 1 GPU” result — running five concurrent vision-language models with 2,700 concurrent video clips on a single GH200. This is enabled by the sub-750ms model switching time (achieved by preserving CUDA graphs between model loads) and the large 452 GB shared memory pool.
For comparison, most inference stacks require either dedicated GPU allocation per model or accept significant cold-start penalties when switching. The ability to multiplex multiple models on single hardware could meaningfully reduce per-model infrastructure overhead for teams running diverse model portfolios — robotics companies running separate models for perception, planning, and control, for instance.
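A rough way to see why sub-750ms switching makes multiplexing viable: model the fraction of wall-clock time lost to switches. The 0.75s figure comes from the claim above; the switch rate fed in is hypothetical.

```python
# Rough amortization model for model multiplexing on one GPU. Only the
# 0.75s switch time comes from the article; the switch rate is hypothetical.

def switching_overhead(switches_per_hour, switch_s=0.75):
    """Fraction of an hour spent on model switches."""
    return switches_per_hour * switch_s / 3600.0

# Even switching every 30 seconds (120 switches/hour), the GPU loses only
# 2.5% of its time to switches:
overhead = switching_overhead(120)   # -> 0.025
```

With cold starts measured in tens of seconds instead, the same switch rate would consume most of the hour, which is the practical difference between "dedicated GPU per model" and multiplexing.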
What’s Unproven
The Hacker News launch discussion raised legitimate questions that remain open:
- Quantization transparency: Users asked what quantization levels the hosted models use, which directly affects output quality and cost comparisons.
- Privacy policy: No published privacy terms comparable to enterprise providers like Google Vertex AI or AWS Bedrock, which matters for regulated industries.
- Input caching pricing: No published pricing for cached input tokens, a significant cost lever for long-context and multi-turn workloads.
- Documentation depth: Community feedback flagged gaps in technical documentation, particularly for advanced features.
These aren’t disqualifying for early-stage infrastructure companies, but they’re real friction points for enterprise adoption. The product is clearly built for developers willing to trade polish for performance at this stage.
Why This Matters for Practitioners
The inference infrastructure layer is rarely covered with the same attention as foundation models, but it’s where the majority of production AI cost accumulates. As GPUnex analysis notes, inference represents two-thirds of AI compute in 2026 — and that fraction grows as more organizations move from experimentation to production.
The practical implication of IonRouter’s approach: if a team is running throughput-bound workloads — AI video, robotics perception, batch document processing — and currently paying Together AI or Fireworks rates, the math on switching is straightforward. At equivalent throughput, half the cost per token is half the infrastructure bill.
The harder question is whether hardware-specific software optimization is a durable moat, or a temporary advantage that evaporates as vLLM and other general-purpose stacks add GH200 optimizations. Historically, hardware-aware compilers and runtimes do maintain advantages — but the gap narrows over time. Cumulus Labs is betting they can stay ahead of general-purpose frameworks by continuing to build deeper into each new hardware generation.
```python
# IonRouter is drop-in compatible with OpenAI clients
from openai import OpenAI

client = OpenAI(
    api_key="your-ionrouter-key",
    base_url="https://api.ionrouter.io/v1",  # only change required
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Analyze this frame..."}],
)
```

The migration path for existing OpenAI SDK users requires changing one line of configuration, a deliberate design choice that removes adoption friction for teams already invested in the OpenAI client ecosystem.
Frequently Asked Questions
Q: How does IonRouter differ from OpenRouter? A: OpenRouter is a meta-router that routes requests to third-party inference providers and charges a markup on top of their pricing. IonRouter runs its own inference infrastructure on NVIDIA GH200 hardware with a custom engine, competing directly with providers like Together AI and Fireworks on throughput and price per token.
Q: Is IonRouter suitable for latency-sensitive applications like real-time chat? A: Not at present for all workloads — the team has acknowledged P50 latency of ~1.46 seconds versus ~0.74 seconds for some comparable providers. IonRouter is optimized for throughput-bound workloads (AI video, robotics perception, batch processing) rather than low-latency interactive applications. Benchmark your specific use case before committing.
Q: Can I run custom fine-tuned models on IonRouter? A: Yes. IonRouter supports custom fine-tuned models and LoRA adapters on dedicated GPU streams with per-second billing at approximately $8–10/GPU-hour. Model switching is available in under 750ms with CUDA graph preservation.
Q: What models does IonRouter currently support? A: As of March 2026, IonRouter supports 9+ language models (including Kimi-K2.5, Qwen3.5-122B, GPT-OSS-120B), 7+ video generation models (including Wan2.2), 4+ vision models, and 3+ audio models. The catalog is expanding.
Q: How does IonRouter’s pricing compare in the broader context of LLM cost trends? A: A16z’s LLMflation research tracks inference costs declining roughly 10x annually since 2021. IonRouter’s GPT-OSS-120B pricing ($0.02/$0.095 input/output per million tokens) sits well below the managed provider market rate for 100B+ parameter models, roughly consistent with the current cost frontier for open-weight models as of early 2026.
Sources:
- Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference
- IonAttention: Grace Hopper–Native Inference | Cumulus Labs Blog
- IonRouter
- Welcome to LLMflation | a16z
- More Compute for AI, Not Less | Deloitte
- AI Inference Economics: The 1,000× Cost Collapse | GPUnex
- Cumulus Labs | Y Combinator
- NVIDIA GH200 Grace Hopper Superchip Accelerates Inference 2x | NVIDIA
- LLM Inference Benchmarking | NVIDIA Technical Blog
- Best Inference Providers for AI Agents 2026 | Fast.io
- 16 Best OpenRouter Alternatives | Premai
- LLM Inference Optimization Techniques | Hakia
Footnotes
1. Deloitte. “More Compute for AI, Not Less.” Technology, Media and Telecom Predictions 2026. https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/compute-power-ai.html
2. NVIDIA. “GH200 Grace Hopper Superchip.” NVIDIA Data Center. https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/
3. Cumulus Labs. “IonAttention: Grace Hopper–Native Inference.” Cumulus Blog. https://cumulus.blog/ionattention
4. Andreessen Horowitz. “Welcome to LLMflation.” a16z. https://a16z.com/llmflation-llm-inference-cost/
5. Premai. “16 Best OpenRouter Alternatives for Private, Production AI.” 2026. https://blog.premai.io/best-openrouter-alternatives-for-private-production-ai/