CUDA Tile, Nvidia’s tile-based programming model introduced in CUDA Toolkit 13.1, promises to make GPU kernel authorship accessible to developers who don’t write WMMA intrinsics for a living. A new arXiv preprint benchmarks that promise against hand-tuned kernels, vendor libraries, and Triton across two architectures. The short answer: Tile closes the gap on one GPU, leaves it wide open on another, and trails Triton in portability.
What CUDA Tile promises
CUDA 13.1, released in late 2025, introduced the CUDA Tile IR and a Python interface called cuTile. The pitch is straightforward: express computation over tiles rather than threads, and the compiler maps those tiles to the hardware. Initial support covered Blackwell GPUs (compute capability 10.x and 12.x). CUDA 13.2 later extended Tile to Ampere and Ada architectures and added closures and recursion to the cuTile Python layer.
The practitioner appeal is real. Writing a performant GEMM kernel in CUDA requires handling shared-memory tiling, warp-level matrix operations, and memory coalescing by hand. By the authors’ count, a cuTile GEMM kernel fits in 22 lines of Python, compared to 123 lines for an equivalent WMMA implementation.
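The tile-level view can be illustrated without any GPU code. The following is a minimal NumPy sketch, not cuTile code and not the paper's kernel: the function name `tiled_matmul` and the tile sizes are assumptions chosen for illustration. It shows the shape of the computation a tile programmer describes, where each output tile accumulates products of input tiles along K and nothing is said about threads or warps.

```python
import numpy as np

def tiled_matmul(a, b, tile_m=32, tile_n=32, tile_k=32):
    """Compute a @ b one output tile at a time (illustrative only)."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float32)
    # Each (i, j) pair plays the role of one tile-level program instance.
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            rows = min(tile_m, m - i)
            cols = min(tile_n, n - j)
            acc = np.zeros((rows, cols), dtype=np.float32)
            # March along K, accumulating one tile product per step.
            for p in range(0, k, tile_k):
                a_tile = a[i:i + tile_m, p:p + tile_k]
                b_tile = b[p:p + tile_k, j:j + tile_n]
                acc += a_tile @ b_tile
            c[i:i + tile_m, j:j + tile_n] = acc
    return c
```

On a GPU the two outer loops would become the launch grid and the compiler would decide how each tile maps to shared memory and matrix units; that mapping is the part the article's benchmarks are probing.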
Benchmark setup
The paper, updated to v2 on June 3, 2026, evaluates cuTile on two Blackwell-class GPUs: Nvidia’s datacenter B200 and the RTX PRO 6000 (sm_120). The workloads are GEMM and fused attention, the two operations that dominate transformer inference and training time. Baselines include cuBLAS (Nvidia’s vendor-optimized library), FlashAttention-2, and Triton. All throughput figures below are the authors’ measurements; readers cannot reproduce them without identical testbeds.
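For readers unfamiliar with how TFLOP/s figures for these two workloads are usually derived, here is a minimal sketch of the common accounting. It is an assumption that the paper counts FLOPs this way; the article does not say. The helper names are hypothetical. GEMM is counted as one multiply and one add per inner-product term, and attention forward is counted as its two matrix multiplications with softmax ignored, which is the convention used in FlashAttention-style benchmarks.

```python
def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n)."""
    return 2 * m * n * k

def attention_forward_flops(batch, heads, seq_len, head_dim, causal=False):
    """FLOPs for the QK^T and PV matmuls in attention forward; softmax is ignored."""
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    # A causal mask skips roughly half of the score matrix.
    return flops // 2 if causal else flops

def tflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured runtime into TFLOP/s."""
    return flops / seconds / 1e12
```

Because the result depends on which operations are counted and whether masking is credited, throughput numbers from different papers are only comparable when the accounting matches.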
The headline result: attention on B200
On the B200, cuTile achieves up to 1007 TFLOP/s for fused attention with a kernel written in 60 lines of Python, which the authors report as a 2.5x speedup over FlashAttention-2. This is the number that will circulate in conference talks and Nvidia’s own marketing materials. It is legitimate as measured, and it demonstrates that the Tile abstraction can reach high throughput on the architecture it was designed for.
For practitioners already running on Blackwell datacenter hardware, this result is immediately relevant. Fused attention is a bottleneck in every transformer training run. A 2.5x improvement over FlashAttention-2 in a maintainable Python kernel is not a marginal optimization.
The gap: GEMM and cross-architecture regressions
GEMM tells a different story. cuTile reaches 52-79% of cuBLAS throughput depending on matrix dimensions, according to the paper. cuBLAS has had years of architecture-specific tuning for every memory path and warp configuration; a 22-line kernel does not match that. The gap is consistent and significant. cuTile is a practical replacement for hand-written CUDA kernels, but not yet for vendor-optimized libraries.
The cross-architecture picture is worse. The same cuTile attention kernel that hits 1007 TFLOP/s on the B200, 2.5x FlashAttention-2, achieves only 53% of FlashAttention-2 throughput on the RTX PRO 6000 (sm_120). One kernel, two Nvidia GPUs, and a roughly 5x swing in performance relative to the same baseline (2.5x versus 0.53x). The abstraction is not architecture-agnostic. What the compiler optimizes well for Blackwell datacenter silicon, it does not automatically port to Blackwell client silicon.
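The arithmetic behind the cross-architecture comparison is worth spelling out, because the two results are quoted in different units. The sketch below uses only the three figures reported above; the variable names are illustrative, and it assumes the 1007 TFLOP/s peak and the 2.5x speedup refer to the same configuration, which the article does not confirm.

```python
b200_tflops = 1007.0         # cuTile attention on B200 (reported)
b200_speedup_vs_fa2 = 2.5    # cuTile / FlashAttention-2 on B200 (reported)
rtx_fraction_of_fa2 = 0.53   # cuTile / FlashAttention-2 on RTX PRO 6000 (reported)

# FlashAttention-2 throughput on B200 implied by the two B200 figures.
implied_fa2_b200 = b200_tflops / b200_speedup_vs_fa2

# How far cuTile's standing against the same baseline moves between GPUs.
relative_swing = b200_speedup_vs_fa2 / rtx_fraction_of_fa2

print(f"Implied FlashAttention-2 on B200: {implied_fa2_b200:.0f} TFLOP/s")
print(f"Swing relative to FlashAttention-2: {relative_swing:.1f}x")
```

Under that assumption the implied FlashAttention-2 figure on the B200 is about 403 TFLOP/s, and the swing relative to the baseline is about 4.7x. This is a ratio of relative standings, not of absolute TFLOP/s, since the article gives no absolute figure for the RTX PRO 6000.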
The portability question: Triton vs. cuTile
The quiet comparison in this paper is with Triton, which sustains 62-101% of cuBLAS across all tested platforms without architecture-specific tuning. That range is wider than cuTile’s GEMM showing (52-79% of cuBLAS) but higher at both ends, and Triton achieves it without the per-architecture variance that drops cuTile to 53% of FlashAttention-2 on the RTX PRO 6000.
| Metric | cuTile | Triton |
|---|---|---|
| GEMM vs. cuBLAS | 52-79% | 62-101% |
| Cross-architecture variance | High (attention: 2.5x FA2 on B200 vs. 0.53x FA2 on RTX PRO 6000) | Low (GEMM: 62-101% of cuBLAS on all tested platforms) |
| Code size (GEMM) | 22 lines | Not reported here |
| Architecture-specific tuning | Required for peak performance | Not required |
Triton’s advantage is portability without tuning. cuTile’s advantage, where it wins, is absolute throughput on specific hardware. These are different tradeoffs, and the right choice depends on whether a team optimizes for a single deployed architecture or for fleet heterogeneity.
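A quick way to compare the two GEMM ranges from the table is to look at their floors, ceilings, and widths side by side. The sketch below does only that; the `summarize` helper is hypothetical and the inputs are the percentages quoted above.

```python
# GEMM throughput as a percentage of cuBLAS, as quoted in the table.
ranges = {"cuTile": (52, 79), "Triton": (62, 101)}

def summarize(lo, hi):
    """Describe a range by its endpoints, width, and midpoint."""
    return {"floor": lo, "ceiling": hi, "width": hi - lo, "midpoint": (lo + hi) / 2}

summary = {name: summarize(lo, hi) for name, (lo, hi) in ranges.items()}
for name, stats in summary.items():
    print(name, stats)
```

Triton's band is higher at both ends (midpoint 81.5 versus 65.5) but also wider (39 points versus 27), so its consistency lies in holding that band across platforms rather than in a tighter spread.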
What this means for practitioners
cuTile lowers the barrier to writing GPU kernels from “must understand warp-level matrix instructions” to “must understand tile semantics.” That is a genuine reduction in specialist labor. A 22-line GEMM kernel that reaches up to 79% of cuBLAS is good enough for many production workloads where the kernel is not the bottleneck.
But the bottleneck does not disappear. It shifts. Where the constraint used to be “can anyone on the team write a CUDA kernel,” it becomes “can anyone tune the cuTile compiler’s output for a specific GPU variant.” The B200 results show the ceiling is high. The RTX PRO 6000 results show the floor is low. And Triton demonstrates that an alternative abstraction can deliver more consistent cross-architecture performance without the tuning step.
For teams standardizing on Blackwell datacenter hardware, cuTile is worth adopting now for attention workloads. For teams running mixed fleets or client GPUs, Triton’s portability advantage makes it the safer default until cuTile’s compiler matures. The paper’s v2 update is dated June 3, 2026; these numbers will likely improve as Nvidia iterates on the Tile compiler, but the architectural dependency looks like a property of the abstraction, not a bug in the implementation.
Frequently Asked Questions
Does cuTile support GPUs older than Blackwell?
CUDA 13.2 extended Tile to Ampere (sm_80, sm_86) and Ada (sm_89), but the benchmark paper tests only Blackwell-class devices. The 52-79% cuBLAS and 53% FA2 figures come exclusively from B200 and RTX PRO 6000. Performance on older architectures remains unmeasured in independent work.
Why does Triton deliver more consistent results across GPU architectures?
Triton compiles through MLIR and LLVM to PTX, inheriting years of upstream optimization passes that apply across GPU generations. cuTile’s compiler uses a newer IR with fewer mature optimization passes. The paper shows Triton sustaining 62-101% of cuBLAS with zero per-device tuning, while cuTile’s attention throughput swings from 2.5x FlashAttention-2 on the B200 to 53% of it on the RTX PRO 6000.
What programming language do teams write cuTile kernels in?
cuTile is a Python-only interface. CUDA 13.2 added closures and recursion to the language, but no C++ or CUDA C++ surface is documented as of the paper’s publication. Teams with existing C++ inference servers face a language boundary: they must either invoke compiled cuTile kernels through the driver API or carry a Python runtime dependency.
Does the 2.5x attention speedup transfer to other fused operations?
The paper benchmarks only GEMM and fused attention. FlashAttention-2 has a specific memory access pattern (online softmax with tiled QKV) that maps naturally onto Tile’s block abstraction. Other fused operations such as layer normalization, custom reductions, or mixed dropout-activation patterns were not tested, so the speedup cannot be assumed to generalize.
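To make the "online softmax with tiled QKV" pattern concrete, here is a minimal NumPy sketch of the general technique. It is not the paper's kernel and not cuTile code; the function names and tile size are assumptions. Keys and values are consumed one tile at a time while a running row maximum and running normalizer are maintained, so the full score matrix is never materialized.

```python
import numpy as np

def attention_reference(q, k, v):
    """Plain softmax attention that materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_online_softmax(q, k, v, tile=16):
    """Same result, computed over K/V tiles with running softmax statistics."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n_q, 1), -np.inf)
    running_sum = np.zeros((n_q, 1))
    acc = np.zeros((n_q, v.shape[1]))
    for start in range(0, k.shape[0], tile):
        k_tile = k[start:start + tile]
        v_tile = v[start:start + tile]
        scores = (q @ k_tile.T) * scale
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale everything accumulated so far to the new running maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_tile
        running_max = new_max
    # Normalize once at the end.
    return acc / running_sum
```

The loop body is two tile-sized matrix products plus elementwise rescaling, which is why this workload fits a tile abstraction well. Operations without that structure may map less cleanly.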
What would cuTile need to close the GEMM gap with cuBLAS?
cuBLAS benefits from years of per-architecture autotuning across tile sizes, pipeline depths, and warp configurations that a 22-line kernel cannot replicate. cuTile’s compiler would need comparable autotuning infrastructure plus tighter scheduling of shared-memory-to-register traffic. The 52-79% range indicates the current compiler leaves substantial achievable throughput unused for GEMM.
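As a rough picture of what "autotuning across tile sizes and pipeline depths" involves, the sketch below runs an exhaustive search over a small configuration space. Everything in it is hypothetical: the parameter names (`tile_m`, `tile_n`, `stages`), the candidate values, and the `benchmark` callback stand in for whatever a real tuner would measure, and nothing here reflects how cuBLAS or the cuTile compiler actually tunes.

```python
import itertools
import time

def time_callable(fn, repeats=5):
    """Return the best wall-clock time of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def autotune(search_space, benchmark):
    """Try every configuration and return (best_config, best_cost)."""
    keys = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[key] for key in keys)):
        config = dict(zip(keys, values))
        cost = benchmark(config)  # lower is better, e.g. seconds per launch
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost

# Hypothetical search space; a real one is architecture-specific.
search_space = {"tile_m": [64, 128], "tile_n": [64, 128], "stages": [2, 3, 4]}
```

In practice `benchmark` would launch the kernel with the given configuration and time it, for example with `time_callable`. The best configuration generally differs per GPU and per problem shape, which is the per-architecture work the paragraph above refers to.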