Every CUDA Kernel Pays a Launch Tax: The Host-to-Device Walkthrough

A CUDA kernel launch is a host-to-device round trip that produces no FLOPs until the driver has marshalled parameters, enqueued a command, the GPU has fetched it, and the scheduler has set up execution. NVIDIA’s own measurement of a PyTorch element-wise kernel put that round trip at roughly 20 microseconds on a Tesla V100, before the kernel ran its first instruction. (NVIDIA Nsight Systems guide)

Every line of CUDA you write eventually funnels into one of two APIs layered over the same NVIDIA kernel-mode driver. Knowing which one is on the call stack clarifies what is and is not under your control, and why a toolkit upgrade sometimes breaks a build that a driver upgrade would not.

Which CUDA API are you actually calling, Runtime or Driver?

Most CUDA code in the wild calls the Runtime API (cuda* symbols), a convenience layer that auto-initializes the context, links against cudart, and compiles through nvcc with the <<<grid, block>>> launch syntax. (CUDA Runtime vs Driver API, the mental model)

The Driver API (cu* symbols) sits underneath it. You call cuInit explicitly, manage contexts and modules by handle, and the marshalling is visible rather than implied. The one property that matters operationally is that the Driver API maintains a stable ABI: NVIDIA guarantees existing cu* symbols keep their binary contract across driver releases. That is what makes old-runtime-plus-new-driver work as forward compatibility, and it is also why new-runtime-plus-old-driver can fail at initialization, since the Runtime API is pinned to the toolkit version rather than the driver. (CUDA Runtime vs Driver API, the mental model)

Neither API is faster. The Runtime API is a thin wrapper; the <<<>>> launch syntax desugars to a Driver API call. The distinction is about control and compatibility, not throughput.

What happens during a single kernel launch?

At the Driver API level, cuLaunchKernel takes kernel arguments through an explicit parameter buffer keyed by CU_LAUNCH_PARAM_BUFFER_POINTER and CU_LAUNCH_PARAM_BUFFER_SIZE, terminated by CU_LAUNCH_PARAM_END. (CUDA Driver API types, CUDA 12.0 archive)

That buffer is the host’s contribution to the marshalling, and the Runtime API’s <<<>>> syntax is what wraps it. The modern path also attaches per-launch attributes through CUlaunchAttributeValue: execution priority (CU_LAUNCH_ATTRIBUTE_PRIORITY), cluster dimensions, cooperative launch flags, programmatic stream serialization, launch-completion events, and synchronization policy. (CUlaunchAttributeValue docs) These are the knobs that sit between the host call and the GPU’s scheduling decision.

So a launch is not “call a function and the GPU runs it.” It is: serialize parameters into a buffer, attach per-launch attributes, enqueue the resulting command onto a stream, and wait for the GPU’s command processor to fetch and schedule it. The GPU side is not instantaneous either. It includes context switches and waits imposed by stream ordering on the device, which is why two launches on the same stream can show different effective startup costs.

Where do the launch microseconds go?

NVIDIA splits launch overhead into three buckets, and the split matters because each bucket has a different fix. (NVIDIA Nsight Systems guide)

CPU wrapper overhead. The full duration of the host-side launch API call, including driver mutex contention. This is what you pay inside cudaLaunchKernel before anything reaches the device.
Memory overhead. CPU↔GPU data movement, tracked separately from kernel time.
GPU launch overhead. The GPU retrieving the command and beginning execution, including context switches and stream-ordering waits.

The 20-microsecond V100 figure is launch latency as NVIDIA defines it: the interval between the start of the launch API call and the start of kernel execution, which by definition includes the duration of the API call itself. (NVIDIA Nsight Systems guide) Read it as an end-to-end number, not a GPU-side number.

Why do small kernels pay the most?

Launch overhead is a fixed latency largely independent of workload size, so the cost dominates when the kernel itself is cheap.

The arithmetic is brutal at the small-kernel end. If a launch API call takes 10 microseconds, the host can enqueue at most 100,000 kernels per second, regardless of how fast the GPU is (NVIDIA Nsight Systems guide). That ceiling is set by the host-side call, not the silicon. A kernel that does 100 microseconds of work loses roughly 10 percent to launch overhead. A kernel that does 5 microseconds of work spends the bulk of its wall-clock time not computing.

There is a subtlety the Nsight timeline makes obvious once you know to read it. When a sequence of asynchronous launches shows rising launch latency across its run, that is not inefficiency; it means the CPU is enqueueing faster than the GPU consumes. That is the goal. You want the GPU, not the host, to be the bottleneck, and a host that gets ahead is a host that is successfully hiding the API cost behind device execution. (NVIDIA Nsight Systems guide)

For the memory-overhead bucket specifically, the standard hiding technique is overlapping kernel execution with asynchronous host↔device copies through streams: the GPU runs one kernel while input for the next uploads and output from the previous downloads. (NVIDIA Nsight Systems guide)

How much can you cut the launch tax?

A community benchmark on an RTX 5070 Ti (CUDA 13.0, driver 580.126.09) measured per-kernel time across four overhead-elimination strategies against a naive baseline. (cuda-kernel-launch-lab)

Strategy	Per-kernel time	What it removes
Naive baseline	52.92 µs	nothing
CUDA Graph	46.98 µs	~6 µs/launch of CPU wrapper overhead
Kernel Fusion	44.43 µs	launch tax plus intermediate memory traffic
Mega Kernel	43.92 µs	all launch overhead, by collapsing launches
Dynamic Parallelism	49.35 µs	partial: shifts launches to the device

The ranking is the takeaway, not the absolute microseconds. CUDA Graphs eliminate per-launch CPU wrapper overhead by replaying a captured sequence, which is why frameworks increasingly wrap decode loops in graphs. Kernel fusion removes both the launch tax and intermediate memory traffic at once. A mega-kernel collapses every launch into one. Dynamic parallelism only partially helps, because it moves launches onto the device rather than removing them, so it trades host-side overhead for device-side recursion overhead.

What does launch overhead do to inference?

For autoregressive inference, launch overhead is nearly free during prefill and severe during decode at small batch sizes. (Inferensys glossary on kernel launch overhead)

Prefill amortizes the tax over large GEMMs. Decoding, where each step emits a token or two, makes the per-launch cost visible. At batch size n=1 the overhead is severe enough to shape the whole latency budget; at large batches it is effectively hidden, because each launch does proportionally more useful work. (Inferensys glossary on kernel launch overhead)

The practical consequence is that the operator-fusion and batch-size tuning that inference frameworks advertise as bandwidth optimizations are, to a first approximation, also launch-overhead-hiding optimizations. Continuous batching’s value is not only better arithmetic utilization. It pushes decode from the single-batch regime, where the per-launch tax dominates, into the regime where the tax is amortized across many requests sharing each kernel. The GPU bandwidth story gets the press releases; the driver-queue story is what makes single-batch decoding slow.

These mechanics run on a toolchain that is itself moving. CUDA 13.2, the release NVIDIA foregrounds on its current toolkit page, extends CUDA Tile support to Ampere and Ada architectures, adds closures and recursion to the cuTile Python surface, and unifies the ARM ecosystem into a single toolkit. (CUDA Toolkit page) None of that changes the host-to-device launch model described here, but it is the 13.x line your driver and runtime are now built against, which is why the absolute microsecond figures in this piece carry a “re-measure on your stack” caveat rather than a guarantee.

When the tax is the whole story

The launch tax is the part of GPU programming that is easy to forget because the marketing is always about throughput and memory bandwidth. The kernels that matter for interactive inference are not the big GEMMs. They are the small ones: normalization layers, element-wise ops, the pointwise fused kernels that frameworks increasingly fold into single launches precisely because launching them separately would cost more than running them. The driver queue, not the memory bus, is where single-batch decoding quietly spends its budget, and the frameworks that win on tokens-per-second at batch size one tend to be the ones that have collapsed the most launches, not the ones with the fastest kernels.

Frequently Asked Questions

Does the same launch-overhead model apply to AMD ROCm or HIP?

HIP mirrors CUDA’s Runtime API nearly symbol-for-symbol, so hipLaunchKernel wraps a parameter buffer the same way cuLaunchKernel does. The three-bucket taxonomy transfers; the absolute microseconds differ because ROCm routes through a different user-mode driver and command processor, not because the model changes.

Why does dynamic parallelism only partially help compared to a mega-kernel?

Dynamic parallelism requires compute capability 3.5 or higher and charges a device-side scheduling cost on every parent-to-child spawn that can exceed the host wrapper it displaced. It relocates the tax to the GPU rather than deleting it, which is why the 5070 Ti benchmark ranks it between the naive baseline and CUDA Graphs.

How do you measure one launch’s overhead without the sync cost contaminating it?

Bracket the kernel with two CUDA events on the same stream and read the elapsed time asynchronously, which isolates the launch interval. cudaDeviceSynchronize-based timing folds the queue drain into the measurement, the same scope inflation that pushed the RTX 5070 Ti figure to 52.92 microseconds per kernel. Use Nsight Systems for the timeline and Nsight Compute to profile a single kernel.

Which of the three overhead buckets does a CUDA Graph actually remove?

Replay skips the per-launch CPU wrapper cost, the bucket the Graph column in the 5070 Ti benchmark trims by roughly six microseconds. The GPU launch bucket persists, because every node in the graph still triggers a command-processor fetch, and clusters or cooperative kernels inside a graph require cuLaunchKernelEx rather than the plain launch path.

Is there a way to eliminate the per-launch tax rather than just hide it?

Persistent kernels keep one launch resident on the SMs and use cooperative groups or grid synchronization to swap logical work without returning to the driver queue, which is how some low-latency inference engines dodge host-side launches between tokens. The cost is lower occupancy and worse scheduler flexibility when the workload is sparse, so the technique suits steady decode loops more than bursty prefill.