Vercel's In-Function Concurrency: What It Does to Cold Starts and Billing

Vercel’s in-function concurrency lets one warm function instance absorb multiple simultaneous requests instead of spinning up a new instance per invocation. The practical effect is twofold: cold starts happen less often because each warm instance does more work, and billing moves toward active CPU rather than idle duration, so handlers that spend most of their time waiting on a database stop being billed for the wait. The win is real but workload-shaped, and it quietly revokes one of serverless’s oldest promises: the guarantee that one request means one isolated execution.

How does in-function concurrency change the serverless model?

Under classic Vercel serverless, one request mapped to one container instance: a request arrived, an instance booted or warmed, served it, then went idle or died. In-function concurrency, the core of what Vercel calls Fluid compute, collapses that unit. A single instance now handles multiple concurrent invocations by filling the idle gaps in already-running instances, so a function blocked on a database reply can service a second request in the meantime.

Mechanically, the old model isolated each invocation in its own container, what Autonoma’s analysis summarizes as “one request, one container instance,” with each instance independent under burst traffic. Fluid replaces that by handling multiple invocations within a single function instance, per the Fluid docs. The launch material frames the ceiling as a many-to-one model that “can handle tens of thousands of concurrent invocations.” The interesting part is not the marketing ceiling but the inversion of the isolation boundary. What used to be a hard wall between requests becomes a shared process, and that has consequences further down.
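To make the shared-process idea concrete, here is a minimal sketch of two invocations interleaving on one Node.js event loop. It is a toy model, not Vercel’s scheduler: `fakeDbQuery`, `handler`, and `serveOnOneInstance` are hypothetical names, and the "database call" is just a promise that resolves on a later microtask.

```typescript
// Stand-in for a network call: resolves on a later microtask instead of doing real I/O.
const fakeDbQuery = (): Promise<void> => Promise.resolve();

// One invocation: records when it starts and when it finishes.
async function handler(requestId: string, timeline: string[]): Promise<void> {
  timeline.push(`${requestId}:start`);
  await fakeDbQuery(); // the process is idle here, so another invocation can run
  timeline.push(`${requestId}:end`);
}

// Dispatch every request into the same process without waiting for the previous one.
async function serveOnOneInstance(requestIds: string[]): Promise<string[]> {
  const timeline: string[] = [];
  await Promise.all(requestIds.map((id) => handler(id, timeline)));
  return timeline;
}
```

Calling `serveOnOneInstance(["a", "b"])` yields the timeline `a:start, b:start, a:end, b:end`: the second request begins during the first request’s wait, which is exactly the gap the old one-container-per-request model left unused.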

The optimized concurrency that drives the model is available only for the Node.js and Python runtimes. Two June 2026 additions listed in the Vercel changelog (WebSocket support reaching public beta, and zero-config Node server deployments) indicate that this is now the baseline compute surface new deployments inherit rather than a setting to discover.

What happens to cold starts under in-function concurrency?

Cold starts become less frequent, not extinct. Because each warm instance serves more requests, a given burst needs fewer fresh instances, so fewer requests pay the boot tax. Vercel pairs this with a Rust-based runtime and, on Node.js 20+, bytecode caching that pre-compiles function code after its first execution, which softens the cold starts that still occur.

Two cold-start triggers remain structural regardless. A region with no warm instances still boots cold, and a new deployment still brings new code that has to initialize. The third-party Autonoma analysis is blunt about this: Fluid reduces cold-start frequency but does not eliminate cold starts. Anyone who reads a vendor cold-start headline and budgets for zero startup latency on a fresh deploy or a new region is reading the chart wrong.
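The frequency argument is simple arithmetic, sketched below. This is a back-of-the-envelope model under stated assumptions, not a description of Vercel’s autoscaler: it assumes every request in the burst arrives at once and that each instance accepts exactly `perInstanceLimit` requests. The function name and parameters are hypothetical.

```typescript
// Estimate how many fresh instances a burst forces the platform to boot.
function coldBootsForBurst(
  burstSize: number,
  warmInstances: number,
  perInstanceLimit: number,
): number {
  if (!Number.isInteger(perInstanceLimit) || perInstanceLimit < 1) {
    throw new RangeError("perInstanceLimit must be a positive integer");
  }
  // Instances required to hold the whole burst at the given per-instance limit.
  const instancesNeeded = Math.ceil(burstSize / perInstanceLimit);
  // Only the shortfall beyond the already-warm pool has to boot cold.
  return Math.max(0, instancesNeeded - warmInstances);
}
```

For a burst of 100 requests with 5 warm instances, a limit of 1 implies 95 cold boots and a limit of 10 implies 5. With no warm instances at all, as in a new region, the same burst at a limit of 10 still implies 10 cold boots, which is why the frequency drops but never reaches zero.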

How does billing change: active CPU versus idle duration?

Vercel’s compute pricing shifts away from wall-clock duration toward billing based on actual compute usage. The Fluid launch page describes this as “billing based on actual compute usage,” and Autonoma’s analysis frames it as a shift from pure invocation count to CPU time consumed. The decisive change for cost is the move to active CPU. Under the old model, a function blocked for two seconds on a slow downstream API still billed you for two seconds of compute, because the instance was alive and occupied. Under active-CPU billing, a handler that is simply waiting on network no longer bills for the wait.
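The difference between the two billing bases can be sketched per request. This is a deliberately simplified model: it compares billed time only and ignores any memory or per-invocation charges, and the `RequestProfile` shape and function names are hypothetical, not part of any Vercel API.

```typescript
interface RequestProfile {
  cpuMs: number;  // time spent actually executing code
  waitMs: number; // time spent blocked on network or other I/O
}

// Wall-clock billing: the instance is occupied for the whole request.
const wallClockBilledMs = (p: RequestProfile): number => p.cpuMs + p.waitMs;

// Active-CPU billing: only the time spent executing counts.
const activeCpuBilledMs = (p: RequestProfile): number => p.cpuMs;

// Fraction of billed time that disappears when the wait stops being billed.
function billedTimeReduction(p: RequestProfile): number {
  const before = wallClockBilledMs(p);
  return before === 0 ? 0 : 1 - activeCpuBilledMs(p) / before;
}
```

Take the two-second downstream call from the paragraph above and assume, purely for illustration, 50 ms of real CPU work around it: `{ cpuMs: 50, waitMs: 2000 }` bills 2050 ms on wall clock and 50 ms on active CPU. A handler with `{ cpuMs: 2000, waitMs: 0 }` sees a reduction of 0, because there is no idle time to stop billing.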

That is a genuine structural improvement for the right workload, and it is the foundation of every cost-savings number Vercel publishes. It is also why those numbers should be read as conditional rather than universal.

Reconciling the savings numbers

Two figures circulate, and they are not measuring the same thing.

Vercel’s marketing headline claims “Reduce compute costs by up to 85%,” per the Fluid compute launch changelog. The beta changelog is more conservative, reporting that overlapping invocations “can increase efficiency 20%-50%, reducing gigabyte-hours and lowering costs.”

These are consistent only if you read them as describing different points on the same curve. The headline figure is a ceiling for best-case, heavily I/O-bound traffic; the changelog range is the efficiency band Vercel reported for the beta. Treat the savings as a function of how much of your handler’s time is idle I/O, not as a flat discount.

The workload fork: when do I/O-bound handlers win and CPU-bound handlers regress?

Whether concurrency saves you money or costs you latency depends entirely on what your function does while a request is open.

For I/O-bound handlers (database queries, external API calls), the instance spends most of its time waiting on the network, so concurrent requests share that wait with little CPU contention. This is where the efficiency gains actually accrue. For CPU-bound functions (image processing, cryptography, anything that saturates a core), concurrent requests share the same CPU and contend for it. Vercel is unusually direct about this in the beta changelog: the capability “may increase latency for purely CPU-bound workloads.” During the beta, Vercel caps the number of concurrent invocations per instance and raises the limit gradually, which is itself an admission that the safe ceiling is workload-dependent.

The cost-savings claim is therefore workload-shaped. A function that calls an LLM API and blocks for eight seconds is an ideal candidate. A function that resizes images is not, and under concurrency it may get slower and bill more, not less.
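A rough way to see the fork is to estimate the slowest request’s latency when several requests share one core. The sketch below is an assumption-heavy model, not a measurement: it assumes a single core, that all requests arrive together, that each request does its CPU work first and then waits, and that CPU phases run one after another while waits overlap freely. `worstCaseLatencyMs` is a hypothetical helper.

```typescript
// Latency of the last request to finish when `concurrent` requests share one core.
function worstCaseLatencyMs(concurrent: number, cpuMs: number, waitMs: number): number {
  // CPU phases serialize (concurrent * cpuMs); the final request then still has to wait.
  return concurrent * cpuMs + waitMs;
}
```

Under these assumptions, an image resizer needing 200 ms of CPU goes from 200 ms alone to 2000 ms for the slowest of ten concurrent requests. An LLM proxy with 5 ms of CPU and an 8000 ms wait goes from 8005 ms to 8050 ms, a difference that is lost in the noise of the upstream call.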

The shared-state trap: module-level mutations and race conditions

Concurrency removes the isolation guarantee that masked a whole class of latent bugs.

A common serverless pattern is to cache a database client or config object in a module-level variable, initialized once outside the handler. Under one-request-one-instance, that variable was effectively single-tenant: only one request touched it at a time. Under in-function concurrency, that same variable is shared across concurrent requests on the same instance. Any mutation of it (a counter, a lazily initialized client, a mutable cache) is now a potential race condition that the old per-instance isolation used to hide. In Node.js the interleaving happens at every await: one request can read the variable, yield while it waits on I/O, and then write back a value that another request has since changed.
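The lost-update version of this bug fits in a few lines. The sketch below is illustrative only: `requestCount`, `simulateIo`, and the two handlers are hypothetical, and the I/O is simulated with an already-resolved promise so the interleaving is deterministic.

```typescript
// Module-level state: shared by every request that lands on this instance.
let requestCount = 0;

// Stand-in for a database or API call; yields control back to the event loop.
const simulateIo = (): Promise<void> => Promise.resolve();

// Racy: reads the counter, yields, then writes back a value that may be stale.
async function racyHandler(): Promise<void> {
  const seen = requestCount;
  await simulateIo();
  requestCount = seen + 1;
}

// Safer: the read-modify-write happens with no await in between.
async function safeHandler(): Promise<void> {
  await simulateIo();
  requestCount += 1;
}

// Fire `n` concurrent invocations at one handler and report the final count.
async function runBurst(handler: () => Promise<void>, n: number): Promise<number> {
  requestCount = 0;
  await Promise.all(Array.from({ length: n }, () => handler()));
  return requestCount;
}
```

`runBurst(racyHandler, 3)` resolves to 1, because all three invocations read 0 before any of them writes. `runBurst(safeHandler, 3)` resolves to 3. With one request per instance the racy version never misbehaved, which is why the bug could sit unnoticed.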

Vercel frames this correctly as a pre-existing code smell rather than a Fluid bug, per the Autonoma analysis: the bug was always there, the platform just stopped masking it. The same logic applies to event-loop contention. A handler that does heavy synchronous work now blocks other concurrent requests on the same instance instead of sitting alone in its own instance.
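Event-loop contention can be shown the same way. In this sketch (all names hypothetical, and the CPU work is a throwaway loop), a light request whose timer is already due cannot complete until a heavy handler on the same event loop gives up the thread. Splitting the heavy work into chunks that yield between steps is one common mitigation, shown here as an assumption about your code rather than anything the platform does for you.

```typescript
// Yield to the event loop so timers and other requests get a turn.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(() => resolve(), 0));

// Throwaway synchronous work standing in for image processing, hashing, and so on.
function burnCpu(iterations: number): number {
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += i % 7;
  return acc;
}

// Heavy handler that never yields: everything else on the instance waits.
async function blockingHandler(log: string[]): Promise<void> {
  burnCpu(3000000);
  log.push("heavy:done");
}

// Same total work, but it yields between chunks.
async function cooperativeHandler(log: string[]): Promise<void> {
  for (let chunk = 0; chunk < 3; chunk++) {
    burnCpu(1000000);
    await yieldToEventLoop();
  }
  log.push("heavy:done");
}

// A cheap request that only needs one turn of the event loop.
async function lightHandler(log: string[]): Promise<void> {
  await yieldToEventLoop();
  log.push("light:done");
}

// Start the light request first, then the heavy one, on the same event loop.
async function shareInstance(heavy: (log: string[]) => Promise<void>): Promise<string[]> {
  const log: string[] = [];
  await Promise.all([lightHandler(log), heavy(log)]);
  return log;
}
```

With `blockingHandler`, the light request finishes second even though it started first. With `cooperativeHandler`, it finishes first, because its timer gets to fire at the first yield.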

Error isolation, at least, is preserved. The Fluid docs specify that when uncaught exceptions or unhandled rejections occur in Node.js, Fluid logs the error and lets in-flight requests finish before stopping the process, so one broken request does not crash the others sharing the instance.
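The drain-then-stop behavior described above follows a general pattern that can be sketched independently of the platform. The class below is not Vercel’s implementation and none of its names come from the Fluid docs; it only illustrates the idea of refusing new work after a fatal error while letting in-flight requests finish.

```typescript
class DrainingInstance {
  private inFlight = 0;
  private accepting = true;
  private onDrained: (() => void) | null = null;

  // Number of requests currently being served.
  get pending(): number {
    return this.inFlight;
  }

  // Run one request; refuse it if the instance has started draining.
  async handle<T>(work: () => Promise<T>): Promise<T> {
    if (!this.accepting) throw new Error("instance is draining");
    this.inFlight++;
    try {
      return await work();
    } finally {
      this.inFlight--;
      // Signal the drain waiter once the last in-flight request has finished.
      if (!this.accepting && this.inFlight === 0 && this.onDrained) this.onDrained();
    }
  }

  // Call after a fatal error: stop accepting, resolve once in-flight work is done.
  drain(): Promise<void> {
    this.accepting = false;
    if (this.inFlight === 0) return Promise.resolve();
    return new Promise<void>((resolve) => {
      this.onDrained = () => resolve();
    });
  }
}
```

In a real process, the caller would stop the process only after `drain()` resolves, so the requests that were unlucky enough to share the instance still get their responses.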

How do you tune the per-instance concurrency limit?

The concurrency limit per instance is configurable, and it is the single knob that decides whether Fluid helps or hurts a given function.

Setting the limit to 1 effectively restores the original one-request-per-instance behavior, which is the right call for any CPU-bound or shared-state-heavy handler. Higher limits suit heavily I/O-bound functions that spend most of their time waiting on the network. There is no universal correct number. Finding it requires load-testing your specific function mix and watching both cost and p99 latency, because the two can move in opposite directions as you raise concurrency on a CPU-bound workload.
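One way to turn load-test output into a decision is to treat p99 latency as a hard budget and cost as the thing to minimize within it. The sketch below assumes you have already collected one result per candidate limit; the `LoadTestResult` shape and `pickConcurrencyLimit` are hypothetical, and the selection rule is one reasonable policy rather than a recommendation from Vercel.

```typescript
interface LoadTestResult {
  limit: number;                  // per-instance concurrency limit that was tested
  p99Ms: number;                  // observed p99 latency at that limit
  costPerMillionRequests: number; // observed cost, in whatever unit you bill in
}

// Pick the cheapest tested limit whose p99 stays within the latency budget.
function pickConcurrencyLimit(
  results: LoadTestResult[],
  p99BudgetMs: number,
): LoadTestResult | undefined {
  const withinBudget = results.filter((r) => r.p99Ms <= p99BudgetMs);
  // Cheapest first; on a cost tie, prefer the lower (more conservative) limit.
  withinBudget.sort(
    (a, b) => a.costPerMillionRequests - b.costPerMillionRequests || a.limit - b.limit,
  );
  return withinBudget[0];
}
```

If no tested limit meets the budget, the function returns `undefined`, which is a signal to fix the handler rather than to keep raising the limit.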

The honest framing is that Fluid trades a fixed, safe, expensive isolation model for a tunable, cheaper, contention-prone one. The default is tuned for the common case, but the common case is I/O-bound API handlers, and your workload may not be.

Why are AI and LLM handlers the canonical beneficiary?

The workload that benefits most is the one that spends the most time blocked on a slow external call, and few workloads block longer than LLM inference.

An LLM API call can take seconds to minutes, and under the old model every parallel request that was just waiting on that call sat in its own instance, billing you for idle compute across all of them. Under active-CPU billing with in-function concurrency, those waiting requests share a single instance and stop billing during the wait. A dev.to analysis frames this as the feature’s most cost-relevant use case: GenAI handlers that do nothing but call out and wait are precisely the functions where idle compute dominated the bill.

This is also why Vercel is pushing Fluid so hard in 2026. The compute economics of AI inference over serverless were broken in a specific, fixable way, and in-function concurrency is the fix. The same model that saves a bored API handler a few cents saves an LLM proxy real money, because the idle-time tax scales with how long each call blocks. The new public-beta WebSocket support lands on the same concurrency model, which is a natural fit for handlers that stream tokens back over a long-lived connection.

The takeaway for an operator is narrower than the marketing suggests. Audit your handlers, separate the I/O-bound from the CPU-bound, fix the shared-state bugs the old model was hiding, and tune per function. Fluid is a genuine improvement for the workload it was built for. It is not a free performance upgrade, and treating it as one is the fastest way to ship a latency regression on top of a billing improvement.

Frequently Asked Questions

What concurrency value works for a heavily I/O-bound function?

Autonoma’s analysis pegs roughly 10 concurrent invocations as a reasonable starting point for handlers that spend most of their wall-clock time blocked on the network, versus 1 for any function that mutates module-level state. The safe ceiling moves with the ratio of idle I/O to saturated CPU in your handler, so that starting number has to be load-tested per function.

Can a function spin up excessive usage by calling itself under Fluid?

Fluid ships built-in recursion protection to stop a self-calling function from multiplying its own traffic across the many-to-one instance model. Without it, a recursive handler running concurrently could compound invocations faster than the old one-request-per-instance model ever allowed, turning a loop bug into a large billing event.

What happens to functions on runtimes other than Node.js and Python?

Only Node.js and Python get the optimized concurrency path. A Go, Ruby, or .NET handler still runs under the classic one-request-per-instance model and bills on wall-clock duration rather than active CPU, so the savings Vercel advertises do not reach mixed-language deployments running outside those two runtimes.

How does this compare to AWS Lambda’s provisioned concurrency?

Lambda’s standard model still runs one invocation per execution environment at a time, and its provisioned concurrency is a paid opt-in for keeping execution environments warm, not an invocation-sharing setting. Vercel’s move was structural: the many-to-one model has been on by default for new projects since April 23, 2025, so deployments inherit the shared-state and CPU-contention tradeoffs without configuring anything.

Does Vercel’s advertised concurrency ceiling apply during the public beta?

The launch material describes a many-to-one model that can handle tens of thousands of concurrent invocations, but during the public beta Vercel caps the per-instance limit and raises it gradually because the safe number depends on the workload. Anyone budgeting against the headline ceiling should expect much lower real limits until the beta lifts the cap.