HiBob Runs 2,500 Internal GPTs: OpenAI’s New Enterprise Adoption Metric

Enterprise AI vendors need metrics that make adoption look decisive on a roadshow slide. Custom GPT deployment count is emerging as that metric: it implies active investment, domain-specific configuration, and organizational commitment, without requiring disclosure of actual usage patterns, accuracy rates, or cost efficiency. That makes it effective marketing and dangerous procurement data.

What Raw Deployment Counts Actually Measure

HiBob, the London-based HR-tech company behind the Bob platform, illustrates why the metric deserves scrutiny. The company is valued at nearly $2.7 billion and has raised approximately $574 million in venture funding. According to TechRadar’s April 2026 review, it serves 1.4 million employees across mid-market and enterprise clients. The platform includes custom automation and workflow options, and HiBob has been expanding its AI capabilities through acquisitions, including its 2025 purchase of Mosaic, an AI-driven financial planning platform.

When vendors cite deployment counts for enterprise customers, the number could represent anything from fully custom agent workflows to lightly configured prompt templates. Without the vendor or customer publishing a detailed breakdown, the count is a headline, not a datapoint.

Why OpenAI’s PBC Restructuring Creates Pressure for New Metrics

OpenAI reported $13.1 billion in 2025 revenue and completed its restructuring into a for-profit Public Benefit Corporation the same year. Microsoft holds 27% of the restructured company, and an October 2025 share sale valued OpenAI at $500 billion. The PBC conversion creates pressure for enterprise metrics that demonstrate traction beyond API-call volume.

Seat count is a blunt instrument: two companies with identical ChatGPT seat counts may have wildly different levels of actual usage. API-call volume is better but exposes the commoditized nature of inference. Custom GPT count avoids both problems for the vendor. It signals configuration effort and organizational commitment while revealing nothing about usage patterns, accuracy rates, or cost efficiency.

That is why it works on a slide and fails in a procurement review.

Custom-GPT Sprawl: The Governance Problem Nobody Budgeted For

Deploying thousands of custom GPTs inside a single organization creates a governance surface that most enterprises are not equipped to manage. Two recent papers treat the problem as a practical engineering concern, not a hypothetical one.

SafeGPT (arXiv:2601.06366) proposes two-sided guardrails for enterprise LLM use: input-side detection and redaction of sensitive data, plus output-side moderation to prevent unethical or incorrect responses. The paper’s premise is that enterprises deploying LLMs at volume need automated safety infrastructure, not just policy documents.
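To make the two-sided idea concrete, here is a minimal sketch of an input-redaction step wrapped around an output check. It is not the SafeGPT implementation: the regex patterns, the `BLOCKED_OUTPUT_TERMS` list, and the `guarded_call` helper are all hypothetical and far too narrow for production use. The `model` argument stands in for any function that takes a prompt string and returns a reply string.

```python
import re

# Illustrative patterns only (assumption); real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical output policy: phrases a reply must not contain.
BLOCKED_OUTPUT_TERMS = ("salary of", "medical condition")


def redact_input(prompt):
    """Input side: replace matched PII with a label and report what was found."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        prompt, hits = pattern.subn(f"[{label}]", prompt)
        if hits:
            findings.append(label)
    return prompt, findings


def moderate_output(text):
    """Output side: return True if the reply contains no blocked phrase."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_OUTPUT_TERMS)


def guarded_call(prompt, model):
    """Redact the prompt, call the model, then withhold replies that fail the check."""
    safe_prompt, findings = redact_input(prompt)
    reply = model(safe_prompt)
    if not moderate_output(reply):
        reply = "[response withheld by output policy]"
    return reply, findings
```

The point of the wrapper is that every custom GPT routes through the same two checks, so the guardrail does not depend on each GPT's author remembering the policy.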

The AI Assurance Pyramid (arXiv:2605.23459) argues that enterprise AI failures are categorically different from traditional software bugs. The paper proposes a five-layer testing strategy focused on continuous risk reduction rather than correctness verification. The implication for organizations running thousands of custom GPTs: each GPT is a potential failure mode, and the testing overhead scales with deployment count, not with user count.

What Anthropic and Google Should Counter With

If OpenAI is going to lead with deployment count, Anthropic and Google need equivalent narratives. The absence of comparable case studies is notable. Both companies have enterprise customers deploying AI tools at volume, but neither has publicly foregrounded a raw deployment-count metric.

Anthropic’s enterprise positioning has centered on safety and reliability, not volume. Google’s Gemini enterprise play runs through Workspace integration, where the relevant metric is user activation, not custom-tool count. Both approaches are arguably more honest measures of adoption, but they are harder to compress into a headline number.

The competitive question is whether OpenAI’s metric catches on with CIOs and procurement teams. If GPT count becomes a standard RFP checkbox, Anthropic and Google will need either their own deployment-count stories or a convincing argument for why the metric itself is misleading.

What AI Procurement Teams Should Ask in Their Next Renewal

The shift toward deployment-count metrics changes what matters in a multi-vendor AI procurement conversation.

Specific questions for renewal conversations:

For OpenAI customers: Request per-GPT usage analytics. If thousands of custom GPTs are deployed, how many are active? What is the distribution of usage across the fleet? The difference between median and mean usage per GPT will reveal whether the deployment count reflects real adoption or configuration sprawl.
For Anthropic customers: Ask for deployment-equivalent metrics. How many Claude projects, workbenches, or tool-use configurations are active? The absence of a GPT-equivalent count is not necessarily a weakness, but it will be framed as one if OpenAI’s metric gains traction.
For Google customers: Workspace activation rates are the existing metric. Ask how many custom Gemini agents are deployed and what the per-agent cost structure looks like. Google has the infrastructure to match OpenAI’s metric if it chooses to.
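
The median-versus-mean check in the OpenAI question can be run on any per-GPT usage export. The sketch below assumes a hypothetical mapping of GPT name to invocation count over a reporting window; the names, the numbers, and the `active_threshold` cutoff are made up for illustration and do not come from any vendor's analytics format.

```python
from statistics import mean, median


def summarize_fleet_usage(calls_per_gpt, active_threshold=1):
    """Summarize how usage is spread across a fleet of custom GPTs.

    calls_per_gpt: hypothetical mapping of GPT name -> invocation count.
    active_threshold: minimum calls for a GPT to count as active (assumption).
    """
    counts = sorted(calls_per_gpt.values(), reverse=True)
    if not counts:
        raise ValueError("usage export is empty")
    total = sum(counts)
    top_n = max(1, len(counts) // 10)  # busiest 10% of GPTs, at least one
    return {
        "deployed": len(counts),
        "active": sum(1 for c in counts if c >= active_threshold),
        "mean_calls": mean(counts),
        "median_calls": median(counts),
        "top_decile_share": sum(counts[:top_n]) / total if total else 0.0,
    }


# Made-up fleet: ten GPTs, one heavily used, six never invoked.
example = {
    "policy-qa": 900, "onboarding": 60, "payroll-faq": 30, "offer-letters": 10,
    "draft-1": 0, "draft-2": 0, "draft-3": 0, "test-a": 0, "test-b": 0, "test-c": 0,
}
stats = summarize_fleet_usage(example)
print(stats)
```

In this made-up fleet the mean is 100 calls per GPT while the median is 0, and a single GPT accounts for 90% of all calls. A gap of that shape is the signature of configuration sprawl: the deployment count says ten, the usage data says one.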

The underlying issue is not which vendor has the bigger number. It is whether the number measures anything useful. A large-scale custom-GPT deployment that no one audits, governs, or measures outcomes from is a cost center, not a capability.

Frequently Asked Questions

Does the 2,500-GPT figure mean HiBob uses only OpenAI?

No source confirms HiBob relies exclusively on OpenAI. The Bob platform includes a beta tool called Bob Companion that reads documents and answers policy questions across multiple systems, and HiBob offers per-area AI toggles plus a global off switch. Those controls are consistent with provider-agnostic routing, but they do not establish which model providers sit behind any given feature.

What governance infrastructure should exist before a company allows unrestricted internal GPT creation?

The SafeGPT framework models this as a two-sided problem: input pipelines that detect and redact sensitive data before it reaches the model, and output moderation that filters policy-violating or unsupported responses. Organizations that skip either layer end up with governance theater: written policies that say ‘don’t paste PII into GPTs’ while employees routinely do so because no automated barrier enforces the rule.

How do AI deployment failures differ from conventional software bugs in a large GPT fleet?

The AI Assurance Pyramid paper argues that LLM failures produce plausible-looking outputs that pass superficial quality checks, unlike software crashes that surface immediately. A fleet of 2,500 GPTs can silently generate incorrect policy answers, biased review text, or compliance-violating suggestions without triggering any alert, because nothing ‘breaks’ in the conventional sense.

What specific AI capabilities does HiBob expose beyond custom GPT workflows?

HiBob’s Bob platform offers AI-powered CV summaries, AI-generated job descriptions, and performance-review insights as built-in features. The 2025 Mosaic acquisition also brought financial-planning AI into the platform, giving HiBob tooling across both HR and finance that a raw GPT-deployment count fails to capture.