Text-to-SQL has been the default abstraction for natural-language analytics for years. Standard benchmarks track incremental accuracy gains, and major data platforms have built products around the premise that if the LLM can just write better SQL, the analytics interface problem is solved. A paper published May 20 on arXiv [1] argues the premise is wrong. Not because LLMs write bad SQL, but because SQL is the wrong layer of abstraction for enterprise analytics in the first place.
Why Text-to-SQL breaks down inside enterprises
The standard Text-to-SQL pipeline takes a natural-language question, generates a SQL query, and runs it against a database. That works reasonably well on benchmark datasets. Inside an actual enterprise, the query has to pass through row-level security filters, column-masking policies, audit trails, and lineage tracking before it touches a table. Every one of those checks is typically implemented as a SQL rewrite or a view-layer policy. When an LLM generates raw SQL, it has to either anticipate those rewrites or trust the downstream planner to apply them correctly. Neither is reliable at the edge cases, and the edge cases are where the compliance violations live.
The Analytic Agent paper [1] frames this as a fundamental abstraction mismatch. Enterprise analytics pipelines, the authors argue, already encapsulate business logic behind governed APIs for consistency, auditability, and security. Delegating aggregation and filtering logic to an LLM generating SQL introduces reliability and compliance risks that no amount of benchmark tuning resolves. The SQL generator is solving the wrong problem.
How Analytic Agent routes around SQL
The proposed alternative is straightforward in concept: instead of generating SQL, the LLM translates the user’s natural-language intent into a call against a governed analytics API. The API contract enforces permissions, applies masking, records lineage, and returns structured results. The LLM never sees the underlying schema and never writes a query.
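To make the idea of a governed API contract concrete, here is a minimal sketch in Python. It is not the paper's implementation: every name in it (`User`, `GovernedEndpoint`, `revenue_by_region`, the sample rows) is a hypothetical stand-in, and the in-memory list stands in for a real data store. It only illustrates the order of operations described above: permission check, row scoping, column masking, lineage record.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names throughout: none of these come from the paper.


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset[str]
    regions: frozenset[str]  # rows this user is allowed to see


@dataclass
class GovernedEndpoint:
    name: str
    required_role: str
    masked_fields: frozenset[str]
    handler: Callable[[User], list[dict]]
    lineage: list[dict] = field(default_factory=list)

    def call(self, user: User) -> list[dict]:
        # 1. Permission check happens at the boundary, before any data access.
        if self.required_role not in user.roles:
            raise PermissionError(f"{user.name} may not call {self.name}")
        # 2. The handler applies row-level scoping; the caller cannot skip it.
        rows = self.handler(user)
        # 3. Column masking is applied on the way out.
        masked = [
            {k: ("***" if k in self.masked_fields else v) for k, v in row.items()}
            for row in rows
        ]
        # 4. A lineage record is written for every successful call.
        self.lineage.append(
            {"endpoint": self.name, "user": user.name, "rows": len(masked)}
        )
        return masked


# Illustrative data only.
_SALES = [
    {"region": "EMEA", "revenue": 120, "account_owner": "a.khan"},
    {"region": "APAC", "revenue": 80, "account_owner": "j.lee"},
]


def _revenue_rows(user: User) -> list[dict]:
    # Row-level security: only regions in the caller's scope are returned.
    return [dict(r) for r in _SALES if r["region"] in user.regions]


revenue_by_region = GovernedEndpoint(
    name="revenue_by_region",
    required_role="analyst",
    masked_fields=frozenset({"account_owner"}),
    handler=_revenue_rows,
)

analyst = User("dana", frozenset({"analyst"}), frozenset({"EMEA"}))
result = revenue_by_region.call(analyst)
print(result)
```

In this sketch the caller, whether a person or an LLM, only ever names an endpoint. The table layout, the masking rule, and the scoping filter stay on the far side of `call`.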
The system uses multi-step reasoning to interpret the user’s goal, validate permissions against policy, execute the appropriate governed API call, and generate compliant visualizations. On 90 real enterprise use cases constructed by domain experts, the authors report reliable performance across this pipeline [1]. The specifics of “reliable” matter here: the abstract does not disclose accuracy percentages, error rates, or failure-mode breakdowns, so the claim remains unquantified at this stage. The paper is also a preprint with no evidence of peer review as of May 2026.
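The four stages named above can be sketched as a plain function pipeline. This is an illustration under stated assumptions, not the authors' design: the `interpret` step is a keyword rule standing in for the LLM, `POLICY` is a made-up role table, and `execute` returns canned rows where a real system would call a governed API.

```python
from dataclasses import dataclass

# Hypothetical policy table: endpoint name -> roles allowed to call it.
POLICY = {"revenue_by_region": {"analyst", "finance"}}


@dataclass(frozen=True)
class Intent:
    endpoint: str
    params: dict


def interpret(question: str) -> Intent:
    """Stand-in for the LLM step; a keyword rule, purely illustrative."""
    if "revenue" in question.lower():
        return Intent("revenue_by_region", {"year": 2025})
    raise ValueError("no governed endpoint matches this question")


def validate(intent: Intent, roles: set[str]) -> None:
    """Check the caller's roles against policy before anything executes."""
    allowed = POLICY.get(intent.endpoint, set())
    if not allowed & roles:
        raise PermissionError(f"roles {sorted(roles)} may not call {intent.endpoint}")


def execute(intent: Intent) -> list[dict]:
    """Stub for the governed API call; returns canned rows."""
    return [
        {"region": "EMEA", "revenue": 120},
        {"region": "APAC", "revenue": 80},
    ]


def visualize(rows: list[dict]) -> dict:
    """Build a chart spec from structured results only, never from raw tables."""
    return {
        "type": "bar",
        "x": [r["region"] for r in rows],
        "y": [r["revenue"] for r in rows],
    }


def answer(question: str, roles: set[str]) -> dict:
    intent = interpret(question)  # 1. interpret the goal
    validate(intent, roles)       # 2. validate permissions against policy
    rows = execute(intent)        # 3. execute the governed call
    return visualize(rows)        # 4. generate the visualization


chart = answer("What was revenue by region last year?", {"analyst"})
print(chart)
```

The point of ordering the steps this way is that a denied request fails at step 2, before any data is fetched.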
The governance shift: from SQL rewrite to API contract
The structural argument here is independent of whether Analytic Agent itself works at production grade. Row-level security, column masking, and lineage tracking currently live in SQL rewrite layers, applied per-query. This means every new query path, every new table, every schema migration is a potential governance gap. The rewrite rules have to be maintained in lockstep with the data model, and the LLM generating the SQL has no visibility into them.
Moving these checks to an API contract layer changes the failure mode. Instead of trusting the SQL generator to produce a query that the planner will safely rewrite, the API enforces governance at the boundary. The LLM’s job narrows to intent parsing and result presentation. The API surface becomes the governance choke point, and it is a stable one: it changes when the business logic changes, not when someone adds a column to a table.
For data-platform teams, this reframes where to invest effort. Building and maintaining a well-documented governed API catalog, with security and lineage baked into the contract, may yield more reliable natural-language analytics than chasing incremental Text-to-SQL accuracy on benchmarks. The moat is the API catalog, not the SQL generator.
What this means for the current vendor landscape
Major cloud data platforms and open-source tools converge on the same model: generate SQL, run it, return results. All of them share the architectural assumption that SQL is the right target for the LLM’s output.
The Analytic Agent paper does not directly test these systems or demonstrate specific governance failures in any particular implementation. What it does is identify a structural limitation that applies to the entire category: any system that generates raw SQL inherits the governance gaps of per-query SQL rewrites, and those gaps are hard to close from the SQL side. The vendors that recognize this first, and start exposing governed API surfaces as the analytics interface instead of raw SQL, will have a structural advantage over those that keep optimizing the SQL generator.
The May 2026 agentic architecture wave
The Analytic Agent paper is not an isolated contribution. May 2026 has produced a cluster of papers proposing multi-step, policy-aware agent architectures across different domains, all moving away from single-shot LLM tool use.
DDS (Declarative Data Services) [2] demonstrates that unbounded agentic discovery, where a coding agent iterates on failure-log feedback without structural constraints, fails to converge on a working data-system stack. The fix: structured typed contracts that decompose the search into bounded sub-problems. The parallels to Analytic Agent are direct: both papers argue that governed contracts, not unconstrained LLM autonomy, are what make agentic systems reliable in production.
Agentic Agile-V [3] shifts the focus to process control, arguing that the central problem for agentic systems is no longer prompt engineering but engineering discipline in permission gating and dependency handling. The paper finds persistent failures in autonomous code generation when these controls are absent.
HANA [4] proposes a hierarchical agent-native network architecture, reporting an 86% MTTR reduction. The common thread: multi-step orchestration with explicit policy boundaries, not single-shot generation.
What the paper gets right and what it leaves out
The core insight, that SQL is the wrong abstraction for governed enterprise analytics and that API contracts are a better boundary, is well-argued and directionally correct. The architecture is clear. The problem it identifies is real, and every Text-to-SQL vendor will eventually have to address it, either by moving up the stack to governed APIs or by making their SQL rewrite layers robust enough that the distinction stops mattering.
What the paper does not provide, at least in its abstract, is an empirical comparison. No accuracy baseline against Text-to-SQL systems on the same use cases. No failure-mode taxonomy. No measurement of where the governed-API approach itself breaks down. The 90-use-case evaluation establishes feasibility, not superiority. That is a reasonable scope for a preprint, but it means the paper is currently an architectural argument with a proof-of-concept, not an empirical result.
The governance-vs-SQL tradeoff is also not new. Data engineers have been building semantic layers and governed API surfaces for years. The paper’s contribution is the agentic orchestration layer on top of those surfaces, not the observation that APIs are safer than raw SQL. The distinction matters because it determines what subsequent work should evaluate: not whether governed APIs are better than SQL (they are, for compliance-heavy environments), but whether LLM-based intent parsing and API orchestration is reliable enough to replace the current Text-to-SQL tooling in production.
Frequently Asked Questions
Does this approach work for teams that don’t already have governed analytics APIs?
No. The architecture presupposes a mature API layer encoding row-level security, column masking, and lineage into contract surfaces. The paper’s 90-use-case evaluation was run against existing enterprise API infrastructure with domain-expert-constructed scenarios, not greenfield deployments. Teams without that layer face a significant upfront build before the agentic orchestration adds any value.
How does this differ from LangChain or LlamaIndex pipelines for analytics?
LangChain and LlamaIndex target retrieval-augmented generation (fetching and synthesizing data with LLMs) but treat governance as an external concern. Analytic Agent makes the governance contract the primary interface: the LLM never sees the schema and never writes a query. The distinction matters because retrieval-focused frameworks can bypass row-level security or column masking unless those policies are separately enforced downstream.
What happens when the API catalog doesn’t cover a user’s intent?
The paper doesn’t address this, but the parallel DDS finding is instructive: unbounded agentic discovery against data systems fails to converge when the agent can’t find a matching contract, requiring structured typed boundaries to decompose the problem. A stale or incomplete API catalog would leave Analytic Agent with no reliable fallback, since the architecture deliberately removes raw SQL access as an escape hatch.
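Since the paper does not say how a catalog miss should be handled, the following is only one possible shape for it, sketched in Python with hypothetical names (`CATALOG`, `Resolution`, `resolve`). It assumes the system should return an explicit "not covered" result and log the gap for catalog owners, rather than fall back to generated SQL.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog: endpoint name -> human-readable description.
CATALOG = {
    "revenue_by_region": "Total revenue grouped by sales region",
    "churn_by_month": "Customer churn rate per calendar month",
}


@dataclass(frozen=True)
class Resolution:
    endpoint: Optional[str]  # None means the catalog has no match
    message: str


def resolve(requested: str, gaps: list[str]) -> Resolution:
    """Return the matching endpoint, or record the miss and say so explicitly."""
    if requested in CATALOG:
        return Resolution(requested, CATALOG[requested])
    # No fallback to raw SQL: the miss is logged so the catalog can be extended.
    gaps.append(requested)
    return Resolution(
        None,
        f"No governed endpoint covers '{requested}'; request logged for catalog owners.",
    )


gap_log: list[str] = []
hit = resolve("revenue_by_region", gap_log)
miss = resolve("inventory_turnover", gap_log)
print(hit, miss, gap_log, sep="\n")
```

Whether a refusal of this kind is acceptable to users is exactly the open question the paper leaves unanswered.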
Which existing benchmarks would this architecture make irrelevant?
BIRD and Spider, the two dominant Text-to-SQL leaderboards, measure how accurately an LLM generates SQL against a known schema. A governed-API architecture sidesteps that entire measurement surface because the model never produces SQL. New evaluation criteria would need to focus on intent-interpretation accuracy, policy-enforcement completeness, and API-selection correctness rather than query-generation fidelity.
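As a rough illustration of what scoring along those three axes might look like, here is a small Python sketch. The `Case` record and the metric definitions are assumptions made for this example, not an established benchmark format: each metric is simply the fraction of labeled cases where the system's behavior matched the expected behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labeled evaluation case (hypothetical schema)."""
    expected_endpoint: str
    chosen_endpoint: str
    expected_params: dict
    chosen_params: dict
    should_be_denied: bool
    was_denied: bool


def score(cases: list[Case]) -> dict[str, float]:
    if not cases:
        raise ValueError("need at least one case to score")
    n = len(cases)
    return {
        # Did the system pick the right governed endpoint?
        "api_selection": sum(c.expected_endpoint == c.chosen_endpoint for c in cases) / n,
        # Did it extract the right parameters from the question?
        "intent_params": sum(c.expected_params == c.chosen_params for c in cases) / n,
        # Did allow/deny outcomes match what policy requires?
        "policy_enforcement": sum(c.should_be_denied == c.was_denied for c in cases) / n,
    }


cases = [
    Case("revenue_by_region", "revenue_by_region", {"year": 2025}, {"year": 2025}, False, False),
    Case("revenue_by_region", "churn_by_month", {"year": 2025}, {"year": 2024}, False, False),
    Case("churn_by_month", "churn_by_month", {"month": 3}, {"month": 4}, True, True),
    Case("churn_by_month", "churn_by_month", {"month": 3}, {"month": 3}, True, False),
]
metrics = score(cases)
print(metrics)
```

A real harness would likely weight a missed denial far more heavily than a wrong chart, since only the former is a compliance failure; the equal weighting here is for simplicity.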
Footnotes
[1] Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs
[2] Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
[3] Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development
[4] From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)