Text-to-SQL technology has reached a tipping point where non-technical users can now query complex databases using plain English. Specialized models like Defog’s SQLCoder-70b achieve 96% accuracy on standard benchmarks, outperforming even GPT-4 on SQL generation tasks. This shift democratizes data access across organizations, eliminating the bottleneck of SQL expertise while maintaining enterprise-grade security and audit capabilities.

What Is Text-to-SQL?

Text-to-SQL is an AI technology that converts natural language questions into executable SQL database queries. Instead of writing complex syntax like SELECT customer_name, SUM(order_total) FROM orders WHERE order_date >= '2025-01-01' GROUP BY customer_name ORDER BY SUM(order_total) DESC, a user simply asks, “Which customers spent the most money this year?”

The technology leverages large language models (LLMs) trained specifically on SQL syntax, database schemas, and the relationship between natural language and structured query language. Unlike general-purpose AI assistants, text-to-SQL models are fine-tuned on thousands of question-query pairs to understand database structures, table relationships, and aggregation functions.

According to Defog, which developed the open-source SQLCoder model family, their training dataset consists of over 20,000 human-curated questions based on 10 different database schemas (Defog GitHub, 2024). None of the schemas in the training data were included in the evaluation framework, ensuring that benchmark results reflect genuine generalization capabilities rather than memorization.

How Does Text-to-SQL Work?

Modern text-to-SQL systems employ several architectural approaches to bridge natural language and database queries:

Retrieval-Augmented Generation (RAG)

Most production text-to-SQL implementations use RAG to provide the model with relevant schema context. When a user asks a question, the system first retrieves metadata about relevant tables, columns, and relationships from the database catalog. This context, combined with the user’s question, forms the prompt sent to the language model.
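
A minimal sketch of that prompt-assembly step is shown below. The helper function, prompt template, and table definitions are illustrative placeholders rather than any particular framework's API; they simply show how retrieved schema context and the user's question are combined.

```python
# Minimal sketch of RAG-style prompt assembly for text-to-SQL; the retrieval step
# is assumed to have already selected the relevant DDL snippets.

def build_prompt(question: str, relevant_ddl: list[str]) -> str:
    """Combine the user's question with the retrieved table definitions."""
    schema_context = "\n\n".join(relevant_ddl)
    return (
        "### Task\n"
        f"Generate a SQL query that answers this question: {question}\n\n"
        "### Database Schema\n"
        f"{schema_context}\n\n"
        "### SQL\n"
    )

# Only the tables retrieved as relevant end up in the prompt.
ddl_snippets = [
    "CREATE TABLE orders (id INT, customer_id INT, order_total NUMERIC, order_date DATE);",
    "CREATE TABLE customers (id INT, customer_name TEXT);",
]
print(build_prompt("Which customers spent the most money this year?", ddl_snippets))
```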

The Vanna AI framework, for example, uses “Agentic Retrieval” to dynamically fetch schema information before SQL generation (Vanna AI GitHub, 2024). This approach ensures the model only receives relevant table definitions rather than overwhelming it with an entire database schema.
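
In code, a typical Vanna setup pairs a vector store with an LLM, trains on schema DDL, and then generates SQL on demand. The sketch below follows the pattern in Vanna's published documentation; exact module paths and method names vary between releases (and may differ in Vanna 2.0), so treat it as a rough outline rather than a drop-in script.

```python
# Hedged sketch based on Vanna's documented mixin pattern; verify module paths
# and method names against the version you install.
from vanna.openai import OpenAI_Chat
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-4o"})
vn.connect_to_sqlite("sales.db")  # placeholder database path

# Training stores schema context that retrieval can fetch later.
vn.train(ddl="CREATE TABLE orders (id INT, customer_id INT, order_total NUMERIC, order_date DATE);")

sql = vn.generate_sql("Which customers spent the most money this year?")
print(sql)
```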

Fine-Tuned Specialized Models

While general-purpose LLMs like GPT-4 can generate SQL, specialized models demonstrate significantly higher accuracy. Defog’s SQLCoder family includes variants ranging from 7B to 70B parameters, each optimized specifically for SQL generation tasks.

SQLCoder models are fine-tuned from open Llama-family base models using domain-specific training data. The SQLCoder-70b model achieves 96% accuracy on date queries, 97.1% on ORDER BY operations, and 85.7% on ratio calculations—metrics that exceed GPT-4’s performance across most categories (Defog SQLCoder GitHub, 2024).
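
As a concrete illustration, a smaller SQLCoder checkpoint can be run locally with the Hugging Face transformers library. The sketch below assumes the defog/sqlcoder-7b-2 checkpoint and uses a simplified prompt; check the model card for the exact prompt template Defog recommends.

```python
# Hedged sketch: load a SQLCoder checkpoint from Hugging Face and generate SQL.
# The prompt format here is simplified; see the model card for Defog's template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "defog/sqlcoder-7b-2"  # smaller variant; larger models follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = """### Task
Generate a SQL query that answers this question: Which customers spent the most money this year?

### Database Schema
CREATE TABLE orders (id INT, customer_id INT, order_total NUMERIC, order_date DATE);
CREATE TABLE customers (id INT, customer_name TEXT);

### SQL
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens (the SQL), not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```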

Agent-Based Execution

Advanced implementations like DB-GPT and Vanna 2.0 use agent architectures that go beyond simple query generation. These systems:

  • Verify generated SQL for syntax correctness before execution
  • Apply row-level security filters based on user permissions
  • Generate natural language summaries of query results
  • Produce visualizations alongside tabular data

Vanna 2.0, released in late 2024, introduces “User-Aware at Every Layer” architecture where queries are automatically filtered per user permissions (Vanna AI GitHub, 2024). This addresses the critical enterprise requirement that users should only see data they’re authorized to access.
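
Below is a hedged sketch of what such an agent-style guardrail might look like: it validates the generated SQL, restricts execution to read-only SELECT statements, and appends a per-user row filter before running the query. The table names, the region column, and the permission model are illustrative, and sqlglot stands in for whatever SQL parser a given framework uses.

```python
# Hedged sketch of an agent-style guardrail: validate, restrict, filter, execute.
# Table names, the 'region' column, and the permission scheme are illustrative.
import sqlite3
import sqlglot
from sqlglot import exp

def run_guarded_query(generated_sql: str, user_region: str, conn: sqlite3.Connection):
    # 1. Verify syntax before anything reaches the database.
    try:
        tree = sqlglot.parse_one(generated_sql, read="sqlite")
    except sqlglot.errors.ParseError as err:
        raise ValueError(f"Model produced invalid SQL: {err}")

    # 2. Reject anything that is not a read-only SELECT.
    if not isinstance(tree, exp.Select):
        raise ValueError("Only SELECT statements are allowed")

    # 3. Row-level security: append a filter derived from the user's permissions.
    #    (user_region comes from the application session, not from user input.)
    tree = tree.where(f"region = '{user_region}'")

    # 4. Execute the rewritten query; results can then be summarized or plotted.
    return conn.execute(tree.sql(dialect="sqlite")).fetchall()
```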

Accuracy Benchmarks: The Data Behind the Hype

The SQL-Eval framework, developed by Defog and open-sourced in 2024, provides standardized methodology for measuring text-to-SQL accuracy. The evaluation compares generated queries against “gold standard” queries by executing both and comparing result dataframes using exact and subset matching.
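
The sketch below illustrates the idea behind result-based evaluation (it is not Defog's implementation): both queries are executed against the same database, and the resulting dataframes are compared for an exact match or a column-subset match.

```python
# Hedged sketch of result-based evaluation in the spirit of SQL-Eval.
import pandas as pd

def exact_match(generated_df: pd.DataFrame, gold_df: pd.DataFrame) -> bool:
    """True if both queries return the same values (row order and column names ignored)."""
    if generated_df.shape != gold_df.shape:
        return False
    g = generated_df.sort_values(by=list(generated_df.columns)).to_numpy()
    e = gold_df.sort_values(by=list(gold_df.columns)).to_numpy()
    return bool((g == e).all())

def subset_match(generated_df: pd.DataFrame, gold_df: pd.DataFrame) -> bool:
    """True if every column of the gold result appears somewhere in the generated result."""
    gold_cols = [gold_df[c].sort_values().to_list() for c in gold_df.columns]
    gen_cols = [generated_df[c].sort_values().to_list() for c in generated_df.columns]
    return all(col in gen_cols for col in gold_cols)
```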

Model Performance Comparison

The following table summarizes performance across different SQL operation categories, based on Defog’s published benchmarks:

| Model | Date Queries | GROUP BY | ORDER BY | Ratios | JOINs | WHERE Clauses |
|---|---|---|---|---|---|---|
| SQLCoder-70b | 96% | 91.4% | 97.1% | 85.7% | 97.1% | 91.4% |
| SQLCoder-7b-v2 | 96% | 91.4% | 94.3% | 91.4% | 94.3% | 77.1% |
| SQLCoder-34b | 80% | 94.3% | 85.7% | 77.1% | 85.7% | 80% |
| GPT-4 | 72% | 94.3% | 97.1% | 80% | 91.4% | 80% |
| GPT-4-turbo | 76% | 91.4% | 91.4% | 62.8% | 88.6% | 77.1% |
| GPT-3.5 | 72% | 77.1% | 82.8% | 34.3% | 65.7% | 71.4% |
| Claude-2 | 52% | 71.4% | 74.3% | 57.1% | 65.7% | 62.9% |

Source: Defog SQLCoder GitHub repository benchmark results, 2024

Several patterns emerge from this data:

  • Specialized models outperform generalists: SQLCoder-70b exceeds GPT-4 across most categories, with particularly strong performance on complex operations like JOINs (97.1% vs 91.4%).

  • Size matters, but efficiency is achievable: The SQLCoder-7b-v2 model, with just 7 billion parameters, achieves competitive accuracy while requiring significantly fewer computational resources than the 70B variant.

  • Ratio calculations remain challenging: All models show lower accuracy on ratio/rate calculations, with GPT-4-turbo dropping to 62.8%—indicating that division operations and percentage calculations remain difficult for current architectures.

Spider Dataset and Academic Benchmarks

Academic research uses the Spider dataset, a collection of 10,181 natural language questions and SQL queries across 200 databases. DB-GPT, an open-source AI-native data application framework, reports achieving 82.5% accuracy on the Spider dataset after fine-tuning (DB-GPT GitHub, 2024).

The DB-GPT framework supports fine-tuning with multiple methods including LoRA, QLoRA, and P-tuning, making it accessible for organizations to train models on their specific database schemas.
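
For context, the sketch below shows what a LoRA configuration looks like with the Hugging Face PEFT library. The hyperparameters and base model ID are assumptions for illustration, not DB-GPT's defaults; the point is that only a small adapter is trained on top of the frozen base model.

```python
# Hedged sketch of a LoRA setup with Hugging Face PEFT; values are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```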

Implementation Approaches

Organizations looking to implement text-to-SQL have several architectural options:

Option 1: API-Based Services

Services like Defog provide hosted text-to-SQL capabilities with enterprise features including audit logging, user authentication, and data privacy controls. These solutions require minimal infrastructure but involve ongoing API costs and data transmission considerations.

Option 2: Open-Source Frameworks

Frameworks like Vanna and DB-GPT allow self-hosted deployment, providing full control over data and model selection. Vanna 2.0 supports integration with any LLM provider including OpenAI, Anthropic, Ollama, Azure, Google Gemini, AWS Bedrock, and Mistral (Vanna AI GitHub, 2024).

Self-hosted implementations require:

  • GPU resources for model inference (20GB+ VRAM for SQLCoder-34b)
  • Integration with existing authentication systems
  • Database connectivity and schema indexing (see the sketch below)
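
The schema indexing step, for instance, can start from the database's own catalog. The sketch below pulls column metadata from PostgreSQL's information_schema and turns it into per-table snippets that a retriever can embed; the connection string is a placeholder.

```python
# Hedged sketch of schema indexing for a self-hosted deployment (PostgreSQL assumed).
import psycopg2

conn = psycopg2.connect("dbname=analytics user=readonly")  # placeholder DSN
cur = conn.cursor()
cur.execute(
    """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
    """
)

schema_docs: dict[str, list[str]] = {}
for table, column, dtype in cur.fetchall():
    schema_docs.setdefault(table, []).append(f"{column} {dtype}")

# One snippet per table, ready to embed and store in a vector index.
snippets = [f"TABLE {t} ({', '.join(cols)})" for t, cols in schema_docs.items()]
```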

Why Does Text-to-SQL Matter?

The business impact of accessible database querying extends beyond convenience:

Democratizing Data Access

Traditional business intelligence requires either SQL expertise or pre-built dashboards. Text-to-SQL enables ad-hoc exploration, allowing analysts to follow curiosity without engineering bottlenecks. As noted by Or Hiltch, Chief Data and AI Architect at JLL: “I have tried pretty much every NL -> SQL model out there, and Defog’s model is by far the best. It’s one of the only examples I know of more broadly that has been able to achieve GPT4-level results” (Defog.ai, 2024).

Reducing Time-to-Insight

Conversational interfaces reduce the time from question to answer from hours (waiting for someone to write and run the SQL) to seconds. This acceleration compounds across organizations where hundreds of employees interact with data daily.

Enterprise Security Integration

Modern text-to-SQL frameworks address enterprise requirements that early prototypes ignored. Vanna 2.0’s architecture demonstrates user-aware query filtering, audit logging, and rate limiting as first-class features (Vanna AI GitHub, 2024). Row-level security ensures users can only query data they’re authorized to see, while lifecycle hooks enable custom quota checking and content filtering.
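
As a rough illustration of the lifecycle-hook idea (not Vanna 2.0's actual API), a pre-execution hook might enforce a per-user quota and a simple content filter before any SQL runs:

```python
# Hedged sketch of a pre-execution lifecycle hook; the hook name, quota store,
# and restricted-table check are illustrative placeholders.
from collections import defaultdict

DAILY_LIMIT = 200
_query_counts: dict[str, int] = defaultdict(int)

def before_query_hook(user_id: str, sql: str) -> None:
    """Raise if the user is over quota or the query touches a restricted table."""
    if _query_counts[user_id] >= DAILY_LIMIT:
        raise PermissionError(f"{user_id} exceeded the daily query quota")
    if "payroll" in sql.lower():  # placeholder content filter
        raise PermissionError("Query touches a restricted table")
    _query_counts[user_id] += 1
```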

Cost Efficiency

Open-source specialized models like SQLCoder provide GPT-4-level accuracy at a fraction of the cost. Organizations can run SQLCoder-7b on consumer GPUs (RTX 4090, Apple M2 Pro with 20GB+ memory) without per-query API fees (Defog SQLCoder GitHub, 2024).

Challenges and Limitations

Despite significant progress, text-to-SQL technology faces ongoing challenges:

Schema Complexity

Models struggle with databases containing hundreds of tables or complex many-to-many relationships. Metadata pruning—selecting only relevant tables for the prompt—remains an active research area.
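
One common pruning approach embeds each table description and keeps only the tables most similar to the question. A minimal sketch, assuming a generic sentence-embedding model:

```python
# Hedged sketch of embedding-based metadata pruning; the model name is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def prune_tables(question: str, table_descriptions: dict[str, str], k: int = 5) -> list[str]:
    """Return the k table names whose descriptions are most similar to the question."""
    names = list(table_descriptions)
    table_vecs = model.encode([table_descriptions[n] for n in names], normalize_embeddings=True)
    question_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = table_vecs @ question_vec          # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [names[i] for i in top]
```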

Ambiguity Resolution

Natural language is inherently ambiguous. “Show me sales for last year” could refer to calendar year, fiscal year, or trailing 12 months. Production systems require clarification workflows or pre-defined business logic.
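
In practice, the business-logic route often means mapping ambiguous phrases onto pre-defined date ranges. A small sketch, with the fiscal-year start as an assumption:

```python
# Hedged sketch of pre-defined business logic for the phrase "last year".
from datetime import date, timedelta
from typing import Optional

FISCAL_YEAR_START_MONTH = 7  # assumption: fiscal year begins July 1

def resolve_last_year(interpretation: str, today: Optional[date] = None) -> tuple[date, date]:
    """Map 'last year' onto a concrete date range under a chosen interpretation."""
    today = today or date.today()
    if interpretation == "calendar":
        return date(today.year - 1, 1, 1), date(today.year - 1, 12, 31)
    if interpretation == "fiscal":
        start_year = today.year - 1 if today.month >= FISCAL_YEAR_START_MONTH else today.year - 2
        start = date(start_year, FISCAL_YEAR_START_MONTH, 1)
        return start, date(start_year + 1, FISCAL_YEAR_START_MONTH, 1) - timedelta(days=1)
    # default: trailing twelve months
    return today - timedelta(days=365), today
```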

Data Privacy Considerations

While text-to-SQL reduces the need for data analysts to access raw data, the AI model itself processes schema information. Organizations must evaluate whether sending database structures to third-party APIs complies with data governance policies.

Frequently Asked Questions

Q: How accurate is text-to-SQL technology as of 2026? A: At time of writing, specialized models like SQLCoder-70b achieve 96% accuracy on standard benchmarks, exceeding GPT-4’s performance across most query categories. However, accuracy varies significantly by query complexity—simple SELECT statements approach 99% accuracy while complex ratio calculations remain challenging at 85-91%.

Q: Can text-to-SQL work with my existing database? A: Yes, modern frameworks support PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, SQLite, Oracle, SQL Server, DuckDB, and ClickHouse. The primary requirement is schema access for context retrieval.

Q: Is my data safe when using text-to-SQL tools? A: Self-hosted solutions like Vanna and SQLCoder keep all data within your infrastructure. API-based solutions vary—Defog emphasizes that “your data is never shared with anyone, including with our AI model” (Defog.ai, 2024). Always verify privacy policies and consider schema-only vs. data-access implementations.

Q: What technical resources are required for self-hosted text-to-SQL? A: SQLCoder-34b requires a 4xA10 GPU or consumer GPU with 20GB+ VRAM (RTX 4090, Apple M2 Pro/Max/Ultra with 20GB+ memory). SQLCoder-7b runs on significantly less hardware. CPU inference is possible but slower.
