AI Engineering ← Writing

SQL agents that don't hallucinate: patterns for production text-to-SQL

Text-to-SQL looks magical in a demo and falls over on a real warehouse. Here are the schema retrieval, query validation, database role and semantic layer patterns that keep production SQL agents honest.

By Lekke Logic Published 2026-05-29 Read 12 minutes Tags Text-to-SQL · AI Agents · Data Engineering

"Just point an LLM at the database and let people ask questions." Every internal data product proposal starts there in 2026, and a depressing number ship before anyone has thought about what happens when the model invents a column, joins tables that should not be joined, or locks production for nine minutes.

Text-to-SQL is not a hard problem in the demo. It is hard in production, where the schema has 3000 columns, the data is dirty, the dialect is BigQuery for one team and Snowflake for another, and a single confidently wrong answer ends up in a board pack. This post is about the patterns that separate the demos from the systems that survive real users.

Why text-to-SQL looks magical in demos and breaks in production

The benchmarks tell the story. On Spider 1.0, the academic text-to-SQL benchmark, GPT-4o reaches 86.6% execution accuracy. On Spider 2.0, the enterprise benchmark published as an ICLR 2025 Oral that uses real BigQuery, Snowflake and SQLite warehouses with more than 3000 columns per database, the same model originally scored only 10.1%. That is the difference between a feature that ships and a feature that lies to your CFO.

The gap is not the model. It is the schema. Academic benchmarks use small, clean, single-dialect databases with descriptive column names. Real warehouses have legacy naming, multiple sources of truth, soft-deleted rows, and join keys nobody has documented in five years. BIRD-SQL, the dirty real-world benchmark, gives the upper bound: human experts hit 92.96% execution accuracy on the test set across 12,751 question-SQL pairs from 95 large databases spanning 37 professional domains. Current LLM agents trail that. A production text-to-SQL agent has to be engineered with the expectation that the model will frequently get the schema wrong. Everything downstream of the LLM call exists to catch those mistakes.

Schema retrieval: semantic search over tables, columns and sample rows

The first thing that breaks at scale is the prompt. You cannot fit a 3000-column schema into context, and even if you could, the model would not use it well. The pattern that works is retrieval-augmented generation over the schema itself: embed every table description, column name, column comment, and a handful of sample rows into a vector store. At query time, embed the question, retrieve the top-k most relevant tables and columns, and pass only that subset into the prompt. Vanna AI is the canonical open-source reference, a Python framework that uses RAG over DDL, documentation and prior question-SQL pairs stored in a vector store such as Qdrant.

Two things to bolt on. First, store and retrieve verified question-SQL pairs as part of the corpus. When an analyst writes a query that worked, save it. When a user upvotes an answer, save it. Those become high-signal few-shot examples. Second, index sample values, not just column names. "Customer status" might mean "active, churned, dormant" in one table and "1, 2, 3" in another, and the model needs to see those before it writes a WHERE clause.

Query validation: parse, EXPLAIN and dry-run before you ever execute

Never run the model's first output. Treat the generated SQL as untrusted input that has to pass three gates before it touches the warehouse.

Gate one is syntactic. Parse with a dialect-aware parser (sqlglot, or your engine's native parser). If it does not parse, feed the error back and ask the model to fix it. Most one-shot failures are caught here for the price of one extra LLM call.

Gate two is semantic. Run EXPLAIN. It catches unknown tables, unknown columns, type mismatches and ambiguous joins without executing anything. Feed errors back to the model. This is the single highest-leverage validation step you can add: it costs nothing and catches the failure mode that does the most damage, confidently generated SQL referencing columns the model imagined.

Gate three is shape. Dry-run with a LIMIT or against a sampled copy. A question like "how many customers signed up last week?" should return one row with one numeric column. If it returns 4000 rows, the agent built the wrong query.

Locking the database down: read-only roles, row level security and allow-lists

Real safety comes from the database role, not the prompt. No "you must not run DELETE" in a system prompt will save you from a prompt injection that talks the model into doing it. The defence is that the role connected to the agent has no DELETE privilege at all.

PostgreSQL ships a predefined role pg_read_all_data that grants SELECT on every table, view and sequence plus USAGE on schemas, without bypassing row level security. That is your baseline read-only agent role. Layer on a statement timeout, a connection-level row limit, and an allow-list of schemas.

For multi-tenant data, row level security is the right primitive. Postgres RLS policies are enforced even for users with broad SELECT privileges, unless the role is explicitly granted BYPASSRLS. You pin every agent query to a tenant_id at the session level, the policy filters every row, and the agent physically cannot see another tenant's data even if it tries.

This matters more than it sounds. OWASP placed prompt injection at the number one spot in the OWASP Top 10 for LLM Applications 2025, warning that broad database, API and code execution access amplifies the blast radius of a single injected instruction. Keysight's research on database-query-based prompt injection shows attackers manipulating LLM agents into generating staged SQL probes and exfiltration queries, chaining prompt injection with SQL injection. The mitigation is not better prompts. It is a database role with no privileges to abuse.

Result sanity checks and observability for silent failures

The most dangerous failure mode is the one where everything succeeds: SQL parses, EXPLAIN passes, the query executes, a number comes back, and the number is wrong. Silent failures are how text-to-SQL ends up in incident reviews.

Build sanity checks into the response. If the user asked for a count, the result should be a scalar integer. If they asked for a trend, you should have a date column, a numeric column, and a sensible number of rows. Is the value within an order of magnitude of yesterday's? Is the row count plausible? Does the time range match what was asked? Flag anomalies before showing them.

Log every step. Question, retrieved schema, generated SQL, validation results, executed query, result shape, user feedback. Without that trail you cannot debug bad answers, you cannot build the verified question-SQL corpus that improves accuracy over time, and you cannot answer the regulator when they ask what your AI system did on a specific date.

Semantic layers and tool-restricted agents as the safer default

For a lot of internal data products, the right answer is not free-form text-to-SQL at all. It is a governed semantic layer with an LLM that picks among approved metrics.

The dbt Semantic Layer is the production-grade version of this. dbt MetricFlow compiles YAML-defined metrics and dimensions into SQL, and dbt Labs explicitly positions it as a safer data interface for LLMs than free-form text-to-SQL, because the model only chooses among governed metrics and entities. If "monthly recurring revenue" is a metric in your semantic layer, the model can ask for it sliced by region, product or cohort. It cannot invent a new way of calculating it.

The same principle applies to tool-restricted agents that expose a small number of parameterised SQL templates as tools. The model picks the tool, fills in arguments, and never writes raw SQL. For most internal use cases that is the right ceiling. If you do need to build, Defog has open-sourced SQLCoder, a family of fine-tuned LLMs for text-to-SQL with weights released under CC BY-SA 4.0 on Hugging Face for personal and commercial use, useful when you need to keep the model inside your network.

When text-to-SQL is the wrong tool, and what to build instead

If the question is the same five questions every Monday morning, build five dashboards. If it requires regulatory-grade accuracy (financial reporting, audit outputs, board metrics), do not let an LLM compose the SQL, ever. If the user community is small and technical, give them a notebook with a copilot, not a chat box that pretends to be a data team. Text-to-SQL is at its best as an exploration layer on a governed warehouse, with a user who can sense-check the answer. It is at its worst as a "self-service analytics" replacement for an actual data team, sold to people who have no way to know when it is wrong.

Want a SQL agent that survives production?

Our AI Workflow Automation engagement designs the schema retrieval, validation, database role and observability layers for production text-to-SQL agents, against your actual warehouse and your actual regulatory posture. You ship a system you can defend, not a demo.

See the AI Workflow Automation service

POPIA, FSCA and what South African teams must control before going live

If your agent can read personal information, POPIA reads onto every design decision above. Section 19 of the Protection of Personal Information Act requires responsible parties to take appropriate, reasonable technical and organisational measures to prevent unlawful access to or processing of personal information, and to continually update those safeguards as new risks emerge. A text-to-SQL agent that can SELECT from a customer table is a new risk surface. Administrative fines run up to ZAR 10 million per infringement notice for serious breaches.

The controls POPIA requires here are the ones already described: a least-privilege read-only role, row level security pinning queries to the right data subjects, statement timeouts, full logging of prompts, SQL and results, and a documented incident response path. For FSCA-regulated entities, layer on retention and access policies that match your existing data governance framework. If you cannot tell the regulator who asked what, when, and what the agent returned, you do not have a production system. You have a liability.

Key takeaways

Academic accuracy numbers do not survive contact with real warehouses, so any production text-to-SQL agent has to be engineered around the assumption that the model will frequently get the schema wrong.
Schema retrieval, query parsing, EXPLAIN-based dry runs and result sanity checks should be treated as non-negotiable layers, not optional polish.
Real safety comes from the database role, not the prompt: read-only users, row level security, statement timeouts and query allow-lists contain the blast radius when prompt injection happens.
A governed semantic layer such as dbt MetricFlow is often a stronger default than free-form text-to-SQL because the LLM only picks among approved metrics and dimensions.
For South African deployments, POPIA Section 19 and the Information Regulator's enforcement powers make logged, least-privilege agent access a compliance requirement, not a nice-to-have.

The pattern is consistent across every production text-to-SQL system that has not embarrassed its team. The LLM is one component, not the system. The system is the retrieval, validation, role, semantic and observability layers that surround it. Build those, and an agent that answers questions about your data is a useful, defensible product. Skip them, and you have shipped a confident liar with a database connection.

RelatedMore writing

Production AI

Agents that ship.

SQL agents, document agents, decision systems. Engineered with the validation, observability and database controls that survive real users.

Book a discovery call See Services