How Dune AI Measures the Accuracy of LLM‑Generated SQL for Crypto Data
By [Your Name] – DeFi & Blockchain News
March 4, 2026
Dune AI, the platform that aims to make blockchain analytics accessible through natural‑language interfaces, has published a detailed account of the testing regimen it uses to judge the quality of large language models (LLMs) that generate SQL queries. The company’s approach combines high‑fidelity benchmark suites with faster, lower‑cost sanity checks, providing a roadmap for any team that wants to turn conversational prompts into reliable on‑chain data queries.
The Challenge: From “SELECT …” to Trustworthy Answers
Generating SQL with an LLM is no longer a research novelty—state‑of‑the‑art models can answer simple questions with near‑perfect accuracy and handle more complex requests about half the time. Those figures, however, come from experiments on well‑structured, static databases. In the crypto world, datasets evolve continuously (new blocks, changing token supplies, fresh contract events), and queries often embed domain‑specific metrics that are unknown to a generic model.
To deliver trustworthy analytics, Dune AI needs a testing framework that can (1) verify that the model produces correct results, (2) flag regressions after model updates, and (3) do so without prohibitive compute costs.
A Multi‑Pronged Evaluation Strategy
| Method | What it tests | Typical effort | Strengths | Weaknesses |
|---|---|---|---|---|
| Golden‑SQL benchmark | End‑to‑end correctness using a curated set of question‑SQL pairs validated by experts. | High – requires manual creation of the dataset and expert review. | Directly measures result accuracy; detects subtle regressions; provides high confidence for release decisions. | Limited coverage of possible queries; time‑consuming to run, since LLM inference and query execution can be slow. |
| Unit‑style tests for prompts | Isolated checks that a specific prompt produces an expected result or error. | Medium – scripting of individual cases. | Fast feedback loop; helps developers spot mistakes early; scales better than full benchmark runs. | Lower fidelity; may miss interactions between prompts; requires continuous maintenance as new prompts emerge. |
| Distance‑metric comparison | Quantifies how close the result set of a generated query is to the “golden” result (e.g., Euclidean for numbers, Jaccard for sets, edit distance for text). | Low‑to‑medium – algorithmic, runs automatically. | Works when data is mutable (e.g., “last 24 h”); enables tolerant matching instead of binary pass/fail. | Requires calibration of thresholds; different metrics are not directly comparable, so results must be bucketed into “close” vs. “not close.” |
| Human review of edge cases | Manual inspection of prompts that trigger large metric deviations. | Low – triggered only for outliers. | Captures nuanced failures that automated checks miss; leverages domain expertise to surface hidden bugs. | Not scalable for large volumes; depends on availability of knowledgeable reviewers. |
Golden‑SQL: The Core of Dune’s Quality Gate
The most rigorous component is a “golden‑SQL” dataset built around real‑world DeFi questions. Each entry pairs a natural‑language request with a hand‑crafted SQL statement that has been verified by a blockchain analyst. Because the same query can be expressed in multiple syntactically correct ways, Dune compares the result sets rather than the raw text.
To keep results reproducible, the team often pins the query to a specific block height or timestamp, ensuring that the underlying data does not drift between test runs. When the question involves relative time windows (e.g., “the last hour”), a distance metric is used to judge similarity instead of an exact match.
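The result-level comparison described above can be sketched in a few lines. This is a minimal illustration, not Dune's actual harness: `result_set` is a hypothetical helper, and an in-memory SQLite table stands in for an on-chain dataset. The key idea is that two syntactically different queries, pinned to the same block height, are judged by their rows rather than their SQL text.

```python
import sqlite3

def result_set(conn, sql):
    """Execute a query and return its rows in a canonical, order-insensitive form."""
    return sorted(conn.execute(sql).fetchall())

# Toy table standing in for an on-chain transfers dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (block INTEGER, token TEXT, amount REAL)")
conn.executemany("INSERT INTO transfers VALUES (?, ?, ?)",
                 [(100, "USDC", 5.0), (101, "USDC", 7.5), (102, "DAI", 3.0)])

# Pinning both queries to a block height keeps the comparison reproducible.
golden    = "SELECT token, SUM(amount) FROM transfers WHERE block <= 101 GROUP BY token"
generated = "SELECT token, SUM(amount) AS total FROM transfers WHERE block < 102 GROUP BY token"

# Different SQL text, identical result sets -> the generated query passes.
assert result_set(conn, golden) == result_set(conn, generated)
```

Comparing canonicalized result sets avoids false failures when the model chooses a different but equivalent formulation (aliases, reordered clauses, alternative predicates).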
Running the full suite is computationally intensive; Dune speeds it up with parallel test execution tools (such as pytest‑xdist) but still treats it as a “nightly” validation rather than a continuous integration step.
Fast‑Feedback Unit Tests
For day‑to‑day development, Dune complements the benchmark with a battery of unit‑style tests. Each test isolates a single prompt and asserts either a concrete numeric answer or a property of the resulting data (e.g., “non‑empty result set”). The tests act as a smoke test, catching regressions before they reach the benchmark stage.
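A prompt-level test of this kind might look like the sketch below. Everything here is illustrative: `generate_sql` is a hypothetical stand-in for the model call (stubbed so the example runs offline), and the schema is invented. The shape, though, matches the article's description: one prompt, one assertion on a concrete value or a property of the rows.

```python
import sqlite3

def generate_sql(prompt: str) -> str:
    """Stand-in for the LLM call; a real test would hit the model endpoint."""
    return "SELECT COUNT(*) FROM blocks"

def test_block_count_prompt():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (height INTEGER)")
    conn.executemany("INSERT INTO blocks VALUES (?)", [(1,), (2,), (3,)])

    sql = generate_sql("How many blocks are in the table?")
    rows = conn.execute(sql).fetchall()

    # Property-style check plus a concrete expected value.
    assert rows, "query returned no rows"
    assert rows[0][0] == 3
```

Run under pytest, such tests give feedback in seconds; with pytest-xdist (`pytest -n auto`) a larger suite can be spread across CPU cores.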
While these tests lack the statistical rigor of the golden‑SQL approach, they dramatically reduce the turnaround time for developers and keep the CI pipeline responsive.
Metric‑Based Scoring and Bucketing
When a generated query is executed, the system computes an appropriate similarity score:
- Numeric answers – Euclidean distance.
- Textual outputs – Levenshtein (edit) distance.
- Unordered lists – Jaccard similarity.
Scores are then bucketed into “acceptable” or “unacceptable” based on pre‑defined thresholds. Aggregating the buckets across the benchmark set yields a single accuracy figure that can be tracked over model iterations.
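The scoring-and-bucketing step can be sketched as follows. The metric implementations are standard textbook versions, and the thresholds are made up for illustration; the article notes that calibrating them is one of the method's real costs.

```python
def euclidean(a, b):
    """Distance between numeric answers (lower is closer)."""
    return abs(a - b)

def jaccard(a, b):
    """Similarity of unordered sets in [0, 1] (higher is closer)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein(s, t):
    """Edit distance between strings (lower is closer)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def bucket(score, threshold, higher_is_better=False):
    """Collapse a raw score into 'acceptable' / 'unacceptable'."""
    ok = score >= threshold if higher_is_better else score <= threshold
    return "acceptable" if ok else "unacceptable"

# Illustrative thresholds only, not Dune's actual calibration.
print(bucket(euclidean(1000.0, 1004.0), threshold=5.0))                  # acceptable
print(bucket(jaccard({"USDC", "DAI"}, {"USDC"}), 0.4, higher_is_better=True))  # acceptable
```

Because the three metrics live on incomparable scales, only the bucketed labels are aggregated into the single accuracy figure.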
Analysis: What This Means for DeFi Analytics
- Higher reliability for end users. By anchoring evaluation to result‑level comparisons, Dune reduces the chance that a conversational query returns a silently wrong answer—critical when analysts base investment decisions on the data.
- Scalable yet rigorous workflow. The combination of a heavyweight nightly benchmark with lightweight unit checks provides a practical balance: developers receive quick feedback, while product releases are guarded by a thorough, expert‑validated test suite.
- Adaptability to mutable blockchain data. Using block numbers or timestamps to freeze data, and fallback distance metrics for relative windows, ensures that tests remain meaningful even as the ledger grows.
- Domain expertise remains essential. Building the golden‑SQL set and interpreting outlier failures require deep knowledge of on‑chain metrics. Automation can only go so far; human insight is still the final arbiter of correctness.
Key Takeaways
- Result‑based comparison beats text comparison – Dune measures the outputs of generated queries rather than the SQL strings themselves, eliminating false negatives caused by syntactic variance.
- A hybrid testing framework is necessary – Nightly golden‑SQL benchmarks provide high confidence, while prompt‑level unit tests give rapid feedback for developers.
- Distance metrics enable testing of time‑sensitive queries – When data evolves, similarity scores (e.g., Euclidean, Jaccard) allow tolerant validation instead of exact matches.
- Human review is still the safety net for edge cases – Automated metrics may miss nuanced logical errors; expert analysts intervene when large deviations are detected.
- Scalable evaluation is achievable with parallel execution – Tools like pytest‑xdist help keep benchmark runtimes manageable, though they remain far from “instant.”
Dune AI’s systematic, layered approach to evaluating LLM‑generated SQL demonstrates that reliable, conversational data access for DeFi is possible—provided that rigorous testing keeps pace with model improvements. As more platforms adopt generative query interfaces, the industry will likely converge on similar multi‑tiered validation pipelines to ensure that the promise of “just ask” does not come at the cost of inaccurate analytics.
For further discussion, the author can be reached on Twitter @dot2dotseurat or via the contact information provided on Dune’s website.
Source: https://dune.com/blog/how-we-evaluate-llms-for-sql-generation
