How Dune AI Measures the Accuracy of LLM‑Generated SQL for Crypto Data
By [Your Name] – DeFi & Blockchain News
March 4, 2026
Dune AI, the platform that aims to make blockchain analytics accessible through natural‑language interfaces, has published a detailed account of the testing regimen it uses to judge the quality of large language models (LLMs) that generate SQL queries. The company’s approach combines high‑fidelity benchmark suites with faster, lower‑cost sanity checks, providing a roadmap for any team that wants to turn conversational prompts into reliable on‑chain data queries.
The Challenge: From “SELECT …” to Trustworthy Answers
Generating SQL with an LLM is no longer a research novelty—state‑of‑the‑art models can answer simple questions with near‑perfect accuracy and handle more complex requests about half the time. Those figures, however, come from experiments on well‑structured, static databases. In the crypto world, datasets evolve continuously (new blocks, changing token supplies, fresh contract events), and queries often embed domain‑specific metrics that are unknown to a generic model.
To deliver trustworthy analytics, Dune AI needs a testing framework that can (1) verify that the model produces correct results, (2) flag regressions after model updates, and (3) do so without prohibitive compute costs.
A Multi‑Pronged Evaluation Strategy
| Method | What it tests | Typical effort | Strengths | Weaknesses |
|---|---|---|---|---|
| Golden‑SQL benchmark | End‑to‑end correctness using a curated set of question‑SQL pairs validated by experts. | High – requires manual creation of the dataset and expert review. | Directly measures result accuracy; detects subtle regressions; provides high confidence for release decisions. | Limited coverage of possible queries; time‑consuming to run, since LLM inference and query execution can be slow. |
| Unit‑style tests for prompts | Isolated checks that a specific prompt produces an expected result or error. | Medium – scripting of individual cases. | Fast feedback loop; helps developers spot mistakes early; scales better than full benchmark runs. | Lower fidelity; may miss interactions between prompts; requires continuous maintenance as new prompts emerge. |
| Distance‑metric comparison | Quantifies how close the result set of a generated query is to the “golden” result (e.g., Euclidean for numbers, Jaccard for sets, edit distance for text). | Low‑to‑medium – algorithmic, runs automatically. | Works when data is mutable (e.g., “last 24 h”); enables tolerant matching instead of binary pass/fail. | Requires calibration of thresholds; different metrics are not directly comparable, so results must be bucketed into “close” vs. “not close.” |
| Human review of edge cases | Manual inspection of prompts that trigger large metric deviations. | Low – triggered only for outliers. | Captures nuanced failures that automated checks miss; leverages domain expertise to surface hidden bugs. | Not scalable for large volumes; depends on availability of knowledgeable reviewers. |
Golden‑SQL: The Core of Dune’s Quality Gate
The most rigorous component is a “golden‑SQL” dataset built around real‑world DeFi questions. Each entry pairs a natural‑language request with a hand‑crafted SQL statement that has been verified by a blockchain analyst. Because the same query can be expressed in multiple syntactically correct ways, Dune compares the result sets rather than the raw text.
To keep results reproducible, the team often pins the query to a specific block height or timestamp, ensuring that the underlying data does not drift between test runs. When the question involves relative time windows (e.g., “the last hour”), a distance metric is used to judge similarity instead of an exact match.
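The result-level comparison described above can be sketched in a few lines. This is a minimal illustration, not Dune's actual harness: `result_set` is a hypothetical helper, and an in-memory SQLite table stands in for an on-chain dataset. The key idea is that two syntactically different queries, pinned to the same block height, are judged by their rows rather than their SQL text.

```python
import sqlite3

def result_set(conn, sql):
    """Execute a query and return its rows in a canonical, order-insensitive form."""
    return sorted(conn.execute(sql).fetchall())

# Toy table standing in for an on-chain transfers dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (block INTEGER, token TEXT, amount REAL)")
conn.executemany("INSERT INTO transfers VALUES (?, ?, ?)",
                 [(100, "USDC", 5.0), (101, "USDC", 7.5), (102, "DAI", 3.0)])

# Pinning both queries to a block height keeps the comparison reproducible.
golden    = "SELECT token, SUM(amount) FROM transfers WHERE block <= 101 GROUP BY token"
generated = "SELECT token, SUM(amount) AS total FROM transfers WHERE block < 102 GROUP BY token"

# Different SQL text, identical result sets -> the generated query passes.
assert result_set(conn, golden) == result_set(conn, generated)
```

Comparing canonicalized result sets avoids false failures when the model chooses a different but equivalent formulation (aliases, reordered clauses, alternative predicates).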
Running the full suite is computationally intensive; Dune speeds it up with parallel test execution tools (such as pytest‑xdist) but still treats it as a “nightly” validation rather than a continuous integration step.
Fast‑Feedback Unit Tests
For day‑to‑day development, Dune complements the benchmark with a battery of unit‑style tests. Each test isolates a single prompt and asserts either a concrete numeric answer or a property of the resulting data (e.g., “non‑empty result set”). The tests act as a smoke test, catching regressions before they reach the benchmark stage.
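A prompt-level test of this kind might look like the sketch below. Everything here is illustrative: `generate_sql` is a hypothetical stand-in for the model call (stubbed so the example runs offline), and the schema is invented. The shape, though, matches the article's description: one prompt, one assertion on a concrete value or a property of the rows.

```python
import sqlite3

def generate_sql(prompt: str) -> str:
    """Stand-in for the LLM call; a real test would hit the model endpoint."""
    return "SELECT COUNT(*) FROM blocks"

def test_block_count_prompt():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (height INTEGER)")
    conn.executemany("INSERT INTO blocks VALUES (?)", [(1,), (2,), (3,)])

    sql = generate_sql("How many blocks are in the table?")
    rows = conn.execute(sql).fetchall()

    # Property-style check plus a concrete expected value.
    assert rows, "query returned no rows"
    assert rows[0][0] == 3
```

Run under pytest, such tests give feedback in seconds; with pytest-xdist (`pytest -n auto`) a larger suite can be spread across CPU cores.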
While these tests lack the statistical rigor of the golden‑SQL approach, they dramatically reduce the turnaround time for developers and keep the CI pipeline responsive.
Metric‑Based Scoring and Bucketing
When a generated query is executed, the system computes an appropriate similarity score:
- Numeric answers – Euclidean distance.
- Textual outputs – Levenshtein (edit) distance.
- Unordered lists – Jaccard similarity.
Scores are then bucketed into “acceptable” or “unacceptable” based on pre‑defined thresholds. Aggregating the buckets across the benchmark set yields a single accuracy figure that can be tracked over model iterations.
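The scoring-and-bucketing step can be sketched as follows. The metric implementations are standard textbook versions, and the thresholds are made up for illustration; the article notes that calibrating them is one of the method's real costs.

```python
def euclidean(a, b):
    """Distance between numeric answers (lower is closer)."""
    return abs(a - b)

def jaccard(a, b):
    """Similarity of unordered sets in [0, 1] (higher is closer)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein(s, t):
    """Edit distance between strings (lower is closer)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def bucket(score, threshold, higher_is_better=False):
    """Collapse a raw score into 'acceptable' / 'unacceptable'."""
    ok = score >= threshold if higher_is_better else score <= threshold
    return "acceptable" if ok else "unacceptable"

# Illustrative thresholds only, not Dune's actual calibration.
print(bucket(euclidean(1000.0, 1004.0), threshold=5.0))                  # acceptable
print(bucket(jaccard({"USDC", "DAI"}, {"USDC"}), 0.4, higher_is_better=True))  # acceptable
```

Because the three metrics live on incomparable scales, only the bucketed labels are aggregated into the single accuracy figure.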
Analysis: What This Means for DeFi Analytics
- Higher reliability for end users. By anchoring evaluation to result‑level comparisons, Dune reduces the chance that a conversational query returns a silently wrong answer—critical when analysts base investment decisions on the data.
- Scalable yet rigorous workflow. The combination of a heavyweight nightly benchmark with lightweight unit checks provides a practical balance: developers receive quick feedback, while product releases are guarded by a thorough, expert‑validated test suite.
- Adaptability to mutable blockchain data. Using block numbers or timestamps to freeze data, and fallback distance metrics for relative windows, ensures that tests remain meaningful even as the ledger grows.
- Domain expertise remains essential. Building the golden‑SQL set and interpreting outlier failures require deep knowledge of on‑chain metrics. Automation can only go so far; human insight is still the final arbiter of correctness.
Key Takeaways
- Result‑based comparison beats text comparison – Dune measures the outputs of generated queries rather than the SQL strings themselves, eliminating false negatives caused by syntactic variance.
- A hybrid testing framework is necessary – Nightly golden‑SQL benchmarks provide high confidence, while prompt‑level unit tests give rapid feedback for developers.
- Distance metrics enable testing of time‑sensitive queries – When data evolves, similarity scores (e.g., Euclidean, Jaccard) allow tolerant validation instead of exact matches.
- Human review is still the safety net for edge cases – Automated metrics may miss nuanced logical errors; expert analysts intervene when large deviations are detected.
- Scalable evaluation is achievable with parallel execution – Tools like pytest‑xdist help keep benchmark runtimes manageable, though they remain far from “instant.”
Dune AI’s systematic, layered approach to evaluating LLM‑generated SQL demonstrates that reliable, conversational data access for DeFi is possible—provided that rigorous testing keeps pace with model improvements. As more platforms adopt generative query interfaces, the industry will likely converge on similar multi‑tiered validation pipelines to ensure that the promise of “just ask” does not come at the cost of inaccurate analytics.
For further discussion, the author can be reached on Twitter @dot2dotseurat or via the contact information provided on Dune’s website.
Source: https://dune.com/blog/how-we-evaluate-llms-for-sql-generation
