The Current State and Evaluation of Text-to-SQL


The role of Data Analysts and Business Intelligence Analysts often involves translating business questions into SQL queries, acting as an intermediary between humans and computers. However, advancements in Natural Language Processing (NLP) and Large Language Models (LLMs) raise the possibility of handing this translation step to a language model, reducing manual effort and opening databases to users without expert SQL knowledge.

The latest Text-to-SQL models have achieved impressive accuracy rates. For example, a state-of-the-art model achieved 79.1% execution accuracy and a 97.8% valid SQL ratio when evaluated on the Spider development set. In comparison, OpenAI’s Codex davinci, even without fine-tuning, achieved 67.0% execution accuracy and a 91.6% valid SQL ratio. If a language model can generate accurate SQL queries without any corrections, it may even outperform a human analyst: it can return a valid query and the desired statistics almost instantly, something a human rarely manages on the first attempt.

With multiple approaches and solutions flooding the market, we are left with the problem of evaluation. Which approach is most efficient? Which one produces accurate answers most reliably? Which one adapts best to different datasets? To help answer these questions, the open-source community and academia have put forth several benchmarks, but the three most used today are:

  1. WikiSQL
  2. Spider
  3. BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)

WikiSQL

Introduced by Salesforce in late 2017, WikiSQL was the first truly large dataset built for the text-to-SQL task. However, it has one major drawback: simplicity.

All of the provided SQL queries are exceedingly simple, consisting of only SELECT, FROM, and WHERE clauses, and each question targets a single table: there are no foreign-key links between tables. Although models trained on WikiSQL can still work on new databases, they can only answer simple natural language questions that translate into equally simple SQL queries. A representative example is sketched below.
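To make that simplicity concrete, here is what a WikiSQL-style entry looks like. The structure (a question plus a small logical form with `sel`, `agg`, and `conds` fields) follows the published dataset; the table, question, and values themselves are invented for illustration.

```python
# A WikiSQL-style example (structure per the published dataset; the
# question, table, and values are invented for illustration).
example = {
    "question": "What position does Bob Smith play?",
    "table_id": "1-10015132-1",      # hypothetical table identifier
    "sql": {
        "sel": 3,                        # index of the SELECT column
        "agg": 0,                        # 0 = no aggregation
        "conds": [[0, 0, "Bob Smith"]],  # [column index, operator index, value]
    },
}

# The canonical SQL this logical form maps to: a single-table query
# with nothing beyond SELECT, FROM, and WHERE.
canonical_sql = 'SELECT Position FROM table WHERE Player = "Bob Smith"'
```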

For this reason, most of the recent research in the world of text-to-SQL focuses on more complex benchmarks. In fact, the WikiSQL leaderboard only has submissions from 2021 or earlier. With multiple submissions achieving a test accuracy of over 90%, and the best-performing submission reaching 93%, practitioners have shifted their focus to much more complex query generation, for which WikiSQL falls woefully short.

Spider

The Spider dataset aims to address some of the shortcomings of the WikiSQL dataset. Developed by 11 Yale students over more than 1,000 hours of combined effort, Spider introduces two critical elements: complexity and cross-domain coverage.

  • Complexity: The SQL queries go beyond the straightforward SELECT and WHERE clauses that WikiSQL is limited to, covering the more complex GROUP BY, ORDER BY, and HAVING clauses along with nested queries. Furthermore, all databases have multiple tables linked through foreign keys, allowing for complicated queries that join across tables (a hypothetical example follows this list).
  • Cross-domain coverage: With 200 complex databases spanning a wide range of domains, Spider is able to include unseen databases in the test set, allowing us to test a model’s generalizability.
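For contrast with the WikiSQL example above, here is the kind of query Spider targets. The schema and question are invented for illustration; the point is the join, aggregation, and ordering combined in a single query.

```python
# Hypothetical Spider-style question/query pair (invented schema).
# Unlike a WikiSQL query, the gold query joins two tables via a
# foreign key and uses GROUP BY, HAVING, and ORDER BY.
question = "Which departments have more than 10 employees, largest first?"

gold_query = """
SELECT d.name, COUNT(e.id) AS num_employees
FROM department AS d
JOIN employee AS e ON e.department_id = d.id
GROUP BY d.name
HAVING COUNT(e.id) > 10
ORDER BY num_employees DESC;
"""
```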

Evaluation of different submissions incorporates the following:

  • Component Matching: Do the individual components of the SELECT, WHERE, GROUP BY, and ORDER BY clauses match? Are the extracted keywords correct?
  • Exact Matching: Do all of the above components match exactly?
  • Execution Accuracy: Does executing the predicted query return the same result as executing the gold query? (A minimal sketch of this check follows the list.)
  • SQL Hardness: Queries are divided into four levels (easy, medium, hard, and extra hard) and weighted accordingly for the final evaluation.
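As a rough illustration of the execution-accuracy check, the sketch below runs a predicted and a gold query against the same SQLite database and compares their result sets. This is a simplification; the official Spider evaluator handles value normalization, ordering, and invalid SQL far more carefully.

```python
import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Naive execution-accuracy check: do the two queries, run against
    the same database, return the same result set?"""
    conn = sqlite3.connect(db_path)
    try:
        predicted = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss
    finally:
        conn.close()
    if "order by" in gold_sql.lower():
        return predicted == gold  # row order matters here
    # Otherwise compare as multisets; repr() sidesteps mixed-type sorting.
    return sorted(map(repr, predicted)) == sorted(map(repr, gold))
```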

There are a few variations of the Spider dataset used to evaluate the robustness and generalizability of models under different perturbations, such as Spider-Syn (which tests how well text-to-SQL models cope with synonym substitution in questions) and Spider-DK (which tests how well they incorporate added domain knowledge).

BIRD

The BIRD dataset was compiled by researchers from multiple institutions around the world to be more realistic than WikiSQL and Spider.

  • Because the data was collected from real-world scenarios, the databases retain their original, “dirty” format.
  • It also pairs questions with external knowledge, much as real-world developers draw on metadata, documentation, or other existing context stores; an invented example of this format follows the list.
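To show what that pairing might look like, here is an invented BIRD-style record. The idea of an evidence string attached to each question comes from the dataset; the field names and content below are illustrative, not copied from BIRD.

```python
# Invented BIRD-style example: each question carries an "evidence"
# string of external knowledge the model may need to write correct SQL.
example = {
    "db_id": "retail",  # hypothetical database
    "question": "What is the refund rate for premium customers?",
    "evidence": (
        "refund rate = COUNT(refunded orders) / COUNT(all orders); "
        "premium customers are those with tier = 'P'"
    ),
    "SQL": "SELECT ...",  # gold query omitted in this sketch
}
```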

The BIRD dataset also underscores the importance of efficient queries. Its evaluation method is the first to include a Valid Efficiency Score (VES), a new metric designed to measure efficiency alongside the usual execution correctness of a provided SQL query.
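As described in the BIRD paper, an incorrect query scores zero, while a correct one is rewarded by the square root of the ratio between the gold query’s execution time and the predicted query’s execution time. Below is a minimal sketch, assuming you already have per-example correctness flags and timings (the official evaluator measures the timings itself, averaged over repeated runs).

```python
import math

def valid_efficiency_score(results) -> float:
    """Sketch of BIRD's Valid Efficiency Score (VES).

    `results` is an iterable of (is_correct, gold_time, pred_time)
    tuples, with execution times in seconds. Incorrect predictions
    contribute 0; correct ones contribute sqrt(gold_time / pred_time),
    so a query faster than the gold query scores above 1.
    """
    results = list(results)
    per_example = [
        math.sqrt(gold_t / pred_t) if correct else 0.0
        for correct, gold_t, pred_t in results
    ]
    # Commonly reported as a percentage, like execution accuracy.
    return 100.0 * sum(per_example) / len(per_example)
```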

Conclusion

Text-to-SQL is an intriguing field that holds significant promise for both human-computer interaction research and practical business applications. While advancements in LLMs have brought real progress, particularly on simple questions, they have only scratched the surface of complex problems like text-to-SQL. Currently, no solution on the market can rival human performance once queries become even slightly more intricate. Despite this limitation, the future of text-to-SQL remains exciting, with ample opportunities for further exploration and development.