LLM-based Text2SQL

Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B., & Zhou, J. (2023). Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363.

个人总结: 一篇 LLM 在 Text2SQL 数据集上的 prompt engineering 的实验报告. 在文中评测的两个数据集中效果是开源方案中最好的. 提出的 prompt 方案 DAIL-SQL 融合了现有的几种 RAG 方法.

数据集

Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.

实际上看给出的 Data Examples, 即使是 EXTRA HARD 的样例, 涉及的数据库和 SQL 相比实际都相当简单.

[Extra Hard] What is the average life expectancy in the countries where English is not the official language?

SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN 
   (SELECT T1.name
    FROM country AS T1 JOIN
    country_language AS T2
    ON T1.code = T2.country_code
    WHERE T2.language = "English"
      AND T2.is_official = "T")

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) contains over 12,751 unique question-SQL pairs, 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education, etc.

评价指标

Execution Accuracy. 这个指标有很多种叫法. 生成的 SQL 执行结果是否与答案 SQL 结果相同.
Exact Set Match. 把 SQL 分解成若干子句, 每个子句再拆成词的集合. 弄成集合规避顺序问题, 比如 SELECT col1, col2 和 SELECT col2, col1 等价. 详见这里.
Valid Efficiency Score. 首先执行结果要符合答案, 其次评估效率.

Prompts

Question Representation

Basic Prompt. 给出相关表的 schemas, 接上 QA, 并以 A: SELECT 提示模型补全. 没有 instruction.

Table continents, columns = [ContId, Continent]
Table countries, columns = [CountryId, CountryName, Continent]
Q: How many continents are there?
A: SELECT

Text Representation Prompt. 在 Basic Prompt 基础上加上 instructions.

Given the following database schema:
continents: ContId, Continent
countries: CountryId, CountryName, Continent

Answer the following: How many continents are there?
SELECT

OpenAI Demostration Prompt. 当成 SQL 让模型补全, 把指示信息放在注释.

### Complete sqlite SQL query only and with no explanation
### SQLite SQL tables, with their properties:
#
# continents (ContId, Continent)
# countries (CountryId, CountryName, Continent)
#
### How many continents are there?
SELECT

Code Representation Prompt.

/* Given the following database schema: */
CREATE TABLE continents (
    ContId int primary key,
    Continent text,
    foreign key (ContId) references countries (Continent)
);

CREATE TABLE countries (
    CountryId int primary key,
    CountryName text,
    Continent int,
    foreign key (Continent) references continents (ContId)
);

/* Answer the following: How many continents are there? */
SELECT

Alpaca SFT Prompt.

Below is an instruction that describes a task, paired
with an input that provides further context. Write a
response that appropriately completes the request.

### Instruction:
Write a SQL query to answer the question "How many continents are there?"

### Input:
continents (ContId, Continent)
countries (CountryId, CountryName, Continent)

### Response:
SELECT

In-Context Learning

考虑 k-shot: 从训练集 (question-sql pairs) 中选 k 个放入 prompt.

Random.
Question Similarity Selection. 根据 question 相似度 kNN.
Masked Question Similarity Selection. 把 question 中的领域相关的表名, 列名, 值等 mask 掉, 再 kNN.
Query Similarity Selection. It employs a preliminary model to generate SQL query $s’$ using target question and database, where this generated $s’$ can be regarded as an approximation of target SQL query $s^\ast$. Then it encodes queries from examples into binary discrete syntax vectors according to their keywords. After that, it chooses $k$ examples by considering both similarity to the approximated query $s’$ and diversity among selected examples.

Example Organization

Full-Information Organization.

/* Given the following database schema: */
${DATABASE_SCHEMA}
/* Answer the following: How many authors are there? */
SELECT COUNT(*) FROM authors

/* Given the following database schema: */
${DATABASE_SCHEMA}
/* Answer the following: How many farms are there? */
SELECT COUNT(*) FROM farm

${TARGET_QUESTION}

SQL-Only Organization.

/* Some SQL examples are provided based on similar problems: */
SELECT COUNT(*) FROM authors

SELECT COUNT(*) FROM farm

${TARGET_QUESTION}

DAIL-SQL

这篇论文提出的方法, 缝合了上述所有方法.

用 Code Representation Prompt 表示 question.
Selection. Consider both questions and queries to select candidates. Specifically, DAIL Selection first masks domain-specific words in both target question $q$ and example questions $q_i$ in the candidate set. It then ranks the candidate examples based on the Euclidean distance between the embeddings of masked $q$ and $q_i$. Simultaneously, it calculates the query similarity between the pre-predicted SQL query $s’$ and $s_i$ in the candidate set. Finally, the selection criterion prioritizes the sorted candidates by question similarity with a query similarity greater than a predefined threshold. In this way, the selected top $k$ examples have good similarity with both question and query.
Organization. Preserve the mapping information between questions and SQL queries and also improve the token efficiency. 缝合两种 organization 但是省些 token.

/* Some example questions and corresponding SQL queries are provided based on similar problems: */

/* Answer the following: How many authors are there? */
SELECT COUNT(*) FROM authors

/* Answer the following: How many farms are there? */
SELECT COUNT(*) FROM farm

${TARGET_QUESTION}

最后是用这套 prompt 在 GPT-4 上达到 sota. 微调则是对开源小 LLM 进行, 因为没钱调 GPT-4.

其他

Self-Consistency Improves Chain of Thought Reasoning in Language Models 文中用这个 trick 提升了很少的点, 问题是耗时增加多倍.
Awesome-Text2SQL
We Fine-Tuned GPT-4 to Beat the Industry Standard for Text2SQL 微调了 GPT-4, 但是效果还没别人 prompt engineering 好. 人家强调别人针对数据集做了特殊处理, 他们这个更通用. 不知道实际如何.