Open-Sourcing the Best Local Text-to-SQL System
Published: 2025-07-04 12:47:24
Lessons from BIRD and the Path to Enterprise-Scale

Introduction

Today, we’re open-sourcing Contextual AI’s Text-to-SQL system – the best fully-local solution on the BIRD benchmark. While currently in the top 5 (behind API-based systems using Gemini and GPT-4o), our system held the overall #1 spot in February 2025 and demonstrates that local models can compete with closed-source giants on this critical enterprise task.

Why does this matter? While unstructured data comprises the majority of enterprise information, mission-critical operational data – financial records, customer transactions, inventory metrics – lives primarily in structured databases. Accessing this data requires SQL expertise, creating bottlenecks between business stakeholders and the insights they need. Text-to-SQL systems promise to bridge this gap by automatically translating natural language queries into executable SQL statements, transforming questions like “show me Q4’s top 5 highest revenue customers by region” into queries that return actionable results.

The case for local models is particularly compelling here. Structured data often contains sensitive information (financial transactions, customers’ data, personal data, etc.), making privacy-preserving local AI systems highly desirable. Unlike black-box API models, open-source local models can be customized and further optimized for specific use cases. Our system demonstrates that with the right approach, local models can achieve competitive performance without sacrificing data privacy or control.

Recent advances in language models have opened new possibilities for Text-to-SQL tasks, with benchmarks like BIRD and SPIDER 2.0 providing valuable signals for measuring progress. Through our investigation at Contextual AI, we’ve uncovered key insights about what makes these systems work – and how to push their boundaries.

In this technical blog post, we dive deep into the design decisions and insights behind our pipeline, connecting our findings to the broader literature. We’ll explore how inference-time scaling through parallel candidate generation enables local models to compete with larger API models, why context remains crucial for accuracy, and what our experiments with thinking models revealed about different approaches to scaling compute. While our current system relies on generating multiple candidates (which can be computationally expensive), the use of local models opens doors to further optimization through techniques like parallelization for increased throughput and reinforcement learning (RL) training for more efficient sampling.

For those ready to build, we’ve also created a step-by-step Colab Notebook (link) that serves as a primer for exploring the full system. The notebook walks beginners through implementing the core ideas from our solution, providing a foundation to understand and experiment with the complete pipeline. Whether you’re a technical AI builder looking to solve enterprise data access challenges or a researcher exploring the frontiers of Text-to-SQL, this post provides both practical tools and theoretical insights to advance your work.

Contextual-SQL

Existing SoTA Text-to-SQL frameworks like Chase-SQL and XiYan-SQL often rely on components such as Schema Linking (retrieving tables and columns relevant to the query), Value Retrieval, Chain-of-Thought (CoT) Reasoning, Query Fixer, etc. to generate SQL queries.

While value retrieval and schema linking are promising techniques for reducing the query’s complexity and shortening the schema description in the context, imperfect recall from these stages propagates downstream and becomes a performance bottleneck.

On the other hand, chain-of-thought reasoning and query refinement are natural choices for scaling inference-time computation. However, the reliability and effectiveness of these approaches are not entirely understood.

We instead aim to investigate the full potential of scaling inference-time computation via sampling, where parallelization and input context caching are benefits of sampling over more sequential methods like CoT or query refinement. We show that by scaling up a relatively simple candidate generation pipeline and selecting promising candidates carefully, we can achieve a top score on the BIRD benchmark.

Overview of Contextual-SQL

The core idea behind Contextual-SQL is a two-stage approach: generate candidates, then identify the good ones. More specifically, we provide an informative context to generate a diverse set of candidates, and then select the best candidate by first filtering and then ranking them.

This pipeline captures 2 important principles in building AI systems: (1) The importance of a good context, and (2) The power of inference-time scaling. We give a brief overview of these 2 ideas below and provide more details in the next section.

The Importance of Context

Context is often the key to solving difficult problems with AI. This principle holds with Text-to-SQL. The main input to an LLM for Text-to-SQL, along with the natural language query, is a textual description of the database that includes the tables’ names, descriptions, their columns, and other useful metadata.

We examine providing context via Data Definition Language (DDL), M-Schema, and adding few-shot examples, and show improved performance with more informative context in Table 1.

The Power of Inference-time Scaling

Recent research has shown that much of the gain from RL-tuned models can be recovered from the base model alone (without RL), provided one samples enough candidates from the base model and has some way to select the right one. For example, [Yue et al., May 25], “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”, shows that with a high enough number of samples the base model performs comparably to the RL-trained model: at K>=128 (right figure), the base model starts to achieve a higher recall than GRPO-tuned models.

Hence, while RL can induce more efficient sampling towards higher reward regions, one can also achieve similar performance gains by sampling enough diverse candidates and searching for promising ones.

Getting Started: The Importance of a Good Context

In this section, we introduce 3 ways of providing context and perform experiments (provided in this Google Colab notebook) to demonstrate the importance of context: Data Definition Language (DDL), mSchema (by XiYan-SQL), and adding few-shot examples. Check out the notebook if you want to follow along with our results in this Section.

A Basic Schema Description: DDL

One of the most basic ways of describing a database schema to a language model is via Data Definition Language (DDL). DDL is the subset of SQL that defines and manages database schema structures through statements like CREATE, ALTER, and DROP, specifying tables, columns, data types, constraints, indexes, and relationships within a database. This gives the model valuable context (table names, column names, data types, foreign key relationships, and constraints) for producing SQL queries that align with the actual database structure.
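As a minimal sketch of what a DDL context can look like, the snippet below creates a toy SQLite database and reads its `CREATE TABLE` statements back from SQLite's `sqlite_master` catalog. The `customers` and `transactions` tables and the `get_ddl` helper are hypothetical illustrations, not taken from the BIRD data or from the Contextual-SQL code.

```python
import sqlite3


def get_ddl(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements stored in SQLite's catalog."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql.strip() + ";" for (sql,) in rows)


# Hypothetical toy schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    segment TEXT,
    currency TEXT
);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL,
    date TEXT
);
""")

ddl_context = get_ddl(conn)
print(ddl_context)  # this text would be placed in the model's prompt
```

The resulting string carries names, types, and the foreign key, but says nothing about what values the columns actually hold.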

Adding More Schema Context with Reflection

M-Schema is an attempt at creating a more informative context on top of a more LLM-friendly presentation of the database’s schema. The key idea behind mSchema is to leverage SQLAlchemy’s reflection to expose the connections between tables by including their foreign key relationships, and to include example values for each column to improve the model’s comprehension. The example below shows how representative examples are added to each column on top of the column’s name and type.
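The idea can be sketched with the standard library alone. The snippet below is not XiYan-SQL's M-Schema format and does not use SQLAlchemy; it is an assumed, simplified stand-in that uses SQLite's `PRAGMA table_info` and `PRAGMA foreign_key_list` to list each column with its type, a few example values, and the foreign keys. The `describe_schema` helper, the output layout, and the toy tables are hypothetical.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection, n_examples: int = 3) -> str:
    """Build a compact schema description with example values and foreign keys."""
    lines, fk_lines = [], []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()]
    for table in tables:
        lines.append(f"# Table: {table}")
        # table_info rows are (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _, col, col_type, _, _, pk in columns:
            examples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{col}" FROM "{table}" '
                f'WHERE "{col}" IS NOT NULL LIMIT ?', (n_examples,)
            ).fetchall()]
            pk_note = ", primary key" if pk else ""
            lines.append(f"({col}: {col_type}{pk_note}, examples: {examples})")
        # foreign_key_list rows are (id, seq, table, from, to, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
            fk_lines.append(f"{table}.{fk[3]} = {fk[2]}.{fk[4]}")
    if fk_lines:
        lines.append("# Foreign keys:")
        lines.extend(fk_lines)
    return "\n".join(lines)


# Hypothetical toy database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'SME'), (2, 'LAM');
INSERT INTO transactions VALUES (10, 1, 25.5), (11, 2, 99.0);
""")

schema_text = describe_schema(conn)
print(schema_text)
```

Compared with raw DDL, the example values tell the model how a column is actually encoded (for instance, that `segment` holds short codes), which is the kind of information a question often depends on.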

In-Context Learning and Few-Shot Demonstration

Few-shot examples enable in-context learning for text-to-SQL by providing demonstration pairs of natural language questions and corresponding SQL queries within the prompt. These examples provide hints on what a typical question-answer pair looks like and how the model should respond. In the Colab notebook, we show that providing just 1 example demonstration improves the model’s performance.
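A prompt with few-shot demonstrations might be assembled roughly as follows. The instruction wording, the section labels, and the `build_prompt` helper are assumptions for illustration; the prompt used in the notebook may differ.

```python
def build_prompt(schema: str, question: str, examples: list[tuple[str, str]]) -> str:
    """Assemble schema, (question, SQL) demonstrations, and the new question."""
    parts = [
        "You are a SQLite expert. Write one SQL query that answers the question.",
        "",
        "Database schema:",
        schema,
        "",
    ]
    for example_question, example_sql in examples:
        parts += [f"Question: {example_question}", f"SQL: {example_sql}", ""]
    # The prompt ends with an open "SQL:" slot for the model to complete.
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


# Hypothetical schema text and demonstration pair.
prompt = build_prompt(
    schema="# Table: customers\n(customer_id: INTEGER, examples: [1, 2])",
    question="How many customers are there?",
    examples=[("List all customer ids.", "SELECT customer_id FROM customers")],
)
print(prompt)
```

Passing a different demonstration in `examples` changes the prompt without touching the schema portion, which is what makes it cheap to vary few-shot examples across samples.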

Experiments: The Importance of Context

The accompanying Google Colab notebook provides step-by-step implementations of building the context for a DB schema, along with experiments on the `credit_card_specialization` subset of the BIRD-eval set (64 questions of varying difficulty). All you need is a Google API key for Gemini to see the impact of context on the model’s ability to generate the final SQL. The results below demonstrate this point:

Table 1: Accuracy of different contexts

| Context         | Accuracy |
|-----------------|----------|
| DDL             | 54.68%   |
| mSchema         | 60.94%   |
| mSchema+Fewshot | 62.5%    |

In the next section, we scale up this recipe to achieve top scores on BIRD.

Towards SoTA on BIRD: Inference-time Scaling and Candidate Selections

Pass@k for execution accuracy evaluates whether at least one of k generated SQL candidates produces the correct result. Increasing k with diversity-enhancing techniques such as higher sampling temperatures or variations in the prompt (via reordering or changing few-shot examples) reveals the model’s latent capability to generate correct queries even when the top-1 prediction fails.
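One common way to estimate pass@k from n sampled candidates, of which c are correct, is the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k) used in code-generation evaluations. The post does not say which estimator was used for its curves, so the sketch below is an assumption about the computation, not a description of it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random size-k subset of n candidates contains
    at least one of the c correct candidates."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct candidate
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 3 correct candidates out of 10 samples.
for k in (1, 2, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 3))
```

Averaging this quantity over all questions gives one point on a pass@k curve; how steeply it rises with k indicates how much a better selection strategy could gain.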

When pass@k demonstrates favorable scaling behavior (substantial improvement as k increases), this indicates that the model’s generations contain the correct answer but suffer from high variance. In this scenario, inference-time scaling strategies can be highly effective – specifically, generating numerous diverse candidates and then selecting the best SQL query from the candidate pool with mechanisms such as execution-success filtering, consistency-based scoring through majority voting, or learned reward models that rank candidates based on the query’s output.

In the Figure below, on Gemini-1.5-Flash with the mSchema+example prompting, we generate 1024 candidates by generating 32 samples each (with temperature 1) across 32 different few-shot examples. Then we measure the pass@k performance of random selection of valid non-duplicated candidates versus consistent selection (top valid SQL candidates with the most votes), where valid candidates mean SQL queries that execute successfully (no execution error).
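Validity filtering and consistent selection can be sketched as follows, assuming a SQLite database and treating two candidates as agreeing when they return the same rows regardless of row order. The helper names and the toy table are hypothetical, and the tie-breaking rule (first candidate seen) is an assumption.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Return a hashable, order-insensitive result, or None on execution error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # invalid candidate
    return tuple(sorted(rows, key=repr))


def select_by_consistency(conn, candidates):
    """Pick a candidate whose execution result received the most votes."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, result) for sql, result in executed if result is not None]
    if not valid:
        return None
    votes = Counter(result for _, result in valid)  # one vote per valid sample
    winner, _ = votes.most_common(1)[0]
    return next(sql for sql, result in valid if result == winner)


# Hypothetical toy table and candidates (assumed to be read-only SELECTs).
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (5);")
candidates = [
    "SELECT COUNT(*) FROM t",
    "SELECT COUNT(x) FROM t",
    "SELECT MAX(x) FROM t",
    "SELEC oops",  # fails to execute and is filtered out
]
best = select_by_consistency(conn, candidates)
print(best)
```

Voting on execution results rather than on SQL text lets syntactically different but equivalent queries reinforce each other.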

There, we see that the performance improves quite a bit as the number of candidates increases. This suggests that further improvements can be extracted by having a better candidate selection strategy.

Ranking Candidates with a Reward Model

A natural strategy for selecting candidates is to train a scoring model to rank candidate SQLs’ execution outputs given the DB information and user’s query.

For our pipeline, we train a base Qwen-2.5-32B model on the train split of BIRD to output the probability of a candidate SQL query and its outputs being correct given the DB schema and query as context.

In more detail, a large pool of distinct candidate SQL queries is generated for every training question and sorted by generation likelihood. Every query is then labeled: positive if its execution output matches the ground-truth answer, negative otherwise. During training, for each question within a mini-batch, a positive sample is drawn at random and paired with the 15 incorrect candidate SQL queries of highest likelihood as hard negatives. The training objective for the reward model is to classify the correct SQL from the pool of 16. This was implemented with the same codebase we used to train our state-of-the-art reranker.
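The group construction described above can be sketched in plain Python. The candidate record format (`sql`, `logprob`, `correct`), the rule of skipping questions without a full group, and the use of a softmax cross-entropy over the 16 scores are assumptions made for illustration; the post states only that the objective is to classify the correct SQL from the pool of 16.

```python
import math
import random


def build_group(candidates, num_negatives=15, rng=random):
    """Return [positive] + hard negatives, or None if the question is unusable.

    candidates: list of dicts with 'sql', 'logprob', 'correct' (hypothetical format).
    """
    positives = [c for c in candidates if c["correct"]]
    negatives = [c for c in candidates if not c["correct"]]
    if not positives or len(negatives) < num_negatives:
        return None  # assumption: skip questions that cannot fill a group
    positive = rng.choice(positives)  # positive drawn at random
    # Hard negatives: incorrect candidates with the highest generation likelihood.
    hard = sorted(negatives, key=lambda c: c["logprob"], reverse=True)[:num_negatives]
    return [positive] + hard  # index 0 is the correct candidate


def listwise_loss(scores, target_index=0):
    """Softmax cross-entropy: -log softmax(scores)[target_index]."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[target_index]


# Made-up pool: 2 correct and 20 incorrect candidates for one question.
pool = [{"sql": f"q{i}", "logprob": -float(i), "correct": i in (3, 7)} for i in range(22)]
group = build_group(pool, rng=random.Random(0))
print(len(group), listwise_loss([0.0] * 16))
```

In an actual training loop the scores would come from the reward model's forward pass on each (schema, question, SQL, output) input, and the loss would be minimized with a gradient-based optimizer.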

Indeed, as in the Figure below, selecting top valid candidates with the highest scores given by the reward model improves the pass@1 score of Gemini-1.5-Flash dramatically over consistent (majority voting) and random (among valid non-duplicated candidates) selection strategies:

On the BIRD-dev set, the strategy presented so far achieves an execution accuracy of 70%, which already places it among the top scores.

Improving Consistency: Log-Probs and Fine-Grained Confidence

From the consistency result in the previous section, we see that a model’s “confidence” in its output (here, measured by majority voting) can correlate with the output being correct. A more fine-grained way to measure a model’s confidence is via the probability it assigns to each token of its output.

More specifically, if an output SQL query $X$ of the model corresponds to tokens $x_1, \dots, x_T$, then the model’s conditional likelihood for the output given the input $Q$ is

$$P(X \mid Q) = \prod_{t=1}^{T} P(x_t \mid x_{<t}, Q),$$

or, more simply, taking the log of the probability (log-probs):

$$\log P(X \mid Q) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t}, Q).$$
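In code, the cumulative log-prob is just the sum of the per-token log-probabilities returned by the generator. The token probabilities below are made-up numbers, and `sequence_logprob` is a hypothetical helper.

```python
import math


def sequence_logprob(token_logprobs):
    """Cumulative log-prob of a sequence: the sum of its token log-probs."""
    return math.fsum(token_logprobs)


# Made-up per-token probabilities for two candidate queries.
candidate_a = [math.log(p) for p in (0.9, 0.8, 0.95)]
candidate_b = [math.log(p) for p in (0.6, 0.7, 0.9, 0.5)]

score_a = sequence_logprob(candidate_a)
score_b = sequence_logprob(candidate_b)
print(score_a, score_b)  # the higher (less negative) score is the more likely sequence
```

Because every token contributes a non-positive term, longer queries tend to receive lower cumulative scores; whether to normalize by length is a design choice this sketch leaves out.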

Since API models in general do not provide the tokens’ log-probs (nor logits), we perform a similar experiment as before with Qwen2.5-coder-32b, where we use the conditional likelihood score given by the cumulative log-probs:

Here, we see that indeed log-probs provide a slightly better signal than consistency alone at small k.

Combining Confidence with Reward for Ranking

A natural progression is to combine the log-probs and the reward score for a better signal. Note that the reward model is trained to output the probability P(Y|X, Q) that the output Y of a SQL candidate X is correct given the query Q. Combining that with the probability of the SQL candidate, P(X|Q), gives a joint likelihood of the pair: $P(X, Y \mid Q) = P(Y \mid X, Q)\, P(X \mid Q)$.

Hence, we can create a score as a weighted average of the log-probs of the SQL candidate and its reward. More specifically, parameterized by a weight α, we compute

$$\text{score}_\alpha(X) = \alpha \cdot R(X) + (1 - \alpha) \cdot \log P(X \mid Q),$$

where $R(X)$ denotes the reward model’s score for candidate $X$.

Note that α=0 and α=1 correspond to log-probs only and reward only, respectively. Sweeping through values of α, we find that α=0.4 gives improved performance on a subset of the dev set.
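The weighting and the sweep over α can be sketched as below. The candidate records, the numbers, and the grid of α values are invented for illustration, and the reward is assumed to be on a log scale so that it is comparable to the log-probs; the post does not specify these details.

```python
def combined_score(logprob, reward, alpha):
    """alpha = 0 uses log-probs only; alpha = 1 uses the reward only."""
    return alpha * reward + (1.0 - alpha) * logprob


def select(candidates, alpha):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(c["logprob"], c["reward"], alpha))


def sweep_alpha(questions, alphas):
    """Return (best_alpha, accuracy) over a list of per-question candidate lists."""
    best = None
    for alpha in alphas:
        correct = sum(select(cands, alpha)["correct"] for cands in questions)
        accuracy = correct / len(questions)
        if best is None or accuracy > best[1]:
            best = (alpha, accuracy)
    return best


# Made-up held-out data: two questions with two candidates each.
questions = [
    [{"logprob": -1.0, "reward": -2.0, "correct": False},
     {"logprob": -2.0, "reward": -0.1, "correct": True}],
    [{"logprob": -0.5, "reward": -1.0, "correct": True},
     {"logprob": -4.0, "reward": -0.5, "correct": False}],
]
print(sweep_alpha(questions, [0.0, 0.5, 1.0]))
```

In this toy data each signal alone gets one question wrong, while the mixture gets both right, which is the effect the weighted score is meant to capture.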

Performing the same experiment indeed shows a lift in the final score:

At this point, we have achieved a close-to-SoTA score on the BIRD-dev set (around 73% execution accuracy) with this recipe. This is the main recipe for the full Contextual-SQL submission that achieved SoTA on BIRD.

The Contextual-SQL Pipeline

With the key components outlined above, we summarize again the full pipeline here:

1. Provide the context as mSchema along with 1 example sampled from the train set.
2. Using a temperature of 1, sample n candidates.
3. Repeat steps 1 and 2 m times with different few-shot examples to generate a total of n*m candidates.
4. Execute the candidates and keep only the valid ones (no SQL execution error).
5. Finally, use the trained reward model along with the log-probs to select the candidate with the highest score.
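Put together, the steps can be sketched as a single function. This is not the code from the repository: `generate` and `reward` are hypothetical callables standing in for the language model and the trained reward model, the prompt layout is invented, and the first m examples of the pool are used instead of random sampling to keep the sketch deterministic.

```python
import sqlite3


def run_pipeline(conn, question, schema, few_shot_pool, generate, reward, n, m, alpha):
    """Select one SQL query for `question`.

    generate(prompt, n) -> list of (sql, logprob) pairs  (hypothetical interface)
    reward(question, schema, sql, rows) -> float         (hypothetical interface)
    """
    scored = []
    for example in few_shot_pool[:m]:  # step 3: m different few-shot examples
        # Step 1: schema description plus one demonstration.
        prompt = f"{schema}\n\nExample:\n{example}\n\nQuestion: {question}\nSQL:"
        for sql, logprob in generate(prompt, n):  # step 2: n samples per prompt
            try:
                rows = conn.execute(sql).fetchall()  # step 4: keep valid queries only
            except sqlite3.Error:
                continue
            # Step 5: weighted combination of reward and log-probs.
            score = alpha * reward(question, schema, sql, rows) + (1 - alpha) * logprob
            scored.append((score, sql))
    return max(scored)[1] if scored else None


# Stub generator and reward model, used only to exercise the control flow.
def fake_generate(prompt, n):
    return [("SELECT COUNT(*) FROM t", -1.0), ("SELEC bad", -0.1)][:n]


def fake_reward(question, schema, sql, rows):
    return 0.0


conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2);")
chosen = run_pipeline(conn, "How many rows?", "# Table: t", ["Q: ... SQL: ..."],
                      fake_generate, fake_reward, n=2, m=1, alpha=0.4)
print(chosen)
```

Because each of the n*m samples is independent, the generation loop can be batched or run in parallel, and the shared schema prefix of the prompt can be cached across samples.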

The full open-source implementation can be found at https://github.com/ContextualAI/bird-sql.

Diving Deeper: Comparing Generator Models’ Quality

This section provides additional experimental results comparing different base models for generating SQL candidates with different temperature settings across different selection methods: Qwen-2.5/3, and Gemini-1.5/2/2.5-flash/pro. The pass@1 score is also shown in a legend box.

Here, we see that models of increasing capabilities (e.g. Gemini-2.5-pro) achieve better pass@1 performance. The thinking models (e.g. Qwen3-thinking) do not seem to outperform the non-thinking ones.

Furthermore, note that models that can generate diverse candidates, like Qwen-2.5, achieve better pass@k performance at higher k:

While a higher pass@k might mean that there is more performance to extract at inference time, there is no guarantee that we can fully recover that performance.

As newer benchmarks like SPIDER 2.0 demand longer context and more computational resources, more efficient approaches are needed beyond simple inference-time scaling.

Conclusion

Our journey with BIRD taught us valuable lessons about Text-to-SQL fundamentals – from the importance of context to the power of inference-time scaling. These insights enabled us to build a system that demonstrates local models can compete while preserving data privacy and enabling customization. The computational costs of candidate generation in our approach could be offset by the flexibility of local models, which enable both parallel processing to boost throughput and reinforcement learning methods to streamline the sampling process—optimizations that point toward potential efficiency gains.

Looking forward, our experiences with customers uncovered that enterprise deployments present an entirely different scale of challenge: intricate schemas, massive tables with thousands of columns, messy data, and complex multi-step queries that go far beyond single SQL generation. This is precisely what makes SPIDER2 a critical benchmark – with its 632 real-world workflow problems, databases exceeding 1,000 columns, and production cloud environments (BigQuery, Snowflake), it captures the complexity enterprises actually face. The performance gap speaks volumes: GPT-4o drops from 86.6% accuracy on Spider 1.0 to just 10.1% on SPIDER2, with even o1-preview reaching only 17.1%.

Applying insights from our BIRD research to these enterprise-grade problems is our next challenge. By open-sourcing our BIRD system today – which remains the best fully-local solution – we’re sharing the foundation that’s shaping our approach to the next generation of text-to-SQL systems.

Stay tuned as we push beyond single-query optimization toward solutions that can handle the multi-dialect SQL workflows and complex reasoning that production environments demand. The gap between current benchmark performance and real-world utility is substantial, but it’s precisely where the most impactful innovations await.

In the meantime, we hope our BIRD system and the insights shared in this post accelerate your own work in this space.
