Benchmarking NaturalSQL against State of the Art LLMs


In this benchmarking study, NaturalSQL-6.7B-v0 is evaluated against leading industry models, including GPT-4 and Defog's SQLCoder-7B for SQL generation in complex scenarios. The study focuses on complex questions and employs the SQL-Eval framework for a comprehensive assessment.

NaturalSQL shows remarkable proficiency in handling compound questions and outperforms larger models in several categories. This study demonstrates that with targeted fine-tuning, even smaller models can achieve state-of-the-art results in specific domains like SQL generation for complex business queries.

This study will test NaturalSQL-6.7B against SQLCoder-7B and other models on the two tests below:

  1. Small Sample of Complex Questions
  2. SQL-Eval - Programmatic SQL Model Evaluation Framework

Table of Contents #


TLDR #

NaturalSQL excels at generating sql for complex questions and beats industry leading models that are twice it's size in multiple categories. NaturalSQL is also much stronger at generating working SQL for compound / complex questions compared to other models of it's size.

1. SQL-Eval Results #

I used sql-eval by Defog, to benchmark NaturalSQL on 200 novel instruction / sql tests that were not trained on. We can see how it compares to other models and see a breakdown of performance by category.

What is SQL-Eval? #

Our testing procedure comprises the following steps. For each question/query pair:

  • We generate a SQL query (possibly from an LLM).
  • We run both the "gold" query and the generated query on their respective database to obtain 2 dataframes with the results.
  • We compare the 2 dataframes using an "exact" and a "subset" match.

SQL Eval evaluates the LLM on 200 questions across different categories and assesses correctness of the model.

SQL-Eval Percentage Correct #

SQLEval Results

Huge thanks to the Defog team for the open source sql-eval framework. Based on the results, we can see the very first iteration of the NaturalSQL-6.7B models, is on par with industry leading models of its size and beats gpt-3.5 and claude models.

SQL-Eval Performance by Query Category #

SQLEval by Category

Here we can see how NaturalSQL compares to other models on different SQL categories. Interestingly, NaturalSQL, matches or beats both SQLCoder-7B and SQLCoder-2-15B on 50% of the categories:

  • where
  • join
  • ratio

NaturalSQL-6.7B beats SQLCoder-7B in 4 out of 6 categories. However, we can see the category that really brings down the average is: date. NaturalQuery scores the lowest at 36.0. This will be a strong area of focus in the next iteration.

NaturalSQL by Category

2. NaturalSQL on Complex Questions #

NaturalSQL excels at complex questions. Here is a comparison of Defog's SQLCoder-7B and ChatDB's NaturalSQL-6.7B on compound SQL questions:

Question NaturalSQL SQLCoder
Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department?
Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently.
Find the supplier that provided the most products currently out of stock, and show the names of their out-of-stock products.
Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered.
Who is the employee that handled the most orders last quarter, and what is the status of their most recent order?

Testing Defog & NaturalSQL on Compound Questions #

What is a Compound Question? #

Enterprise companies typically aren't wanting to only ask simple and easy questions such as:

How many products did I sell this month?

The real power of text to SQL models like NaturalSQL is the ability to answer compound, multi-part questions like these below:

Compound Questions:

Which supplier supplied us the most products last year, show me our best selling product from them.

Which point guard had the highest points per game in March, how many steals did they average?

For the customers who ordered over 100$ worth of products, I want to know what the best selling product was

Creating a Base Line Test Set #

For the first test, I wanted to get a baseline of both NaturalSQL-6.7B and SQLCoder-7B on 5 complex questions for a random dataset.

I used ChatGPT to generate a SQL DDL with 5 tables with 10-15 columns. You can checkout my ChatGPT shared link here so you can verify these results weren't cherry picked.

Here is the prompt I used: create a 5 table sql create table sql where each table has 10-15 columns of different data types. Pretty simple.

Compound Questions #

For the SQL tables, I asked ChatGPT to generate 5 compound questions, in which it replied:

Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department?

Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently.

Find the supplier that provided the most products currently out of stock, and show the names of these out-of-stock products.

Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered.

Who is the employee that handled the most orders last quarter, and what is the status of their most recent order?

SQL Tables #

CREATE TABLE employees (
    employeeid SERIAL PRIMARY KEY,
    firstname VARCHAR(50),
    lastname VARCHAR(50),
    birthdate DATE,
    email VARCHAR(100),
    phone VARCHAR(20),
    salary NUMERIC(10, 2),
    hiredate TIMESTAMP,
    departmentid INT,
    address VARCHAR(200),
    city VARCHAR(50),
    state VARCHAR(50),
    zipcode VARCHAR(10),
    country VARCHAR(50),
    isactive BOOLEAN
);

CREATE TABLE departments (
    departmentid SERIAL PRIMARY KEY,
    departmentname VARCHAR(100),
    managerid INT,
    location VARCHAR(100),
    phonenumber VARCHAR(20),
    budget NUMERIC(10, 2),
    startdate DATE,
    enddate DATE,
    description TEXT,
    email VARCHAR(100),
    website VARCHAR(100)
);

CREATE TABLE products (
    productid SERIAL PRIMARY KEY,
    productname VARCHAR(100),
    supplierid INT,
    categoryid INT,
    quantityperunit VARCHAR(50),
    unitprice NUMERIC(10, 2),
    unitsinstock SMALLINT,
    unitsonorder SMALLINT,
    reorderlevel SMALLINT,
    discontinued BOOLEAN,
    registered TIMESTAMP,
    description TEXT
);

CREATE TABLE orders (
    orderid SERIAL PRIMARY KEY,
    customerid INT,
    employeeid INT,
    orderdate TIMESTAMP,
    requireddate TIMESTAMP,
    shippeddate TIMESTAMP,
    shipvia INT,
    freight NUMERIC(10, 2),
    shipname VARCHAR(100),
    shipaddress VARCHAR(200),
    shipcity VARCHAR(50),
    shipregion VARCHAR(50),
    shippostalcode VARCHAR(10),
    shipcountry VARCHAR(50),
    status VARCHAR(20)
);

CREATE TABLE customers (
    customerid SERIAL PRIMARY KEY,
    companyname VARCHAR(100),
    contactname VARCHAR(100),
    contacttitle VARCHAR(50),
    address VARCHAR(200),
    city VARCHAR(50),
    region VARCHAR(50),
    postalcode VARCHAR(10),
    country VARCHAR(50),
    phone VARCHAR(20),
    fax VARCHAR(20),
    email VARCHAR(100),
    website VARCHAR(100),
    preferredcontactmethod VARCHAR(50),
    registrationdate DATE
);

Testing both SQLCoder-7B and NaturalSQL-6.7B #

Question: Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department?

SQLCoder (Failed ❌) #

  • The join condition in the total_budget CTE is incorrect. It joins departments.departmentid with highest_salary.employeeid, which are unrelated fields. The correct join should be on departments.departmentid with employees.departmentid.
  • Also, MAX(employees.salary) should be outside the GROUP BY clause.
WITH highest_salary AS
  (SELECT employees.employeeid,
          employees.firstname,
          employees.lastname,
          MAX(employees.salary) AS max_salary
   FROM employees
   JOIN departments ON employees.departmentid = departments.departmentid
   WHERE departments.departmentname = 'Marketing'
   GROUP BY employees.employeeid,
            employees.firstname,
            employees.lastname),
     total_budget AS
  (SELECT departments.budget
   FROM departments
   JOIN highest_salary ON departments.departmentid = highest_salary.employeeid)
SELECT highest_salary.firstname,
       highest_salary.lastname,
       total_budget.budget
FROM highest_salary,
     total_budget;

NaturalSQL (Passed✅) #

_NaturalSQL is finetuned to use ILIKE and _ often since user rarely mentions a text column name in full form.*

This query correctly retrieves the first name, last name, and salary of the employee with the highest salary in the 'Marketing' department, along with the total budget of the department. It uses ILIKE for a flexible search, and ORDER BY with LIMIT 1 to get the highest salary.

SELECT e.firstname,
       e.lastname,
       e.salary,
       d.budget
FROM employees e
JOIN departments d ON e.departmentid = d.departmentid
WHERE d.departmentname ILIKE '%Marketing%'
ORDER BY e.salary DESC
LIMIT 1;

Question: Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently.

SQLCoder (Failed ❌) #

  • Incorrect Join in product_orders CTE: The query attempts to join the products table with the orders table using p.productid = o.productid. However, the orders table does not have a productid column, making this join invalid as per the provided schema.
  • Logical Error in customer_product_orders CTE: The join condition co.customerid = pp.productid is incorrect because it equates a customer ID with a product ID, which are unrelated entities in the schema.
WITH customer_orders AS
  (SELECT o.customerid,
          COUNT(*) AS num_orders
   FROM orders o
   WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 year') AND CURRENT_DATE
   GROUP BY o.customerid),
     product_orders AS
  (SELECT p.productid,
          COUNT(*) AS num_orders
   FROM products p
   JOIN orders o ON p.productid = o.productid
   WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 year') AND CURRENT_DATE
   GROUP BY p.productid),
     customer_product_orders AS
  (SELECT co.customerid,
          pp.productid,
          co.num_orders,
          pp.productname,
          pp.unitprice
   FROM customer_orders co
   JOIN product_orders pp ON co.customerid = pp.productid)
SELECT cpo.customerid,
       cpo.productname,
       cpo.unitprice
FROM customer_product_orders cpo
WHERE cpo.num_orders =
    (SELECT MAX(num_orders)
     FROM customer_product_orders)
ORDER BY cpo.num_orders DESC NULLS LAST;

NaturalSQL (Failed ❌) #

  • Non-existent Table Reference: The query references a table order_details in the FrequentlyOrderedProducts CTE, which does not exist in the provided schema.
  • Incorrect Join and Column Reference: In FrequentlyOrderedProducts, it attempts to join order_details with products on op.productid = p.productid, which cannot be validated due to the absence of order_details. Also, the query assumes a quantity column in order_details which is not present in the schema.
WITH LastYearOrders AS
  (SELECT customerid,
          COUNT(*) AS order_count
   FROM orders
   WHERE orderdate >= (CURRENT_DATE - INTERVAL '1 year')
   GROUP BY customerid
   ORDER BY order_count DESC
   LIMIT 1),
     FrequentlyOrderedProducts AS
  (SELECT op.orderid,
          op.productid,
          p.productname,
          p.unitprice
   FROM order_details op
   JOIN products p ON op.productid = p.productid
   JOIN LastYearOrders ly ON op.orderid = ly.orderid
   ORDER BY op.orderid,
            op.quantity DESC)
SELECT c.companyname,
       fop.productname,
       fop.unitprice
FROM FrequentlyOrderedProducts fop
JOIN customers c ON fop.customerid = c.customerid
JOIN orders o ON fop.orderid = o.orderid
WHERE o.orderdate >= (CURRENT_DATE - INTERVAL '1 year')
ORDER BY fop.orderid,
         fop.quantity DESC;

Question: Find the supplier that provided the most products currently out of stock, and show the names of their out-of-stock products.

SQLCoder (Failed ❌) #

This query incorrectly groups by productname and suppliername, which results in a count of out-of-stock products for each product name, rather than for each supplier. The GROUP BY should be primarily on supplierid or suppliername to find the supplier with the most out-of-stock products. The ORDER BY clause, in this case, does not correctly identify the supplier with the highest count of out-of-stock products.

SELECT p.productname,
       COUNT(*) AS out_of_stock_products_count,
       s.suppliername
FROM products p
JOIN suppliers s ON p.supplierid = s.supplierid
WHERE p.unitsinstock = 0
GROUP BY p.productname,
         s.suppliername
ORDER BY out_of_stock_products_count DESC
LIMIT 1;

NaturalSQL (Failed ❌) #

  • Incorrect Use of GROUP BY and ORDER BY: The query fails to properly aggregate the count of out-of-stock products per supplier and then order by this count in descending order, which is necessary to determine the supplier with the most out-of-stock products.
SELECT s.supplierid,
       s.companyname,
       p.productname
FROM suppliers s
JOIN products p ON s.supplierid = p.supplierid
WHERE p.unitsinstock = 0
GROUP BY s.supplierid,
         p.productname,
         s.companyname
ORDER BY COUNT(p.productid) DESC
LIMIT 1;

Question: Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered.

SQLCoder (Failed ❌) #

  • Fails to aggregate and compute total order value; lacks SUM or GROUP BY for calculating total order value per customer.
  • Incorrect date range: the query is fixed to the start of 2020 instead of focusing on 'last month', and incorrectly references an undefined total_order_value.
SELECT c.customerid,
       c.companyname,
       o.orderid,
       p.productname,
       o.quantityordered,
       to_char(o.orderdate, 'DD-MM-YYYY') AS order_date
FROM customers c
JOIN orders o ON c.customerid = o.customerid
JOIN orderdetails od ON o.orderid = od.orderid
JOIN products p ON od.productid = p.productid
WHERE to_date(to_char(o.orderdate, 'MM-DD-YYYY'), 'MM-DD-YYYY') BETWEEN to_date('01-01-2020', 'MM-DD-YYYY') AND CURRENT_DATE
ORDER BY total_order_value DESC
LIMIT 1;

NaturalSQL (Passed ✅) #

Passed.

  • Correctly calculates the total order value for each customer for the last month, identifying the customer with the highest value using an effective window function.
WITH LastMonthOrders AS
  (SELECT o.customerid,
          o.orderid,
          op.productid,
          op.quantity,
          p.productname,
          op.quantity * p.unitprice AS total_order_value
   FROM orders o
   JOIN order_details op ON o.orderid = op.orderid
   JOIN products p ON op.productid = p.productid
   WHERE o.orderdate >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month'
     AND o.orderdate < date_trunc('month', CURRENT_DATE) ),
     CustomerTotalOrderValue AS
  (SELECT customerid,
          SUM(total_order_value) AS total_value
   FROM LastMonthOrders
   GROUP BY customerid
   ORDER BY total_value DESC
   LIMIT 1)
SELECT c.customerid,
       c.companyname,
       lm.productid,
       lm.productname,
       lm.quantity
FROM CustomerTotalOrderValue ctov
JOIN LastMonthOrders lm ON ctov.customerid = lm.customerid
JOIN customers c ON c.customerid = lm.customerid;

Question: Who is the employee that handled the most orders last quarter, and what is the status of their most recent order?

SQLCoder (Failed ❌) #

  • The query fails to correctly link the most recent order's status to the top employee identified in employee_orders. The subquery in employee_recent_order does not necessarily return the most recent order related to the employee with the most orders in the last quarter.
  • Schema Mismatch: The query references o.orderstatus in the employee_recent_order CTE, but there's no orderstatus column in the orders table as per the provided schema. The correct column should be status.
WITH employee_orders AS
  (SELECT o.employeeid,
          COUNT(o.orderid) AS order_count
   FROM orders o
   JOIN employees e ON o.employeeid = e.employeeid
   WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 quarter') AND CURRENT_DATE
   GROUP BY o.employeeid),
     employee_recent_order AS
  (SELECT e.employeeid,
          e.firstname,
          e.lastname,
          o.orderstatus
   FROM employees e
   JOIN orders o ON e.employeeid = o.employeeid
   WHERE o.orderdate =
       (SELECT MAX(orderdate)
        FROM orders
        WHERE employeeid = e.employeeid))
SELECT eo.employeeid,
       eo.firstname,
       eo.lastname,
       er.orderstatus
FROM employee_orders eo
JOIN employee_recent_order er ON eo.employeeid = er.employeeid
ORDER BY eo.order_count DESC NULLS LAST
LIMIT 1;

NaturalSQL (Passed ✅) #

Passed.

  • This query correctly calculates the employee with the most orders last quarter using a well-structured CTE and appropriate date range.
  • Schema Alignment: It successfully retrieves the most recent order's status for the identified employee, correctly using the status column from the orders table, aligning with the provided schema.
WITH LastQuarterOrders AS
  (SELECT employeeid,
          COUNT(*) AS order_count,
          MAX(orderdate) AS most_recent_order_date
   FROM orders
   WHERE orderdate >= date_trunc('quarter', CURRENT_DATE) - INTERVAL '3 months'
     AND orderdate < date_trunc('quarter', CURRENT_DATE)
   GROUP BY employeeid
   ORDER BY order_count DESC
   LIMIT 1),
     MostRecentOrderStatus AS
  (SELECT o.employeeid,
          o.status
   FROM orders o
   JOIN LastQuarterOrders lqo ON o.employeeid = lqo.employeeid
   WHERE o.orderdate = lqo.most_recent_order_date )
SELECT e.firstname,
       e.lastname,
       mros.status AS most_recent_order_status
FROM employees e
JOIN LastQuarterOrders lqo ON e.employeeid = lqo.employeeid
JOIN MostRecentOrderStatus mros ON lqo.employeeid = mros.employeeid;

Results #

Model Correct Answers (Out of 5)
NaturalSQL 3
SQLCoder 0

NaturalSQL has superior results on complex questions with it's strong grasp of SQL subqueries and complex questions.

Through these tests, it shows, even with a small, but high quality dataset, LLMs can produce state of the art results. NaturalSQL seems to preform very well for its size (6.7B) compared to the leading models. NaturalSQL has had finetuning on:

  • Questions with business logic
  • Advanced Date Queries
  • Window Functions
  • Ratios / Aggregations

That all help it perform really well!

Here are the links to the notebook results and the shared ChatGPT link to verify 😄

Conclusion #

In summary, the benchmarking study demonstrates two key achievements of NaturalSQL-6.7B: first, it has superior performance over SQLCoder-7B in generating SQL for compound questions, and second, it has the ability to outperform both SQLCoder-7B and SQLCoder-15B in certain specific instances. This highlights the effectiveness of NaturalSQL-6.7B, particularly in handling complex queries.

The next iteration will have focused effort on improving short comings in date queries and improve on compound questions.