natsql's Introduction

NatSQL

This repository contains the code for the Findings of EMNLP 2021 paper "Natural SQL: Making SQL Easier to Infer from Natural Language Specifications".

If you use NatSQL in your work, please cite it as follows:

@inproceedings{gan-etal-2021-natural-sql,
    title = "Natural {SQL}: Making {SQL} Easier to Infer from Natural Language Specifications",
    author = "Gan, Yujian  and
      Chen, Xinyun  and
      Xie, Jinxia  and
      Purver, Matthew  and
      Woodward, John R.  and
      Drake, John  and
      Zhang, Qiaofu",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.174",
    doi = "10.18653/v1/2021.findings-emnlp.174",
    pages = "2030--2042",
}

Environment Setup

Install the Python dependencies with pip install -r requirements.txt.

Usage

Step 1: Download the Spider dataset

Download the Spider dataset, making sure to get the 06/07/2020 version or newer. Unpack it somewhere outside this project, then put train_spider.json, dev.json, tables.json and the database folder under the ./data/ directory.
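Before running the preprocessing script, it can help to confirm that the files are where the script expects them. The following is a minimal sketch, not part of this repository: the file and folder names come from the step above, while the helper missing_spider_items is a hypothetical name introduced only for illustration.

```python
from pathlib import Path

# Names taken from the step above; the helper itself is hypothetical.
REQUIRED_FILES = ("train_spider.json", "dev.json", "tables.json")
REQUIRED_DIRS = ("database",)


def missing_spider_items(data_dir="./data"):
    """Return the required files/folders that are absent from data_dir."""
    root = Path(data_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling missing_spider_items() from the project root returns an empty list when everything is in place, and otherwise lists what still needs to be copied into ./data/.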

Run check_and_preprocess.sh to check and preprocess the dataset. Under the ./NatSQLv1_6/ directory, it generates (1) train_spider.json and dev.json with NatSQLG, and (2) the preprocessed tables.json and tables_for_natsql.json.

Step 2: Convert NatSQL to SQL

Run natsql2sql.sh [train/dev] [natsql/natsqlg] to convert NatSQL to SQL. The resulting SQL queries are written to results.sql. The evaluation results differ from those in the paper because we have since updated NatSQL, which improves the performance of NatSQLG but decreases the performance of NatSQL.

Evaluation results of converting gold NatSQL with values into SQL:
          Train         Train             Dev           Dev
          Exact Match   Execution Match   Exact Match   Execution Match
NatSQLG   96.6%         95.7%             97.3%         96.8%
NatSQL    92.9%         93.8%             92.7%         93.4%
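For readers unfamiliar with the execution match column, the sketch below gives a rough idea of what such a check does: run the gold and the converted query against the same database and compare the rows they return. It is an illustration under stated assumptions, not the evaluation code behind the numbers above. The function name execution_match and the toy singer table are hypothetical, and row order is ignored for simplicity.

```python
import sqlite3
from collections import Counter


def execution_match(conn, gold_sql, predicted_sql):
    """Return True when both queries yield the same multiset of rows.

    Row order is ignored, which is a simplification: queries with
    ORDER BY would need an order-sensitive comparison.
    """
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute counts as a mismatch.
        return False
    return Counter(gold_rows) == Counter(predicted_rows)


# Toy in-memory database (hypothetical, not Spider data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 41)])

# Two differently written queries that return the same rows.
print(execution_match(
    conn,
    "SELECT name FROM singer WHERE age > 35",
    "SELECT name FROM singer WHERE age >= 41",
))  # prints True
```

Exact match, by contrast, compares the structure of the two queries rather than their results, so the two queries in the usage example would count as an execution match without being an exact match.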

Step 3: Convert NatSQL without Values to Executable SQL

To generate executable SQL, you need to identify the possible values in the question in advance so that they can be copied into the SQL. The preprocessing code that finds these values is very complicated, and it could be implemented in a simpler way. However, because this process is tightly coupled with our other work, we cannot provide a more straightforward implementation. As one example of its effect, this preprocessing code copies values into the SQL and slightly improves the exact match accuracy.
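To make the idea of finding possible values concrete, here is a deliberately simple sketch that pulls double-quoted strings and standalone numbers out of a question. It is an assumption-laden illustration, not the preprocessing used in this repository, which handles far more cases. The function name candidate_values and the example question are hypothetical.

```python
import re

# Double-quoted spans, e.g. "Joe Sharp".
_QUOTED = re.compile(r'"([^"]+)"')
# Standalone integers or decimals, not glued to letters or other digits.
_NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?!\w)")


def candidate_values(question):
    """Return quoted strings first, then numbers, each in order of appearance."""
    quoted = _QUOTED.findall(question)
    # Blank out quoted spans so numbers inside quotes are not reported twice.
    remainder = _QUOTED.sub(" ", question)
    numbers = _NUMBER.findall(remainder)
    return quoted + numbers


question = 'List "Joe Sharp" songs from 2014 rated above 3.5.'
print(candidate_values(question))  # prints ['Joe Sharp', '2014', '3.5']
```

A heuristic like this misses values that are neither quoted nor numeric, such as names written without quotes, which is one reason a real implementation needs to consult the database contents.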

Run natsql2sql_without_values.sh [train/dev] [natsql/natsqlg] to convert the NatSQL without values to executable SQL. You should get the SQL queries in results.sql.

Evaluation results of converting gold NatSQL without values into executable SQL:
          Train         Train             Dev           Dev
          Exact Match   Execution Match   Exact Match   Execution Match
NatSQLG   96.5%         94.8%             97.7%         96.6%
NatSQL    92.9%         92.9%             93.8%         92.8%

NatSQL V1.6.1

The NatSQL version introduced in our NatSQL paper is V1.6. Version V1.6.1 was used in Spider-SS. It extends the set operators for NatSQL and corrects some annotation errors in the original Spider dataset. As a result, the exact match and execution match accuracies in ./NatSQLv1_6_1 are significantly lower than those in ./NatSQLv1_6. This version is not intended for chasing the Spider leaderboard; it aims to give NatSQL queries that are closer to the natural language question.

About SQL2NatSQL

We have not completed the SQL2NatSQL conversion code at present. We welcome contributions to NatSQL.

Acknowledgement

The ./data/20k-original.pkl and ./data/20k.pkl files are extracted from google-10000-english, which is under the LDC license.

The ./data/conceptnet.pkl file is extracted from conceptnet5, which is under the Creative Commons Attribution Share-Alike 4.0 license.

License

The code and NatSQL, except for the data in the ./data folder, are under the CC BY-SA 4.0 license.

natsql's People

Contributors

ygan

natsql's Issues

Question about NatSQL data generation

Hi. I'm interested in NatSQL, and I see that you provide NatSQL queries under NatSQLv1_6/.
Could you please explain how to produce them, or share the script for converting the original SQL queries into NatSQL?

Question about nat-sql+rat-sql

I am very interested in your excellent work and followed your instructions to reproduce the nat-sql+rat-sql experiment:

  1. Download the Spider dataset and run check_and_preprocess.sh to preprocess the dataset.
  2. Replace train_spider.json, dev.json and tables.json in Spider with the NatSQLG versions.
  3. Run the command "python run.py preprocess experiment_config_file" in rat-sql.

An error is reported at step 3:

  File "/home/xx/rat-sql/ratsql/grammars/spider.py", line 166, in parse_val
    raise ValueError(val)
  ValueError: 56

Could you share how to reproduce the whole nat-sql+rat-sql experiment? Thanks a lot!
Sincerely.

Environment Setup Problems

When attempting to install the dependencies through the requirements.txt, I encounter this error:

[Screenshot: NatSQL failure]

As a result, running check_and_preprocess.sh fails and tables.json isn't generated.

I attempted to install blis, nltk, and en_core_web_sm separately, but then ran into the following issue when running check_and_preprocess.sh:

[Screenshot: NatSQL failure_tuple concat]

How to generate NatSQL from Spider train_others.json?

I need to generate NatSQL from train_others.json in the Spider dataset, but I can't find any way to do so with this tool. Could you tell me how to generate NatSQL from train_others.json, so that I can use NatSQL in my research project?

Thanks a lot!

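I don't understand the overall flow of the algorithm

How does this algorithm run, and how does it convert a text description into NatSQL? The paper does not state this explicitly, and I don't quite understand it.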

Inner imports problem

It would probably be worth rewriting the package's internal imports in relative form.

For instance, in NatSQL/natsql2sql/natsql2sql.py:
from natsql2sql.preprocess.stemmer import MyStemmer
would become
from .preprocess.stemmer import MyStemmer

And in NatSQL/natsql2sql/preprocess/stemmer.py:
from natsql2sql.preprocess.match import ALL_JJS
would become
from .match import ALL_JJS

And so on.

As it stands, using this repo in Google Colab, for example, causes import errors.
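Until the imports are made relative, one possible workaround is to put the repository root on sys.path so that the absolute natsql2sql imports resolve. This is a sketch under assumptions, not something the repository documents: the helper add_repo_to_path and the /content/NatSQL location are hypothetical, and it presumes the layout quoted above, with the natsql2sql package directly under the repository root.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_root):
    """Put repo_root at the front of sys.path so its top-level packages can be imported."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Hypothetical location of the cloned repository; adjust to the real path.
# add_repo_to_path("/content/NatSQL")
# from natsql2sql.preprocess.stemmer import MyStemmer
```

This leaves the repository's source untouched, at the cost of modifying the interpreter's import path for the current session.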

Applying NatSQL on BIRD-SQL

I am attempting to apply NatSQL on the BIRD-SQL dataset (https://bird-bench.github.io/), following the provided steps. However, I've encountered an issue during the preprocessing stage.

  1. Dataset Download: I downloaded the required datasets and placed the files in the ./data/ directory.
  2. Preprocessing: I ran check_and_preprocess.sh to check and preprocess the dataset. This script successfully generated tables_bird.json and tables_for_natsql_bird.json.

Issue: The check_and_preprocess.sh script calls generate_spider_examples_with_natsql.py, which requires dev-natsql.json and train_bird-natsql.json files. However, I could not find these files, nor could I find instructions on how to generate them.

Could you provide guidance on how to generate dev-natsql.json and train_bird-natsql.json?

Thank you in advance for your assistance!

SQL2NatSQL code

Would you be willing to release the SQL2NatSQL code? I'm looking forward to it. Thank you.
