ruckbreasoning / resdsql

The Pytorch implementation of RESDSQL (AAAI 2023).

Home Page: https://arxiv.org/abs/2302.05965

License: MIT License

Shell 6.33% Python 93.67%

resdsql's Introduction

RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL

This is the official implementation of the paper "RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL" (AAAI 2023).

If this repository helps you, please cite the following paper:

@inproceedings{li2022resdsql,
  author = {Haoyang Li and Jing Zhang and Cuiping Li and Hong Chen},
  title = "RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL",
  booktitle = "AAAI",
  year = "2023"
}

Update (2023.3.13): We evaluated our method on a diagnostic evaluation benchmark, Dr.Spider, which contains 17 test sets to measure the robustness of Text-to-SQL parsers under different perturbation perspectives.

Update (2023.5.19): We added support for CSpider, a Chinese Text-to-SQL benchmark with Chinese questions, English database schema, and corresponding SQL queries.

Update (2023.11.1): We are excited to present our text-to-SQL demo, available at https://github.com/RUCKBReasoning/text2sql-demo. This demo showcases the capabilities of our newly developed pre-trained language model CodeS, which has been specifically tailored for text-to-SQL tasks. Additionally, we have included comprehensive instructions on how to build the demo using your databases. We encourage you to experiment with it and explore its features! 🔥

Update (2024.4.12): We are thrilled to announce that our latest work, CodeS, has been accepted by SIGMOD 2024. CodeS represents a significant advancement over RESDSQL, incorporating a more powerful language model. We also re-developed the schema filter to make it easy to use. For an in-depth look at CodeS, please consult our paper available at CodeS-paper. Additionally, we have made the source code publicly available at CodeS-code for community use and feedback.

Update (2024.4.19): We are excited to announce the release of our newly developed schema filter, boasting 3 billion parameters and offering bilingual support for both Chinese and English. This tool is now available as an independent component and can be accessed at text2sql-schema-filter. If you're looking to enhance your text-to-SQL system with a schema filter, we encourage you to give it a try.

Overview

We introduce a new Text-to-SQL parser, RESDSQL (Ranking-enhanced Encoding plus a Skeleton-aware Decoding framework for Text-to-SQL), which attempts to decouple schema linking from skeleton parsing to reduce the difficulty of Text-to-SQL. More details can be found in our paper. All experiments are conducted on a single NVIDIA A100 80G GPU.

Evaluation Results

We evaluate RESDSQL on six benchmarks: Spider, Spider-DK, Spider-Syn, Spider-Realistic, Dr.Spider, and CSpider. We adopt two metrics: Exact-set-Match accuracy (EM) and EXecution accuracy (EX). Let's look at the following numbers:

On Spider:

Model	Dev EM	Dev EX	Test EM	Test EX
RESDSQL-3B+NatSQL	80.5%	84.1%	72.0%	79.9%
RESDSQL-3B	78.0%	81.8%	-	-
RESDSQL-Large+NatSQL	76.7%	81.9%	-	-
RESDSQL-Large	75.8%	80.1%	-	-
RESDSQL-Base+NatSQL	74.1%	80.2%	-	-
RESDSQL-Base	71.7%	77.9%	-	-

On Spider-DK, Spider-Syn, and Spider-Realistic:

Model	DK EM	DK EX	Syn EM	Syn EX	Realistic EM	Realistic EX
RESDSQL-3B+NatSQL	53.3%	66.0%	69.1%	76.9%	77.4%	81.9%

On Dr.Spider's perturbation sets: Following Dr.Spider, we only report EX for each post-perturbation set and choose PICARD and CodeX as our baseline methods.

Perturbation set	PICARD	CodeX	RESDSQL-3B	RESDSQL-3B+NatSQL
DB-Schema-synonym	56.5%	62.0%	63.3%	68.3%
DB-Schema-abbreviation	64.7%	68.6%	64.5%	70.0%
DB-DBcontent-equivalence	43.7%	51.6%	40.3%	40.1%
NLQ-Keyword-synonym	66.3%	55.5%	67.5%	72.4%
NLQ-Keyword-carrier	82.7%	85.2%	86.7%	83.5%
NLQ-Column-synonym	57.2%	54.7%	57.4%	63.1%
NLQ-Column-carrier	64.9%	51.1%	69.9%	63.9%
NLQ-Column-attribute	56.3%	46.2%	58.8%	71.4%
NLQ-Column-value	69.4%	71.4%	73.4%	76.6%
NLQ-Value-synonym	53.0%	59.9%	53.8%	53.2%
NLQ-Multitype	57.1%	53.7%	60.1%	60.7%
NLQ-Others	78.3%	69.7%	77.3%	79.0%
SQL-Comparison	68.0%	66.9%	70.2%	82.0%
SQL-Sort-order	74.5%	57.8%	79.7%	85.4%
SQL-NonDB-number	77.1%	89.3%	83.2%	85.5%
SQL-DB-text	65.1%	72.4%	67.8%	74.3%
SQL-DB-number	85.1%	79.3%	85.4%	88.8%
Average	65.9%	64.4%	68.2%	71.7%
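As a quick sanity check, the Average row appears to be the unweighted mean over the 17 perturbation sets. The sketch below recomputes it for the RESDSQL-3B+NatSQL column from the values in the table above. The variable and function names are ours, and equal weighting of the sets is inferred from the numbers rather than stated in the source.

```python
# EX (%) of RESDSQL-3B+NatSQL on the 17 Dr.Spider perturbation sets,
# copied from the table above in row order.
natsql_ex = [
    68.3, 70.0, 40.1,                                       # DB perturbations
    72.4, 83.5, 63.1, 63.9, 71.4, 76.6, 53.2, 60.7, 79.0,   # NLQ perturbations
    82.0, 85.4, 85.5, 74.3, 88.8,                           # SQL perturbations
]


def macro_average(scores):
    """Unweighted mean: every perturbation set counts equally."""
    return sum(scores) / len(scores)


average = round(macro_average(natsql_ex), 1)
print(average)
```

The printed value can be compared directly with the last cell of the Average row.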

Notice: We also employed the modified test suite script (see this issue) to evaluate the model-generated results, but obtained the same numbers as above. Nevertheless, we suggest that further work should use their modified script to evaluate Dr.Spider.

On CSpider's development set:

Model	EM	EXEC
RESDSQL-3B+NatSQL	66.3%	81.1%
RESDSQL-Large+NatSQL	64.3%	81.1%
LGESQL + GTL + Electra + QT	64.0%	-
LGESQL + ELECTRA + QT	64.5%	-
RESDSQL-Base+NatSQL	61.7%	78.1%
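To make the EX metric used in the tables above more concrete: EX compares the results of executing the predicted and gold queries rather than their text. The following is a minimal sketch of that idea only, using Python's built-in sqlite3 on a toy in-memory database. It is not the official evaluation script; the function name `execution_match`, the toy schema, and the order-insensitive comparison are simplifying assumptions of this example.

```python
import sqlite3
from collections import Counter


def execution_match(conn, pred_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows.

    Simplification: row order is ignored, and a prediction that fails
    to execute counts as a mismatch.
    """
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 52), ("Cy", 67)])

gold = "SELECT name FROM singer WHERE age > 50"
pred_ok = "SELECT name FROM singer WHERE NOT age <= 50"  # different text, same rows
pred_bad = "SELECT name FROM singer WHERE age > 60"      # returns fewer rows

print(execution_match(conn, pred_ok, gold))
print(execution_match(conn, pred_bad, gold))
```

The first prediction differs textually from the gold query but returns the same rows, which is the kind of case where an execution-based metric and a string-level comparison can disagree.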

Prerequisites

Create a virtual anaconda environment:

conda create -n your_env_name python=3.8.5

Activate it and install the CUDA version of PyTorch:

conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

Install other required modules and tools:

pip install -r requirements.txt
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
python nltk_downloader.py

Create several folders:

mkdir eval_results
mkdir models
mkdir tensorboard_log
mkdir third_party
mkdir predictions

Clone evaluation scripts:

cd third_party
git clone https://github.com/ElementAI/spider.git
git clone https://github.com/ElementAI/test-suite-sql-eval.git
mv ./test-suite-sql-eval ./test_suite
cd ..

Prepare data

Download data (including Spider, Spider-DK, Spider-Syn, Spider-Realistic, Dr.Spider, and CSpider) and database and then unzip them:

unzip data.zip
unzip database.zip

Notice: Dr.Spider has been preprocessed following the instructions on its Github page.

Inference

All evaluation results can be easily reproduced through our released scripts and checkpoints.

Step1: Prepare Checkpoints

Because RESDSQL is a two-stage algorithm, you should first download the cross-encoder checkpoints. Here are the links:

Cross-encoder Checkpoints	Google Drive	Baidu Netdisk
text2natsql_schema_item_classifier	Link	Link (pwd: 18w8)
text2sql_schema_item_classifier	Link	Link (pwd: dr62)
xlm_roberta_text2natsql_schema_item_classifier (trained on CSpider)	-	Link (pwd: 3sdu)

Then, you should download T5 (for Spider) or mT5 (for CSpider) checkpoints:

T5/mT5 Checkpoints	Google Drive/OneDrive	Baidu Netdisk
text2natsql-t5-3b	OneDrive link	Link (pwd: 4r98)
text2sql-t5-3b	Google Drive link	Link (pwd: sc62)
text2natsql-t5-large	Google Drive link	Link (pwd: 7iyq)
text2sql-t5-large	Google Drive link	Link (pwd: q58k)
text2natsql-t5-base	Google Drive link	Link (pwd: pyxf)
text2sql-t5-base	Google Drive link	Link (pwd: wuek)
text2natsql-mt5-xl-cspider (trained on CSpider)	-	Link (pwd: y7ei)
text2natsql-mt5-large-cspider (trained on CSpider)	-	Link (pwd: ydqk)
text2natsql-mt5-base-cspider (trained on CSpider)	-	Link (pwd: d8b8)

The checkpoints should be placed in the models folder.
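Before running inference, it can help to confirm that both stages' checkpoints are in place. The sketch below is a hypothetical helper, not part of the repository: it assumes the downloaded folders keep the names listed in the tables above and sit directly under the models folder, and it covers only the Spider (T5) checkpoints, not the CSpider ones.

```python
import os
import tempfile


def missing_checkpoints(models_dir, scale="3b", natsql=True):
    """List the checkpoint folders that are expected but absent.

    Folder names follow the tables above; that they sit directly under
    `models_dir` is an assumption of this sketch.
    """
    prefix = "text2natsql" if natsql else "text2sql"
    expected = [
        f"{prefix}_schema_item_classifier",  # stage 1: cross-encoder
        f"{prefix}-t5-{scale}",              # stage 2: T5
    ]
    return [name for name in expected
            if not os.path.isdir(os.path.join(models_dir, name))]


# Demo on a throwaway directory that only contains the cross-encoder.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "text2natsql_schema_item_classifier"))
    print(missing_checkpoints(demo_dir, scale="3b"))
```

In a real checkout, one would call `missing_checkpoints("models", scale="base")` (or another scale) from the repository root and download whatever the returned list names.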

For CSpider, we only provide the NatSQL version because its performance is better than SQL in our pre-experiments. To support CSpider, we replace roberta-large with xlm-roberta-large in the first stage and replace t5 with mt5 in the second stage.

Step2: Run Inference

The inference scripts are located in scripts/inference. Concretely, infer_text2natsql.sh is the inference script of RESDSQL-{Base, Large, 3B}+NatSQL, and infer_text2sql.sh is the inference script of RESDSQL-{Base, Large, 3B}. For example, you can run the inference of RESDSQL-3B+NatSQL on Spider's dev set via:

sh scripts/inference/infer_text2natsql.sh 3b spider

The first argument (model scale) can be selected from [base, large, 3b] and the second argument (dataset name) can be selected from [spider, spider-realistic, spider-syn, spider-dk, DB_schema_synonym, DB_schema_abbreviation, DB_DBcontent_equivalence, NLQ_keyword_synonym, NLQ_keyword_carrier, NLQ_column_synonym, NLQ_column_carrier, NLQ_column_attribute, NLQ_column_value, NLQ_value_synonym, NLQ_multitype, NLQ_others, SQL_comparison, SQL_sort_order, SQL_NonDB_number, SQL_DB_text, SQL_DB_number].

The predicted SQL queries are recorded in predictions/{dataset_name}/{model_name}/pred.sql.
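When looping over many datasets, a small wrapper that validates the two arguments before calling the script can catch typos early. The following is a sketch under stated assumptions: the function `build_inference_command` and the constant names are hypothetical, the allowed values are copied from the lists above, and it assumes infer_text2sql.sh accepts the same two arguments as infer_text2natsql.sh.

```python
MODEL_SCALES = ["base", "large", "3b"]
SPIDER_SETS = ["spider", "spider-realistic", "spider-syn", "spider-dk"]
DR_SPIDER_SETS = [
    "DB_schema_synonym", "DB_schema_abbreviation", "DB_DBcontent_equivalence",
    "NLQ_keyword_synonym", "NLQ_keyword_carrier", "NLQ_column_synonym",
    "NLQ_column_carrier", "NLQ_column_attribute", "NLQ_column_value",
    "NLQ_value_synonym", "NLQ_multitype", "NLQ_others",
    "SQL_comparison", "SQL_sort_order", "SQL_NonDB_number",
    "SQL_DB_text", "SQL_DB_number",
]


def build_inference_command(scale, dataset, natsql=True):
    """Validate the two arguments and return the argv list for the script."""
    if scale not in MODEL_SCALES:
        raise ValueError(f"unknown model scale: {scale!r}")
    if dataset not in SPIDER_SETS + DR_SPIDER_SETS:
        raise ValueError(f"unknown dataset name: {dataset!r}")
    script = "infer_text2natsql.sh" if natsql else "infer_text2sql.sh"
    return ["sh", f"scripts/inference/{script}", scale, dataset]


cmd = build_inference_command("3b", "spider")
print(" ".join(cmd))
# To launch it from the repository root: subprocess.run(cmd, check=True)
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.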

Inference on CSpider's Dev Set (New Feature)

We also provide inference scripts to run RESDSQL-{Base, Large, 3B}+NatSQL on CSpider's development set. Here is an example:

sh scripts/inference/infer_text2natsql_cspider.sh 3b

The first argument (model scale) can be selected from [base, large, 3b].

Training on Spider

We provide scripts in scripts/train/text2natsql and scripts/train/text2sql to train RESDSQL on Spider's training set and evaluate on Spider's dev set.

RESDSQL-{Base, Large, 3B}+NatSQL

# Step1: preprocess dataset
sh scripts/train/text2natsql/preprocess.sh
# Step2: train cross-encoder
sh scripts/train/text2natsql/train_text2natsql_schema_item_classifier.sh
# Step3: prepare text-to-natsql training and development set for T5
sh scripts/train/text2natsql/generate_text2natsql_dataset.sh
# Step4: fine-tune T5-3B (RESDSQL-3B+NatSQL)
sh scripts/train/text2natsql/train_text2natsql_t5_3b.sh
# Step4: (or) fine-tune T5-Large (RESDSQL-Large+NatSQL)
sh scripts/train/text2natsql/train_text2natsql_t5_large.sh
# Step4: (or) fine-tune T5-Base (RESDSQL-Base+NatSQL)
sh scripts/train/text2natsql/train_text2natsql_t5_base.sh

RESDSQL-{Base, Large, 3B}

# Step1: preprocess dataset
sh scripts/train/text2sql/preprocess.sh
# Step2: train cross-encoder
sh scripts/train/text2sql/train_text2sql_schema_item_classifier.sh
# Step3: prepare text-to-sql training and development set for T5
sh scripts/train/text2sql/generate_text2sql_dataset.sh
# Step4: fine-tune T5-3B (RESDSQL-3B)
sh scripts/train/text2sql/train_text2sql_t5_3b.sh
# Step4: (or) fine-tune T5-Large (RESDSQL-Large)
sh scripts/train/text2sql/train_text2sql_t5_large.sh
# Step4: (or) fine-tune T5-Base (RESDSQL-Base)
sh scripts/train/text2sql/train_text2sql_t5_base.sh

During training, the cross-encoder (i.e., the first stage) always keeps the best checkpoint, but T5 (i.e., the second stage) keeps all the intermediate checkpoints, because different test sets may achieve the best Text-to-SQL performance on different checkpoints. Therefore, given a test set, we need to evaluate all the intermediate checkpoints and compare their performance to find the best checkpoint. The evaluation results of checkpoints are saved in eval_results.
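The checkpoint selection described above amounts to taking the maximum over per-checkpoint scores. Here is a minimal sketch, assuming the scores have already been collected into a dictionary; the actual file format under eval_results is not described here, so the data structure, the function name `best_checkpoint`, and the numbers are all hypothetical.

```python
def best_checkpoint(results, metric="ex"):
    """Return the checkpoint name with the highest score for `metric`.

    `results` maps a checkpoint name to a dict of metric values.
    Ties are broken in favour of the checkpoint listed first.
    """
    return max(results, key=lambda name: results[name][metric])


# Hypothetical scores, invented purely for illustration.
results = {
    "checkpoint-1000": {"em": 0.70, "ex": 0.76},
    "checkpoint-2000": {"em": 0.74, "ex": 0.79},
    "checkpoint-3000": {"em": 0.75, "ex": 0.78},
}

print(best_checkpoint(results, metric="ex"))
print(best_checkpoint(results, metric="em"))
```

As the example shows, the best checkpoint can differ depending on the metric (and, per the text above, on the test set), which is why all intermediate checkpoints are kept.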

Our paper also reports the performance of RESDSQL-3B+NatSQL (the most powerful version of RESDSQL) on Spider-DK, Spider-Syn, and Spider-Realistic. To obtain results on these datasets, we provide evaluation scripts in scripts/evaluate_robustness. Here is an example for Spider-DK:

# Step1: preprocess Spider-DK
sh scripts/evaluate_robustness/preprocess_spider_dk.sh
# Step2: Run evaluation on Spider-DK
sh scripts/evaluate_robustness/evaluate_on_spider_dk.sh

Training on CSpider

We additionally provide scripts in scripts/train/cspider_text2natsql and scripts/train/cspider_text2sql to train RESDSQL on CSpider's training set and evaluate on CSpider's dev set.

RESDSQL-{Base, Large, 3B}+NatSQL (CSpider version)

# Step1: preprocess CSpider
sh scripts/train/cspider_text2natsql/preprocess.sh
# Step2: train cross-encoder
sh scripts/train/cspider_text2natsql/train_text2natsql_schema_item_classifier.sh
# Step3: prepare text-to-natsql training and development set for mT5
sh scripts/train/cspider_text2natsql/generate_text2natsql_dataset.sh
# Step4: fine-tune mT5-XL (RESDSQL-3B+NatSQL)
sh scripts/train/cspider_text2natsql/train_text2natsql_mt5_xl.sh
# Step4: (or) fine-tune mT5-Large (RESDSQL-Large+NatSQL)
sh scripts/train/cspider_text2natsql/train_text2natsql_mt5_large.sh
# Step4: (or) fine-tune mT5-Base (RESDSQL-Base+NatSQL)
sh scripts/train/cspider_text2natsql/train_text2natsql_mt5_base.sh

In order to train the NatSQL version on CSpider, we manually aligned and modified annotations of NatSQL. The aligned files are also released, see NatSQL/NatSQLv1_6/train_cspider-natsql.json and NatSQL/NatSQLv1_6/dev_cspider-natsql.json.

RESDSQL-{Base, Large, 3B} (CSpider version)

# Step1: preprocess CSpider
sh scripts/train/cspider_text2sql/preprocess.sh
# Step2: train cross-encoder
sh scripts/train/cspider_text2sql/train_text2sql_schema_item_classifier.sh
# Step3: prepare text-to-sql training and development set for mT5
sh scripts/train/cspider_text2sql/generate_text2sql_dataset.sh
# Step4: fine-tune mT5-XL (RESDSQL-3B)
sh scripts/train/cspider_text2sql/train_text2sql_mt5_xl.sh
# Step4: (or) fine-tune mT5-Large (RESDSQL-Large)
sh scripts/train/cspider_text2sql/train_text2sql_mt5_large.sh
# Step4: (or) fine-tune mT5-Base (RESDSQL-Base)
sh scripts/train/cspider_text2sql/train_text2sql_mt5_base.sh

Acknowledgements

We would like to thank Hongjin Su and Tao Yu for their help in evaluating our method on Spider's test set. We would also like to thank PICARD (paper, code), NatSQL (paper, code), Spider (paper, dataset), Spider-DK (paper, dataset), Spider-Syn (paper, dataset), Spider-Realistic (paper, dataset), Dr.Spider (paper, dataset), and CSpider (paper, dataset) for their interesting work and open-sourced code and datasets.

resdsql's Issues

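Can it handle multi-table joins?

Can the model handle join queries over multiple tables?

If I want to prepare my own dataset for training or testing, are there any format requirements?

If I want to prepare my own dataset, are there any format requirements?

How are the values in a SQL query determined?

Taking Spider and CSpider as examples, a generated SQL query contains the SQL syntax (skeleton), column information, and value information. For example, in "count the male population older than 50", 50 is a value.
My question is: how are values determined? Specifically, is the ability to tell which parts of the natural-language question are values (e.g., recognizing that 50 is a value in "the male population older than 50") learned by the model through training, or have Spider and CSpider already marked which parts are values? If it is learned through training, could you briefly describe the approach?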

Dev.json file

Hi,
I want to train the model using my own dataset and I saw in another thread that the dev.json file is required for this. Could you elaborate on how the dev.json file should be formatted, given some query and a database schema?

Best,
Adam

How I can do inference on the model only with a question?

Hi, I would like to know whether it is possible to use the model to get the SQL statement directly after giving it a question in natural language. For example, if I use the attached dev.json (a modified version of the one from the Spider dataset), I get no result in pred.sql.
Thank you for your help.

Inference Run Killed

When I run inference, the process is killed, but I don't know why.

What stops the process, and what can I do to fix it?

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

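Why are only the first four labels taken each time (in the prepare_batch_inputs_and_labels method)? For example, my tables.json defines ten tables in total; after processing, the labels of the first four are all 0 and only the last six have label 1. After the loop, only the first four labels are appended to the returned list each time, so they are all 0 and this error is raised. After I reordered the tables in tables.json it worked, but later, when evaluating schema_item_classifier, the same error appeared again. Is this a class-imbalance problem? Or should tables.json contain only the tables used by the SQL queries for the questions in train.json, leaving out unused tables? Has anyone run into this? I am using my own dataset, and the structure of the training set already matches the required format.

How large is the performance gap on CSpider with and without NatSQL?

On CSpider, how many points worse is the mT5-base model without NatSQL than the version with NatSQL?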

How can we optimize model inference time? A single NLQ takes more than 45 seconds.

Hello everyone,

I hope you're doing well. I encountered an issue while using a fine-tuned RESDSQL to predict SQL on my (Spider-like) dataset: inference takes around one minute. While profiling the steps, I found that schema_item_classifier.py and text2sql.py take the majority of the time. I would greatly appreciate any suggestions or insights on reducing the prediction time.
Thank you in advance for your assistance!

How can I finetune on CSpider

Can't find the file nltk_downloader.py

Thanks for your nice work! I can't find the file nltk_downloader.py that is mentioned in readme.md. Could you please provide it? Thank you.

Obtaining query_toks_no_value from query

I'm attempting to try training on another dataset by appending the dataset's train.json, dev.json, and tables.json to Spider's (and adding the database into Spider's too) with RESDSQL. I'm a bit stumped on how to generate the query_toks_no_value from a query. Is there a script for this, or do you have any advice on how to make one?

Requirements

When I install the packages from the requirements file, I get this error:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for jarowinkler
Failed to build rapidfuzz tokenizers jarowinkler
ERROR: Could not build wheels for rapidfuzz, tokenizers, jarowinkler, which is required to install pyproject.toml-based projects

Can you please tell me how to install these packages from the requirements list?

Low accuracy in predicting SQL using RESDSQL on my dataset

Hello everyone,

I hope you're doing well. I encountered an issue while using RESDSQL for predicting SQL on my dataset. Despite following all the recommended steps, I'm observing an accuracy range of only 30-40%. I would greatly appreciate any suggestions or insights on increasing the predictions' accuracy.

Thank you in advance for your assistance!

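How do I add a new table to the test set, and does additional table information also need to be added?

inference scripts error

Hello, I ran into some problems when trying to run inference with the model:
The model I used is RESDSQL-base. After completing the preparation steps, I ran sh scripts/inference/infer_text2sql.sh base spider for inference and got the following error: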

Traceback (most recent call last):
File "schema_item_classifier.py", line 463, in
total_table_pred_probs, total_column_pred_probs = _test(opt)
File "schema_item_classifier.py", line 428, in _test
batch_column_number_in_each_table
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 191, in forward
batch_column_number_in_each_table
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 134, in table_column_cls
output_t, (hidden_state_t, cell_state_t) = self.table_name_bilstm(table_name_embeddings)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 689, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 632, in check_forward_args
self.check_input(input, batch_sizes)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 203, in check_input
expected_input_dim, input.dim()))
RuntimeError: input must have 3 dimensions, got 2

So I added the following at line 134 of ./utils/classifier_model.py:

print(table_name_embeddings.size(),table_name_embeddings)
table_name_embeddings = table_name_embeddings.unsqueeze(0)
print(table_name_embeddings.size(),table_name_embeddings)

This then produced the following error:

torch.Size([1, 1024]) tensor([[-0.3795, -0.9529, 0.9007, ..., -0.6501, -2.1801, 0.9587]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.3795, -0.9529, 0.9007, ..., -0.6501, -2.1801, 0.9587]]],
device='cuda:0')
torch.Size([1, 1024]) tensor([[-0.7597, -0.5682, -0.4270, ..., 0.3219, 1.5417, 0.3518]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.7597, -0.5682, -0.4270, ..., 0.3219, 1.5417, 0.3518]]],
device='cuda:0')
torch.Size([1, 1024]) tensor([[-0.4921, -1.1286, 0.9307, ..., -0.5373, -2.0887, 0.9216]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.4921, -1.1286, 0.9307, ..., -0.5373, -2.0887, 0.9216]]],
device='cuda:0')
torch.Size([3, 1024]) tensor([[-0.5896, -1.3575, 1.1120, ..., -0.6104, -1.9414, 0.6679],
[-0.6831, -1.3711, 1.1447, ..., -0.5117, -2.0709, 0.8956],
[-0.6337, -1.3548, 1.2228, ..., -0.4896, -2.0505, 0.8417]],
device='cuda:0')
torch.Size([1, 3, 1024]) tensor([[[-0.5896, -1.3575, 1.1120, ..., -0.6104, -1.9414, 0.6679],
[-0.6831, -1.3711, 1.1447, ..., -0.5117, -2.0709, 0.8956],
[-0.6337, -1.3548, 1.2228, ..., -0.4896, -2.0505, 0.8417]]],
device='cuda:0')
0%| | 0/33 [00:01<?, ?it/s]
Traceback (most recent call last):
File "schema_item_classifier.py", line 462, in
total_table_pred_probs, total_column_pred_probs = _test(opt)
File "schema_item_classifier.py", line 427, in _test
batch_column_number_in_each_table
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 194, in forward
batch_column_number_in_each_table
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 138, in table_column_cls
table_name_embedding = hidden_state_t[-2:, :].view(1, 1024)
RuntimeError: shape '[1, 1024]' is invalid for input of size 3072

I need some help fixing these errors. Thanks 🙏

Timestamp Functionality

Hey, I would like to know how to train the model so that it supports timestamp functionality, as in the Druid SQL interface, for example. Is there a way to add timestamp functionality as well?

Dataset used for finetuning mt5 model

Hi
First of all, thank you for your great work on this project. You've achieved some of the best results on the Spider benchmark, and your clear and complete README allowed me to run your code very easily.

I want to see if I can fine-tune a text2natsql model on mT5 as you did for CSpider. I was wondering how much data I would have to create, as I want to build a dataset like CSpider but in the Persian language.

Was CSpider the only dataset used to fine-tune the mT5 backbone, or were other datasets also used?

How can the XLM-ROBERTA-LARGE classification model be run on multiple GPUs?

I tried running it on multiple GPUs with:
model = nn.DataParallel(model, device_ids=devices)
model.to(device)
but it fails with the following error:
Traceback (most recent call last):
File "schema_item_classifier_gpus.py", line 470, in
_train(opt)
File "schema_item_classifier_gpus.py", line 287, in _train
loss = encoder_loss_func.compute_loss(
File "/workspace/RESDSQL/utils/classifier_loss.py", line 60, in compute_loss
table_loss = self.compute_batch_loss(batch_table_name_cls_logits, batch_table_labels, batch_size)
File "/workspace/RESDSQL/utils/classifier_loss.py", line 47, in compute_batch_loss
loss += self.focal_loss(logits, labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/RESDSQL/utils/classifier_loss.py", line 16, in forward
assert input_tensor.shape[0] == target_tensor.shape[0]
Is multi-GPU execution impossible because of how this classification model is structured?

How can I expose it as an API service?

Both steps should be part of the service.

Checkpoint download failed

Hello, how do I download the T5 checkpoints? The two Google Drive links in the table download two folders:
text2sql_schema_item_classifier
text2natsql_schema_item_classifier

I don't see directories such as text2natsql-t5-3b or text2natsql-t5-base. Where do these come from? Did I download the wrong thing?

Is there a distributed version of the code? I'd like to reproduce the effect of t5

If there is a distributed version available, please kindly inform us on how to use it, as not all research centers have the same resources as Renmin University. Thank you.

The SQL skeleton is too easy an objective

I trained a seq-to-seq model on Spider for 10 epochs. When the target is just the original SQL, the result is about 60%+. But when switching to skeleton+SQL, the performance is very poor.

After a manual check, I found that the model's inference output contains only the skeleton and no SQL at all. Have you ever encountered this problem?

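How can I better understand the cross-encoder proposed in the paper?

What post-processing is done during decoding?

I would like to know what post-processing is applied during decoding. Are there concrete steps?
pred_natsql = fix_fatal_errors_in_natsql(pred_natsql, batch_tc_original[batch_id])
if old_pred_natsql != pred_natsql:
    print("Before fix:", old_pred_natsql)
    print("After fix:", pred_natsql)
    print("---------------")
pred_sql = natsql_to_sql(pred_natsql, db_id, db_file_path, table_dict[db_id]).strip()
I ask because I found that decoding directly, without post-processing, gives very poor results.

Why does every validation example produce ["sql placeholder"] at the text2sql.py stage?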

sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.33s/it]
2023-03-01 14:17:33,138 INFO root 输出结果
['sql placeholder']

This was a single-example test, run on a 32G V100.

Running evaluate_robustness returns nothing

Hello, I've attached a screenshot below to better highlight this issue.

For some reason, running sh scripts/evaluate_robustness/evaluate_on_spider_realistic.sh generates nothing in the eval_results directory. I can see the folders and the .txt file being created, but nothing is appended to that file. It is worth noting that I had already run the pre-processing script, sh scripts/evaluate_robustness/preprocess_spider_realistic.sh, beforehand.

inference cspider in 3b t5

OSError: Error no file named pytorch_model.bin found in directory ./models/text2natsql-mt5-xl-cspider/checkpoint-167433 but there is a file for Flax weights. Use from_flax=True to load this model from those weights.

This seems to be because transformers==4.17.0 does not support sharded models (https://discuss.huggingface.co/t/flan-t5-xl-model-does-not-appear-to-have-a-file-named-pytorch-model-bin/30395)

The repository does not have a license

Thank you so much for your wonderful work.

Currently, the repository does not have a license. According to the github documentation

You're under no obligation to choose a license. However, without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license.

Do you think you could add an open source license to the repository, so that other people are legally allowed to reproduce, distribute, or create derivative works from it?

More discussion on this matter

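Question about how column_number_in_each_table is defined in schema_item_classifier.py

Hello, while reading your code I noticed that lines 156-162 of schema_item_classifier.py define how batch_column_number_in_each_table is updated, but this relies on information from table_labels and column_labels. Later in the code, batch_column_number_in_each_table is passed to the model as an argument for inference. How should this argument be defined when no labels are available?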

Error in Running Inference script

 raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/text2sql-t5-base/checkpoint-39312'. Use `repo_type` argument if needed.

I am also attaching the entire output log:

RESDSQL.txt

Error when running sh scripts/inference/infer_text2natsql_cspider.sh 3b

NatSQL-Parser

Hey,
I'm very interested in your work. I want to train RESDSQL+NatSQL on my own dataset. I had no problems training plain RESDSQL, but I don't know how to create the NatSQL JSON file. Do you know if there is a script available for converting SQL queries into NatSQL?
Thanks in advance!

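How can schema_item_classifier.py predict table and column names for a newly added database?

schema_item_classifier.py is a model that selects tables and columns. How can it select tables and columns from a newly added database when the number of tables and columns in that database is not known in advance?

The CSpider training bash scripts seem to contain errors and are incomplete

./scripts/train/cspider_text2natsql/generate_text2natsql_dataset.sh has the following two problems (the same applies to cspider_text2sql):

On line 4, the input_dataset_path of text2sql_data_generator.py should be train_cspider_with_probs_natsql.json, which contains the column and table probabilities;
The step that runs schema_item_classifier.py on the training data is missing; the preprocessed_train_cspider_natsql.json written on line 4 should instead be the input to that model.

Support for Chinese

The models and datasets used in the example code all appear to be English, and my own tests confirm that Chinese is not well supported yet. If I want to port this to a Chinese setting, do I need to replace the RoBERTa model, the T5 model, and the training dataset with Chinese ones? I searched online and found a few corresponding models and datasets. Has the development team tried something similar before, and did you run into any difficulties or obstacles?

The Chinese model and dataset resources I found: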
https://github.com/brightmart/roberta_zh
https://github.com/SunnyGJing/t5-pegasus-chinese
https://taolusi.github.io/CSpider-explorer/

No matching distribution found for spacy==2.2.3

conda create -n your_env_name python=3.8.5
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt

When installing requirements.txt, the following error messages are printed:

Collecting spacy==2.2.3
  Using cached spacy-2.2.3.tar.gz (5.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/studio-lab-user/.conda/envs/studiolab/bin/python3.9 /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpnbsyo1tb
       cwd: /tmp/pip-install-5hq5ozgn/spacy_c569f2d7ab7a48d689e1bd1e3adaedf5
  Complete output (49 lines):
  Traceback (most recent call last):
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/requirements.py", line 35, in __init__
      parsed = _parse_requirement(requirement_string)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 64, in parse_requirement
      return _parse_requirement(Tokenizer(source, rules=DEFAULT_RULES))
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 82, in _parse_requirement
      url, specifier, marker = _parse_requirement_details(tokenizer)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 126, in _parse_requirement_details
      marker = _parse_requirement_marker(
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 147, in _parse_requirement_marker
      tokenizer.raise_syntax_error(
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_tokenizer.py", line 165, in raise_syntax_error
      raise ParserSyntaxError(
  setuptools.extern.packaging._tokenizer.ParserSyntaxError: Expected end or semicolon (after version specifier)
      spacy_lookups_data>=0.0.5<0.2.0
                        ~~~~~~~^
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
      main()
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 338, in run_setup
      exec(code, locals())
    File "<string>", line 200, in <module>
    File "<string>", line 190, in setup_package
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 106, in setup
      _install_setup_requires(attrs)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 77, in _install_setup_requires
      dist.parse_config_files(ignore_option_errors=True)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 900, in parse_config_files
      self._finalize_requires()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 596, in _finalize_requires
      self._convert_extras_requirements()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 611, in _convert_extras_requirements
      for r in _reqs.parse(v):
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/requirements.py", line 37, in __init__
      raise InvalidRequirement(str(e)) from e
  setuptools.extern.packaging.requirements.InvalidRequirement: Expected end or semicolon (after version specifier)
      spacy_lookups_data>=0.0.5<0.2.0
                        ~~~~~~~^
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b7/f2/052bfe5861761599b5421916aba3eb0064d83145ff3072390ecdc5a836de/spacy-2.2.3.tar.gz#sha256=1d14c9e7d65b2cecd56c566d9ffac8adbcb9ce2cff2274cbfdcf5468cd940e6a (from https://pypi.org/simple/spacy/) (requires-python:!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,>=2.7). Command errored out with exit status 1: /home/studio-lab-user/.conda/envs/studiolab/bin/python3.9 /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpnbsyo1tb Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement spacy==2.2.3 (from versions: 0.31, 0.32, 0.33, 0.40, 0.51, 0.52, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.67, 0.68, 0.70, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.97, 0.98, 0.99, 0.100.0, 0.100.1, 0.100.2, 0.100.3, 0.100.4, 0.100.5, 0.100.6, 0.100.7, 0.101.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.7.2, 1.7.3, 1.7.5, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.10.0, 1.10.1, 2.0.0, 2.0.1.dev0, 2.0.1, 2.0.2.dev0, 2.0.2, 2.0.3.dev0, 2.0.3, 2.0.4.dev0, 2.0.4, 2.0.5.dev0, 2.0.5, 2.0.6.dev0, 2.0.6, 2.0.7, 2.0.8, 2.0.9, 2.0.10.dev0, 2.0.10, 2.0.11.dev0, 2.0.11, 2.0.12.dev0, 2.0.12.dev1, 2.0.12, 2.0.13.dev0, 2.0.13.dev1, 2.0.13.dev2, 2.0.13.dev4, 2.0.13, 2.0.14.dev0, 2.0.14.dev1, 2.0.15, 2.0.16.dev0, 2.0.16, 2.0.17.dev0, 2.0.17.dev1, 2.0.17, 2.0.18.dev0, 2.0.18.dev1, 2.0.18, 2.1.0, 2.1.1.dev0, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.1.5, 2.1.6, 2.1.7.dev0, 2.1.7, 2.1.8, 2.1.9, 2.2.0.dev10, 2.2.0.dev11, 2.2.0.dev13, 2.2.0.dev15, 2.2.0.dev17, 2.2.0.dev18, 2.2.0.dev19, 2.2.0, 2.2.1, 2.2.2.dev0, 2.2.2.dev4, 2.2.2, 2.2.3.dev0, 2.2.3, 2.2.4, 2.3.0.dev1, 2.3.0, 2.3.1, 2.3.2, 2.3.3.dev0, 2.3.3, 2.3.4, 2.3.5, 2.3.6, 2.3.7, 2.3.8, 2.3.9, 3.0.0, 3.0.1.dev0, 3.0.1, 3.0.2, 3.0.3, 3.0.4, 3.0.5, 3.0.6, 3.0.7, 3.0.8, 3.0.9, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.1.4, 3.1.5, 3.1.6, 3.1.7, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.2.5, 3.2.6, 3.3.0.dev0, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4, 3.6.0.dev0, 3.6.0.dev1, 3.6.0, 3.7.0.dev0, 4.0.0.dev0, 4.0.0.dev1)
ERROR: No matching distribution found for spacy==2.2.3

I first tried installing on Windows 10,
then tried AWS SageMaker Studio Lab, which is similar to Google Colab,
and hit the same problem there.
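
The InvalidRequirement message in the log above points at the specifier `spacy_lookups_data>=0.0.5<0.2.0`, which has no comma between its two version clauses. The sketch below only illustrates what the parser is objecting to; `insert_missing_commas` is a hypothetical helper written for this note, not part of pip, setuptools, or spaCy, and it does not by itself make spacy 2.2.3 installable.

```python
import re

# Version comparison operators, longest first so ">=" wins over ">".
_OPERATORS = r"(?:===|==|~=|!=|<=|>=|<|>)"

# One version clause that is followed directly by another operator
# (optionally after whitespace) with no comma in between.
_MISSING_COMMA = re.compile(
    rf"({_OPERATORS}\s*[\w.*+!-]+?)\s*(?={_OPERATORS})"
)


def insert_missing_commas(requirement: str) -> str:
    """Hypothetical helper: add the comma that separates version clauses."""
    return _MISSING_COMMA.sub(r"\1,", requirement)


if __name__ == "__main__":
    print(insert_missing_commas("spacy_lookups_data>=0.0.5<0.2.0"))
```

Requirements that already separate their clauses with commas pass through unchanged.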

I then tried installing spacy 3.0.0,
which installed successfully.
But running the shell script infer_text2natsql.sh fails with a different problem:

(studiolab) studio-lab-user@default:~/sagemaker-studiolab-notebooks/RESDSQL$ sh scripts/inference/infer_text2natsql.sh 3b spider
/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py:715: UserWarning: [W094] Model 'en_core_web_sm' (2.2.0) specifies an under-constrained spaCy version requirement: >=2.2.0. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.0.0,<3.1.0
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/home/studio-lab-user/sagemaker-studiolab-notebooks/RESDSQL/NatSQL/table_transform.py", line 885, in <module>
    _tokenizer = get_spacy_tokenizer()
  File "/home/studio-lab-user/sagemaker-studiolab-notebooks/RESDSQL/NatSQL/natsql2sql/preprocess/TokenString.py", line 249, in get_spacy_tokenizer
    nlp = spacy.load("en_core_web_sm")
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 322, in load_model
    return load_model_from_package(name, **kwargs)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 355, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 514, in load_model_from_init_py
    return load_model_from_path(
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 388, in load_model_from_path
    config = load_config(config_path, overrides=dict_to_dot(config))
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 545, in load_config
    raise IOError(Errors.E053.format(path=config_path, name="config.cfg"))
OSError: [E053] Could not read config.cfg from /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/en_core_web_sm/en_core_web_sm-2.2.0/config.cfg

Has anyone else run into the same problem?
Or is something wrong with my environment?
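
The W094 warning and the E053 error above both mention model 'en_core_web_sm' (2.2.0) being loaded under the newly installed spaCy 3.0.0. By spaCy's naming convention, the first two components of a model version track the spaCy major and minor version it was built for, so a 2.2.x model is not expected to load under 3.0.x. A minimal sketch of that check follows; `model_matches_spacy` is a hypothetical helper for illustration, not a spaCy API.

```python
def model_matches_spacy(model_version: str, spacy_version: str) -> bool:
    """Hypothetical check: compare the major.minor parts of both versions."""
    return model_version.split(".")[:2] == spacy_version.split(".")[:2]


if __name__ == "__main__":
    # Versions taken from the log above: model 2.2.0, spaCy 3.0.0.
    print(model_matches_spacy("2.2.0", "3.0.0"))
```

If the check reports a mismatch, the likely options are to install a model built for the installed spaCy version, or to keep spaCy on the version the repository pins.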

How is NatSQL converted, and how is the skeleton extracted?

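I noticed that in another issue the author mentioned that the SQL-to-NatSQL conversion code does not seem to be open source. So did the author directly use a dataset that had already been converted? (Is there a ready-made converted dataset for CSpider as well?)
Also, the skeleton-aware decoder outputs the SQL skeleton first and the SQL query second. For NatSQL, can this be understood as: the first part is the NatSQL skeleton and the second part is the NatSQL query? If so, how is the NatSQL skeleton obtained?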

Is the text2sql-t5-3b used in the code downloaded directly from Hugging Face? I could not find where you download it.

TypeError: 'datetime.datetime' object is not subscriptable

File "F:\python_project\RESDSQL\NatSQL\natsql2sql\preprocess\db_match.py", line 194, in db_col_type_check
if skip_once and len(values) > 7 and not v[0][0].isdigit():
TypeError: 'datetime.datetime' object is not subscriptable

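In other words, when the rows [()....()] returned by the statement query = "select distinct "+col[1]+" from " + self.table_list[table_idx] + " order by "+col[1]+" limit 500" contain values of type 'datetime.datetime', this error is raised. Does this mean only databases such as SQLite, which have no datetime type, are supported?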
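
One possible workaround, sketched under the assumption that the check in db_col_type_check only needs to know whether a cell's textual form begins with a digit: convert the value to a string before indexing it. `starts_with_digit` is a hypothetical helper, not code from the repository, and treating a datetime as "digit-leading" may or may not be what the original column type heuristic intends.

```python
import datetime


def starts_with_digit(value) -> bool:
    """Hypothetical replacement for `v[0][0].isdigit()` that accepts any cell type.

    The value is converted with str() first, so datetime, int, and None
    no longer raise TypeError when the first character is inspected.
    """
    text = "" if value is None else str(value)
    return bool(text) and text[0].isdigit()


if __name__ == "__main__":
    # A datetime cell, as returned by database drivers that map
    # timestamp columns to Python datetime objects.
    print(starts_with_digit(datetime.datetime(2023, 5, 19, 12, 0)))
    print(starts_with_digit("abc"))
```

With such a helper, the condition quoted in the traceback could read `not starts_with_digit(v[0])` instead of `not v[0][0].isdigit()`.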

Error in running Inference script

I am trying to run the inference script, but I get TypeError: expected str, bytes or os.PathLike object, not NoneType

I am also attaching the entire output log.
New Text Document (2).txt

ModuleNotFoundError: No module named 'third_party.spider'

Traceback (most recent call last):
File "/Users/piranavs/hay_test/resdsql/text2sql.py", line 16, in <module>
from utils.spider_metric.evaluator import EvaluateTool
File "/Users/piranavs/hay_test/resdsql/utils/spider_metric/evaluator.py", line 4, in <module>
from third_party.spider.preprocess.get_tables import dump_db_json_schema
ModuleNotFoundError: No module named 'third_party.spider'

I am getting this error during inference. Am I doing something wrong?
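
The import that fails in the traceback is `third_party.spider.preprocess.get_tables`, so one thing worth checking is whether a `third_party/spider` directory actually exists under the directory the script is run from. A minimal sketch of that check follows; `missing_third_party` and its default path list are assumptions made for illustration, and the error may have other causes (for example, running the script from a different working directory).

```python
from pathlib import Path


def missing_third_party(repo_root, required=("third_party/spider",)):
    """Hypothetical check: list required sub-directories absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root; an empty list means the directory is there.
    print(missing_third_party("."))
```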

CSpider training without NatSQL: the second step fails with RuntimeError: input must have 3 dimensions, got 2

The error output is below. I downloaded xlm-roberta-large and placed it under the base_models directory. Download URL: https://huggingface.co/xlm-roberta-large/tree/main
Namespace(add_fk_info=False, alpha=0.75, batch_size=4, dev_filepath='./data/preprocessed_data/preprocessed_dev_cspider_natsql.json', device='0', epochs=128, gamma=2.0, gradient_descent_step=2, learning_rate=1e-05, mode='train', model_name_or_path='./base_models/xlm-roberta-large', output_filepath='data/pre-processing/dataset_with_pred_probs.json', patience=4, save_path='./models/xlm_roberta_text2natsql_schema_item_classifier', seed=42, tensorboard_save_path='./tensorboard_log/xlm_roberta_text2natsql_schema_item_classifier', train_filepath='./data/preprocessed_data/preprocessed_train_cspider_natsql.json', use_contents=True)
Some weights of the model checkpoint at ./base_models/xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']

This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
This is epoch 1.
Traceback (most recent call last):
File "schema_item_classifier.py", line 463, in <module>
_train(opt)
File "schema_item_classifier.py", line 277, in _train
batch_column_number_in_each_table
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/RESDSQL/utils/classifier_model.py", line 191, in forward
batch_column_number_in_each_table
File "/workspace/RESDSQL/utils/classifier_model.py", line 134, in table_column_cls
output_t, (hidden_state_t, cell_state_t) = self.table_name_bilstm(table_name_embeddings)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 677, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 620, in check_forward_args
self.check_input(input, batch_sizes)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 203, in check_input
expected_input_dim, input.dim()))
RuntimeError: input must have 3 dimensions, got 2

Thanks

Can this be adapted to other datasets?

Can this codebase be adapted to other NL2SQL datasets, such as DuSQL?