benbogin / spider-schema-gnn
Author implementation of the paper "Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing"
Hi @benbogin ,
I was able to train successfully and reproduce your benchmarks. Thanks for the excellent code and article.
While trying to predict with the trained network using this command:
allennlp predict \
  --include-package dataset_readers.spider \
  --include-package models.semantic_parsing.spider_parser \
  --output-file simple_test_output.json \
  --cuda-device 0 \
  --weights-file experiments/name_of_experiment/best.th \
  experiments/name_of_experiment/model.tar.gz \
  experiments/name_of_experiment/simple_test_input.json
I received this error:
allennlp.common.checks.ConfigurationError: 'No default predictor for model type spider.\nPlease specify a predictor explicitly.'
Could you please advise, ideally with a simple example of predicting on the Spider train/dev/val set?
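The error asks for an explicit predictor. A minimal sketch of the argument list follows. It assumes the predictor is registered as `spider` and lives in `predictors.spider_predictor`, as in a script posted further down in this thread; neither name is verified here, and `build_predict_argv` is a hypothetical helper.

```python
def build_predict_argv(experiment_dir, input_file, output_file, cuda_device=0):
    """Build an `allennlp predict` argument list that names the predictor explicitly.

    Assumption: the predictor name "spider" and the package
    "predictors.spider_predictor" are copied from a script later in this thread.
    """
    return [
        "allennlp", "predict",
        f"{experiment_dir}/model.tar.gz",      # model archive, as in the command above
        input_file,
        "--predictor", "spider",               # the piece missing from the failing command
        "--use-dataset-reader",
        "--include-package", "dataset_readers.spider",
        "--include-package", "models.semantic_parsing.spider_parser",
        "--include-package", "predictors.spider_predictor",
        "--weights-file", f"{experiment_dir}/best.th",
        "--output-file", output_file,
        "--cuda-device", str(cuda_device),
    ]


argv = build_predict_argv(
    "experiments/name_of_experiment",
    "experiments/name_of_experiment/simple_test_input.json",
    "simple_test_output.json",
)
print(" ".join(argv))
```

The printed line can be pasted into a shell, or the list can be assigned to `sys.argv` before calling `allennlp.commands.main()`, which is what the later script in this thread does.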
command: allennlp train train_configs/paper_defaults.jsonnet -s experiments/paper_name_of_experiment --include-package dataset_readers.spider --include-package models.semantic_parsing.spider_parser
problem:
0it [00:00, ?it/s]2021-03-09 15:39:32,619 - INFO - dataset_readers.spider - Trying to load cache from cache\dev.json
175it [00:20, 8.34it/s]
Traceback (most recent call last):
  File "e:\anaconda\envs\f\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "e:\anaconda\envs\f\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\anaconda\envs\f\Scripts\allennlp.exe\__main__.py", line 7, in <module>
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\run.py", line 18, in run
    main(prog="allennlp")
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\__init__.py", line 101, in main
    args.func(args)
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 103, in train_model_from_args
    args.force)
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 136, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover, force)
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 184, in train_model
    pieces = TrainerPieces.from_params(params, serialization_dir, recover) # pylint: disable=no-member
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\training\trainer.py", line 739, in from_params
    all_datasets = training_util.datasets_from_params(params)
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\training\util.py", line 146, in datasets_from_params
    validation_data = validation_and_test_dataset_reader.read(validation_data_path)
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\data\dataset_readers\dataset_reader.py", line 73, in read
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "e:\anaconda\envs\f\lib\site-packages\allennlp\data\dataset_readers\dataset_reader.py", line 73, in <listcomp>
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "e:\anaconda\envs\f\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File ".\dataset_readers\spider.py", line 93, in _read
    ex = fix_number_value(ex)
  File ".\dataset_readers\dataset_util\spider_utils.py", line 143, in fix_number_value
    ex['query_toks'][i_val_end + 1].lower() != ex['query_toks_no_value'][i_no_val + 1].lower():
IndexError: list index out of range
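The failing line indexes one position past `i_val_end` and `i_no_val` without checking the list lengths. The sketch below shows a bounds-checked version of that comparison. `next_tokens_differ` is a hypothetical helper, not code from the repository, and returning `False` at the end of a list is an assumption, since the surrounding loop in `spider_utils.py` is not shown in this thread.

```python
def next_tokens_differ(query_toks, query_toks_no_value, i_val_end, i_no_val):
    """Bounds-checked version of the comparison that raised IndexError.

    Compares the token after i_val_end in query_toks with the token after
    i_no_val in query_toks_no_value, ignoring case.
    """
    nxt, nxt_no_val = i_val_end + 1, i_no_val + 1
    if nxt >= len(query_toks) or nxt_no_val >= len(query_toks_no_value):
        # Assumption: when either list has no following token, report "no mismatch".
        return False
    return query_toks[nxt].lower() != query_toks_no_value[nxt_no_val].lower()


# A value that is the last token of the query has no following token to compare.
print(next_tokens_differ(["a", "=", "1"], ["a", "=", "value"], 2, 2))
```

Whether silently skipping the comparison is acceptable depends on what the loop does next, so logging the offending example before skipping it may be the safer first step.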
I ran on a single GPU with batch_size=13 and ran out of CUDA memory. It works if I reduce batch_size, but your default batch_size is 15. How did you train? Did you set up multiple GPUs, and if so, how?
Hi @benbogin
I tried to predict some queries. But when I run the predictor, I got an error:
Traceback (most recent call last):
  File "/home/qwy/anaconda3/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 100, in main
    import_submodules(package_name)
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/common/util.py", line 314, in import_submodules
    module = importlib.import_module(package_name)
  File "/home/qwy/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "./models/semantic_parsing/__init__.py", line 1, in <module>
    from models.semantic_parsing.spider_parser import SpiderParser
  File "./models/semantic_parsing/spider_parser.py", line 18, in <module>
    from torch_geometric.data import Data, Batch
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/data/data.py", line 4, in <module>
    from torch_geometric.utils import (contains_isolated_nodes,
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/utils/__init__.py", line 2, in <module>
    from .scatter import scatter_
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/utils/scatter.py", line 1, in <module>
    import torch_scatter
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/__init__.py", line 3, in <module>
    from .mul import scatter_mul
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/mul.py", line 3, in <module>
    from torch_scatter.utils.ext import get_func
  File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/utils/ext.py", line 2, in <module>
    import torch_scatter.scatter_cpu
ImportError: /home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/scatter_cpu.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1019UndefinedTensorImpl10_singletonE
I remember hitting the same torch_scatter problem when I trained the models; back then, reinstalling torch_scatter fixed it. This time reinstalling doesn't help. Could you please help me?
Hi Bogin, thanks for open-sourcing the excellent code! I have learned a lot from it :) However, I have a question about the string entity and would like to ask for your help.
As I understand it, the string entity is tied to database values, right? (See the functions read_dataset_values and get_entities_from_question.) However, as far as I know, the Spider dataset does not allow reading database values at test time. So I guess there should be a gap between train and test: given an utterance, in training we can somehow know whether a word corresponds to a database value, but at test time we cannot even get the string entity.
But given the small gap between train and test performance on the leaderboard, I do not know whether my guess is correct. So I want to know: do you obtain the database values at test time? If not, how do you handle the inconsistency described above? And what is the insight or motivation behind reading the dataset values? Thanks for your patience; looking forward to your response!
Hello,
I created my own database directory and added my schema.sql and <database_name>.sqlite file.
I added the values to tables.json as well.
But, when I am running the allennlp predict command
Please let me know if this is a known error and if there is a workaround.
We are afraid that hard SQL, such as table JOINs, is the limiting factor for industrial applications.
Thank you very much.
I tried to remove query_toks_no_value from the Spider test data to run prediction, but it crashed. It makes no sense to require that field when running prediction :S
File "/venv/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 76, in read
"Is the path correct?".format(file_path))
allennlp.common.checks.ConfigurationError: 'No instances were read from the given filepath /spider-schema-gnn/test/dev.json. Is the path correct?'
I used paper_defaults.jsonnet and reproduced the paper's result of 0.40. But with the other configuration, defaults.jsonnet, the best I reached over many runs was 0.41 instead of 0.47. I am a little confused. Can you help me track down the problem? Thank you!
The dataset assumes that the tables corresponding to each question are given.
But in a real industrial application, we have 100+ tables for a single new question.
Thank you!
I tried to train the model on a CPU-based system, but I got the following error:
2019-06-03 10:10:17,299 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
usage: allennlp
allennlp: error: unrecognized arguments:
./train.sh: line 5: --include-package: command not found
I cannot install apex since I don't have CUDA on my system. Is it possible to train the model on a CPU, or having a GPU is part of the repository requirement?
Thanks for the help!
For some reason, when I train the model, it stops with the following output:
2019-07-02 13:16:56,613 - INFO - allennlp.training.trainer - Ran out of patience. Stopping training.
2019-07-02 13:16:56,614 - INFO - allennlp.training.checkpointer - loading best weights
2019-07-02 13:16:56,635 - INFO - allennlp.models.archival - archiving weights and vocabulary to experiments/name_of_experiement/model.tar.gz
2019-07-02 13:16:57,984 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 12,
  "best_validation__match/exact_match": 0.327852998065764,
  "best_validation_sql_match": 0.43036750483558994,
  "best_validation__others/action_similarity": 0.5266099921420543,
  "best_validation__match/match_single": 0.552158273381295,
  "best_validation__match/match_hard": 0.28870292887029286,
  "best_validation_beam_hit": 0.5674081237911025,
  "best_validation_loss": 7.97007417678833,
  "peak_cpu_memory_MB": 5171.892
}
What does this mean?
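The log itself shows the trainer loading the best weights and archiving the model after "Ran out of patience", which is how early stopping normally ends a run rather than a crash. A generic sketch of patience-based early stopping follows. It is not AllenNLP's implementation, `stopping_epoch` is a hypothetical helper, and the patience value used by this repository's configs is not shown in the thread.

```python
def stopping_epoch(validation_scores, patience):
    """Return (best_epoch, last_epoch_run) under patience-based early stopping.

    Higher scores are better. Training stops once `patience` epochs have
    passed without beating the best score seen so far.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # ran out of patience
    return best_epoch, len(validation_scores) - 1


# Made-up validation scores: the best one is at epoch 1 and nothing beats it afterwards.
print(stopping_epoch([0.10, 0.30, 0.20, 0.25, 0.28], patience=2))
```

In the metrics above, `best_epoch` is 12, so the archived `model.tar.gz` holds the weights from that epoch, not from the last epoch that ran.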
Hi Ben,
How long did it take you to train the model, and on what GPU? I ran training on a K80 and it took ~13 hours and 55 epochs until convergence, and was wondering if you experienced the same.
Thanks,
Ben
Hi
When I try to predict SQL queries for different questions, the model generates a query, but when the query involves a WHERE or HAVING clause, it emits the text 'value' in the predicate instead of the original entity value.
eg:
question: "What is the email id for the person with first name as 'asp'"
predicted : select leads.email from leads where leads.firstname = ' value '
expected : select leads.email from leads where leads.firstname = ' asp'
I also see similar queries in the examples you give in the paper, but I didn't find any explanation of this in the paper. Is this expected with this model? What is the reason for it, and is there any way to get the entity value instead of the "value" text?
Your response is much appreciated. Thank you
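The thread does not say whether the model can emit real values. One workaround is to post-process the prediction. The sketch below is a naive heuristic, not part of the repository: it assumes values appear in the question as quoted strings and substitutes them, in order, for each `' value '` placeholder. `fill_values` and `PLACEHOLDER` are hypothetical names.

```python
import re

PLACEHOLDER = "' value '"  # the placeholder text as it appears in the predictions above


def fill_values(predicted_sql, question):
    """Replace each placeholder with the next quoted string found in the question.

    Assumption: values are written in quotes in the question. Unquoted values,
    numbers, and apostrophes inside words are not handled.
    """
    quoted = re.findall(r"'([^']*)'|\"([^\"]*)\"", question)
    values = [single or double for single, double in quoted]
    parts = predicted_sql.split(PLACEHOLDER)
    result = parts[0]
    for i, part in enumerate(parts[1:]):
        if i < len(values):
            # Double any single quote so the SQL literal stays well-formed.
            result += "'" + values[i].replace("'", "''") + "'"
        else:
            result += PLACEHOLDER  # nothing left to substitute; keep the placeholder
        result += part
    return result


question = "What is the email id for the person with first name as 'asp'"
predicted = "select leads.email from leads where leads.firstname = ' value '"
print(fill_values(predicted, question))
```

Matching values against the actual database contents would be more robust than relying on quotes, but that requires access to the database at prediction time.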
I ran the predict command and found that the output SQL is masked with "value" (as shown in the picture). Can I configure the code to make it generate the complete SQL? Thank you!
This is the script I used to get the prediction
import json
import shutil
import sys
from allennlp.commands import main

overrides = json.dumps({"dataset_reader": {"keep_if_unparsable": True}})
name_of_experiment = "try20200516"
experiment_dir = f"experiments/{name_of_experiment}"
dataset_dir = "dataset/dev.json"
serialization_dir = f"experiments/{name_of_experiment}"
output_file = f"experiments/{name_of_experiment}/predictionQA.json"

# Assemble the command into sys.argv
sys.argv = [
    "allennlp",  # command name, not used by main
    "predict",
    experiment_dir,
    dataset_dir,
    "--predictor", "spider",
    "--use-dataset-reader",
    "--cuda-device=-1",
    "--output-file", output_file,
    "--silent",
    "--include-package", "models.semantic_parsing.spider_parser",
    "--include-package", "dataset_readers.spider",
    "--include-package", "predictors.spider_predictor",
    "--weights-file", f"experiments/{name_of_experiment}/best.th",
    "-o", overrides,
]

main()
Hello,
My setup: CUDA 10, Ubuntu 18.04, Python 3.6.8, Docker container.
I followed all the instructions, but when I run the training command
I get the error below.
Traceback (most recent call last):
  File "/opt/conda/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 100, in main
    import_submodules(package_name)
  File "/opt/conda/lib/python3.6/site-packages/allennlp/common/util.py", line 314, in import_submodules
    module = importlib.import_module(package_name)
  File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'dataset_readers'
Any idea whether this is a version compatibility issue?
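The traceback ends inside `importlib.import_module`, called from `import_submodules`, so Python could not find the `dataset_readers` package named in `--include-package`. One possible cause, which is an assumption and not confirmed in this thread, is that the command was run from a directory other than the repository root, for example a different working directory inside the Docker container. The hypothetical helper below checks whether a package is importable from a given directory.

```python
import importlib
import importlib.util
import os
import sys


def can_import(package, root="."):
    """Return True if the top-level `package` can be found with `root` on sys.path."""
    root = os.path.abspath(root)
    added = root not in sys.path
    if added:
        sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()  # pick up directories created after startup
        return importlib.util.find_spec(package) is not None
    finally:
        if added:
            sys.path.remove(root)  # leave sys.path as we found it


# Run from the directory where `allennlp train` is invoked.
print(can_import("dataset_readers", "."))
```

If this prints `False`, changing into the repository root or adding it to `PYTHONPATH` before running the command is worth trying before suspecting a version mismatch.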
Hi ,
It seems that the allennlp predict command accepts a test file only in .json format. Is there any way to provide a single question as an argument and predict the SQL query for just that question?
Thank You
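One workaround is to wrap the single question in a one-item JSON file and pass that file to `allennlp predict`. The sketch below assumes the input is a JSON list of objects with `db_id` and `question` keys, as in Spider's dev.json; `write_single_question` is a hypothetical helper. Other posts in this thread report that the dataset reader also expected SQL-related fields such as `query_toks_no_value`, so this minimal record may not be enough without changes to the reader.

```python
import json


def write_single_question(question, db_id, path):
    """Write a one-example input file for `allennlp predict` and return its path.

    Assumption: the reader accepts a JSON list of {"db_id", "question"} objects.
    """
    example = {"db_id": db_id, "question": question}
    with open(path, "w", encoding="utf-8") as f:
        json.dump([example], f, indent=2)
    return path
```

For example, `write_single_question("How many singers are there?", "concert_singer", "single_question.json")` produces a file that can be given as the input path; the question and database id here are only illustrative.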
Hi,
I would like to ask for help generating the ".sqlite" file for a new database/table.
For my own data, I created the directory "mytable", put the schema.sql file in it, and used ".read" to load and query it in sqlite. It works fine.
But I do not know how to generate the ".sqlite" file, e.g. "mytable.sqlite".
I would appreciate any advice.
thank you
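A minimal sketch of one way to do this with Python's standard `sqlite3` module: connecting to a path that does not exist yet creates the database file, and `executescript` runs the statements from schema.sql against it. `build_sqlite` is a hypothetical helper, and the file names in the usage note are assumptions taken from the post above.

```python
import sqlite3


def build_sqlite(schema_sql_path, sqlite_path):
    """Create (or open) the SQLite file at sqlite_path and run schema.sql against it."""
    with open(schema_sql_path, encoding="utf-8") as f:
        script = f.read()
    conn = sqlite3.connect(sqlite_path)  # creates the file if it does not exist
    try:
        conn.executescript(script)  # runs every statement in the script
        conn.commit()
    finally:
        conn.close()
```

Calling `build_sqlite("mytable/schema.sql", "mytable/mytable.sqlite")` should leave a `mytable.sqlite` file next to the schema. Running it twice against the same file will fail on `CREATE TABLE` unless the schema uses `IF NOT EXISTS`, so delete the old file first when regenerating.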
Hi! Training stopped after it "ran out of patience". Is it possible to continue training from the last checkpoint?
The following is the output when the training was stopped:
2019-06-13 20:13:05,860 - INFO - allennlp.training.trainer - Ran out of patience. Stopping training.
2019-06-13 20:13:05,861 - INFO - allennlp.training.checkpointer - loading best weights
2019-06-13 20:13:05,883 - INFO - allennlp.models.archival - archiving weights and vocabulary to experiments/name_of_experiement/model.tar.gz
2019-06-13 20:13:07,237 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 12,
  "peak_cpu_memory_MB": 6957.568,
  "training_duration": "04:18:55",
  "training_start_epoch": 0,
  "training_epochs": 31,
  "epoch": 31,
  "training__match/exact_match": 0,
  "training_sql_match": 0,
  "training__others/action_similarity": 0,
  "training__match/match_single": 0,
  "training__match/match_hard": 0,
  "training_beam_hit": 0,
  "training_loss": 1.6497428404544694,
  "training_cpu_memory_MB": 6957.568,
  "validation__match/exact_match": 0.32108317214700194,
  "validation_sql_match": 0.41682785299806574,
  "validation__others/action_similarity": 0.5264176853154189,
  "validation__match/match_single": 0.5413669064748201,
  "validation__match/match_hard": 0.2719665271966527,
  "validation_beam_hit": 0.5411992263056092,
  "validation_loss": 10.870468139648438,
  "best_validation__match/exact_match": 0.327852998065764,
  "best_validation_sql_match": 0.43036750483558994,
  "best_validation__others/action_similarity": 0.5266099921420543,
  "best_validation__match/match_single": 0.552158273381295,
  "best_validation__match/match_hard": 0.28870292887029286,
  "best_validation_beam_hit": 0.5674081237911025,
  "best_validation_loss": 7.97007417678833
}
error with SELECT count(*) FROM follows GROUP BY f1
Error col: "value"
error with SELECT T1.company_name FROM Third_Party_Companies AS T1 JOIN Maintenance_Contracts AS T2 ON T1.company_id = T2.maintenance_contract_company_id JOIN Ref_Company_Types AS T3 ON T1.company_type_code = T3.company_type_code ORDER BY T2.contract_end_date DESC LIMIT 1
'ref_company_types'
Hi,
I used some custom data and my own statements to check the performance. Below are some of the predicted outputs. And most of them do not make sense.
select count ( * ) from 'mock'
select count ( * ) from 'mock' where ' value '
select * from 'mock' where count ( * ) > ' value '
select count ( * ) from 'mock' group by count ( * ) having ' value '
select count ( * ) , count ( * ) from 'mock' group by count ( * ) having ' value '
NO PREDICTION
NO PREDICTION
NO PREDICTION
select count ( * ) from 'mock' group by count ( * ) having count ( * ) > ' value '
NO PREDICTION
select * from 'mock' where count ( * ) > ' value '
select count ( * ) from 'mock' group by count ( * ) having count ( * ) > ' value '
NO PREDICTION
select count ( * ) , mock.gender from ( select count ( * ) from 'mock' where count ( * ) = ( select count ( * ) from 'mock' ) ) group by count ( * )
Not being accurate is one thing, but getting grammatically incorrect SQL was unexpected. Do you think the SQL grammar is not being applied properly during decoding (some issue with my pipeline), or that the schema tokens are not being applied to the embedding vectors during encoding? Any help is much appreciated.
Thanks & Regards,
Abhi
Hi,
I am just trying to predict SQL queries for English statements. I am using the same command as mentioned under "Inference". In my input .json file, I removed all the fields except "db_id" and "question".
At first I had to change load_cache = False to run the program.
Then the "spider.text_to_instance" function had some issues: it returned None for the instance because it expected a SQL query as one of its parameters. I set sql = [] to make it return a non-None instance, but I think action_sequence is now empty.
Can you please briefly explain what changes are needed to just predict without any evaluation, and what exactly spiderworld and action_sequence do?
Thank you!
Best regards,
Abhi
Hi, I got this problem when running this project. I searched for a solution here, but it didn't work.