benbogin / spider-schema-gnn

Author implementation of the paper "Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing"

Python 98.64% Jsonnet 1.36%

Contributors: benbogin, wlhgtc

spider-schema-gnn's Issues

allennlp Prediction parameter

Hi @benbogin ,
I was able to complete successful training and reproduce your benchmarks. Thanks for the excellent code and article.

While trying to predict using the trained network with this command:

allennlp predict \
  --include-package dataset_readers.spider \
  --include-package models.semantic_parsing.spider_parser \
  --output-file simple_test_output.json \
  --cuda-device 0 \
  --weights-file experiments/name_of_experiment/best.th \
  experiments/name_of_experiment/model.tar.gz \
  experiments/name_of_experiment/simple_test_input.json

I received this error:
allennlp.common.checks.ConfigurationError: 'No default predictor for model type spider.\nPlease specify a predictor explicitly.'

Can you please advise, with a simple example of predicting even on the train/dev/val Spider set?

I have encountered such problems in the process of processing data

command: allennlp train train_configs/paper_defaults.jsonnet -s experiments/paper_name_of_experiment --include-package dataset_readers.spider --include-package models.semantic_parsing.spider_parser
problem:
0it [00:00, ?it/s]2021-03-09 15:39:32,619 - INFO - dataset_readers.spider - Trying to load cache from cache\dev.json
175it [00:20, 8.34it/s]
Traceback (most recent call last):
File "e:\anaconda\envs\f\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "e:\anaconda\envs\f\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "E:\anaconda\envs\f\Scripts\allennlp.exe\__main__.py", line 7, in <module>
File "e:\anaconda\envs\f\lib\site-packages\allennlp\run.py", line 18, in run
main(prog="allennlp")
File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\__init__.py", line 101, in main
args.func(args)
File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 103, in train_model_from_args
args.force)
File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 136, in train_model_from_file
return train_model(params, serialization_dir, file_friendly_logging, recover, force)
File "e:\anaconda\envs\f\lib\site-packages\allennlp\commands\train.py", line 184, in train_model
pieces = TrainerPieces.from_params(params, serialization_dir, recover) # pylint: disable=no-member
File "e:\anaconda\envs\f\lib\site-packages\allennlp\training\trainer.py", line 739, in from_params
all_datasets = training_util.datasets_from_params(params)
File "e:\anaconda\envs\f\lib\site-packages\allennlp\training\util.py", line 146, in datasets_from_params
validation_data = validation_and_test_dataset_reader.read(validation_data_path)
File "e:\anaconda\envs\f\lib\site-packages\allennlp\data\dataset_readers\dataset_reader.py", line 73, in read
instances = [instance for instance in Tqdm.tqdm(instances)]
File "e:\anaconda\envs\f\lib\site-packages\allennlp\data\dataset_readers\dataset_reader.py", line 73, in <listcomp>
instances = [instance for instance in Tqdm.tqdm(instances)]
File "e:\anaconda\envs\f\lib\site-packages\tqdm\std.py", line 1178, in __iter__
for obj in iterable:
File ".\dataset_readers\spider.py", line 93, in _read
ex = fix_number_value(ex)
File ".\dataset_readers\dataset_util\spider_utils.py", line 143, in fix_number_value
ex['query_toks'][i_val_end + 1].lower() != ex['query_toks_no_value'][i_no_val + 1].lower():
IndexError: list index out of range
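
To find out which example triggers the `IndexError`, one option is to run the preprocessing step over every example and collect the failures instead of stopping at the first one. The sketch below is generic: `find_failing_examples` is a hypothetical helper that is not part of the repository, and `transform` stands in for a function such as `fix_number_value`.

```python
from typing import Any, Callable, Dict, List, Tuple


def find_failing_examples(
    examples: List[Dict[str, Any]],
    transform: Callable[[Dict[str, Any]], Any],
) -> List[Tuple[int, str]]:
    """Apply `transform` to each example and record those that raise IndexError."""
    failures = []
    for index, example in enumerate(examples):
        try:
            transform(example)
        except IndexError as err:
            # Keep the position and the message so the example can be inspected later.
            failures.append((index, str(err)))
    return failures


if __name__ == "__main__":
    # Toy stand-in for the real preprocessing step (an assumption, for illustration only).
    def toy_transform(example: Dict[str, Any]) -> str:
        return example["query_toks"][2]

    data = [{"query_toks": ["a", "b", "c"]}, {"query_toks": ["a"]}]
    print(find_failing_examples(data, toy_transform))
```

The returned indices point at the entries in the loaded JSON list that need a closer look, for example entries whose `query_toks` and `query_toks_no_value` are out of step.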

RuntimeError: CUDA out of memory.

I trained on a single GPU with batch_size=13 and ran out of CUDA memory. Training works if I reduce batch_size, but your default batch_size is 15. How did you train? Did you use multiple GPUs, and if so, how do I set that up?

Problem with torch_scatter

Hi @benbogin
I tried to predict some queries, but when I ran the predictor I got an error:

Traceback (most recent call last):
File "/home/qwy/anaconda3/bin/allennlp", line 10, in <module>
sys.exit(run())
File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 100, in main
import_submodules(package_name)
File "/home/qwy/anaconda3/lib/python3.7/site-packages/allennlp/common/util.py", line 314, in import_submodules
module = importlib.import_module(package_name)
File "/home/qwy/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "./models/semantic_parsing/__init__.py", line 1, in <module>
from models.semantic_parsing.spider_parser import SpiderParser
File "./models/semantic_parsing/spider_parser.py", line 18, in <module>
from torch_geometric.data import Data, Batch
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
from .data import Data
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/data/data.py", line 4, in <module>
from torch_geometric.utils import (contains_isolated_nodes,
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/utils/__init__.py", line 2, in <module>
from .scatter import scatter_
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_geometric/utils/scatter.py", line 1, in <module>
import torch_scatter
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/__init__.py", line 3, in <module>
from .mul import scatter_mul
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/mul.py", line 3, in <module>
from torch_scatter.utils.ext import get_func
File "/home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/utils/ext.py", line 2, in <module>
import torch_scatter.scatter_cpu

ImportError: /home/qwy/anaconda3/lib/python3.7/site-packages/torch_scatter/scatter_cpu.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1019UndefinedTensorImpl10_singletonE

I still remember when I trained the models, I got the same problem with torch_scatter. But I just reinstalled torch_scatter and it worked. This time reinstalling doesn't work. Could you please help me?
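
An undefined-symbol error when loading a compiled extension often means the extension was built against a different build of the library it links to. Whether that is the cause here is an assumption, but listing the installed versions side by side is a cheap first check. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+, so newer than the Python 3.7 in the traceback); `installed_versions` is a hypothetical helper and the distribution names in the usage example are assumed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each distribution, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip list`.
    for name, version in installed_versions(
        ["torch", "torch-scatter", "torch-geometric"]
    ).items():
        print(f"{name}: {version}")
```

If the versions changed between the training run and the prediction run, rebuilding `torch_scatter` in the same environment as the current `torch` is the usual next step.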

Inconsistent behavior on reading values in train/test

Hi Bogin, thanks for open-sourcing the excellent code! I have learned a lot from it :) However, I have a question about the string entities and would like to ask for your help.

As I understand it, string entities are tied to database values, right? (See the functions read_dataset_values and get_entities_from_question.) However, as far as I know, the Spider dataset does not allow reading database values at test time. So I guess there should be a gap between train and test: given an utterance, in training we can somehow know whether a word corresponds to a database value, but at test time we cannot even get the string entities.

But judging from the small gap between train and test performance on the leaderboard, I am not sure whether my guess is correct. So I want to know: do you OBTAIN the database values at test time? If not, how do you handle the inconsistency described above? And what is the insight or motivation behind reading the dataset values? Thanks for your patience; looking forward to your response!

Adding my own database_issue

Hello,
I created my own database directory and added my schema.sql and <database_name>.sqlite file.
I added the values to tables.json as well.

But when I run the allennlp predict command:

  • I had trouble initially with 'is_global_variable = True' for table name. It turned out that my table name is being read as " 'mock' " instead of "mock".
  • I tried checking the names in the schema, tables.json and .sqlite file. I am not able to pinpoint the issue.
  • I made a small correction to the validation for "is_global_item" to account for single quotes and moved on.
  • But, now, I have issues with entities and entity_maps.
    File "./models/semantic_parsing/spider_parser.py", line 592, in <listcomp>
    entity_ids = [entity_map[entity] for entity in entities]
    KeyError: "'mock'"

Please let me know if this is a known error and if there is a workaround.
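
Since the table name arrives as `'mock'` with the quotes included, one possible workaround is to normalize identifiers before they are used as dictionary keys. The sketch below shows the idea only; `strip_identifier_quotes` is a hypothetical helper, not a function from the repository, and where to call it in the pipeline is left open.

```python
def strip_identifier_quotes(name: str) -> str:
    """Remove one pair of matching surrounding quotes from a SQL identifier."""
    name = name.strip()
    # Only strip when the first and last characters are the same quote character.
    if len(name) >= 2 and name[0] == name[-1] and name[0] in "'\"`":
        return name[1:-1]
    return name


if __name__ == "__main__":
    for raw in ["'mock'", '"mock"', "mock"]:
        print(repr(raw), "->", repr(strip_identifier_quotes(raw)))
```

Applying the same normalization wherever table names are read (schema, tables.json, entity map) keeps the keys consistent, which is what the `KeyError` suggests is missing.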

How to run prediction without providing `query_toks_no_value` in the test data?

I tried removing query_toks_no_value from the Spider test data to run prediction, but it crashed. It makes no sense to require that field when running prediction :S


  File "/venv/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 76, in read
    "Is the path correct?".format(file_path))
allennlp.common.checks.ConfigurationError: 'No instances were read from the given filepath /spider-schema-gnn/test/dev.json. Is the path correct?'

problem about the result

I used paper_defaults.jsonnet to reproduce the paper's result of 0.40, but with the other configuration, defaults.jsonnet, the best score over many runs was 0.41 instead of 0.47. I am a little confused. Can you help me track down the problem? Thank you!

Problem with training on CPU (apex)

I tried to train the model on a CPU-based system, but I got the following error:

2019-06-03 10:10:17,299 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
usage: allennlp
allennlp: error: unrecognized arguments:
./train.sh: line 5: --include-package: command not found

I cannot install apex since I don't have CUDA on my system. Is it possible to train the model on a CPU, or is a GPU a requirement for this repository?

Thanks for the help!

Training issue: Ran out of patience. Stopping Training

For some reason, when I train the model, it stops with the following output:
2019-07-02 13:16:56,613 - INFO - allennlp.training.trainer - Ran out of patience. Stopping training.
2019-07-02 13:16:56,614 - INFO - allennlp.training.checkpointer - loading best weights
2019-07-02 13:16:56,635 - INFO - allennlp.models.archival - archiving weights and vocabulary to experiments/name_of_experiement/model.tar.gz
2019-07-02 13:16:57,984 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 12,
  "best_validation__match/exact_match": 0.327852998065764,
  "best_validation_sql_match": 0.43036750483558994,
  "best_validation__others/action_similarity": 0.5266099921420543,
  "best_validation__match/match_single": 0.552158273381295,
  "best_validation__match/match_hard": 0.28870292887029286,
  "best_validation_beam_hit": 0.5674081237911025,
  "best_validation_loss": 7.97007417678833,
  "peak_cpu_memory_MB": 5171.892
}
What does this mean?
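
The message is logged at INFO level and reads like early stopping: training ends once the validation metric has not improved for a configured number of epochs (the "patience"), after which the best weights are reloaded and archived. The sketch below illustrates that general rule only. `early_stopping_epoch` is a hypothetical function, and the trainer's exact bookkeeping may differ.

```python
from typing import List, Optional


def early_stopping_epoch(scores: List[float], patience: int) -> Optional[int]:
    """Return the epoch index at which training stops, or None if it never does.

    Training stops once `patience` consecutive epochs have passed without
    improving on the best validation score seen so far (higher is better).
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Made-up validation scores, for illustration only.
    history = [0.10, 0.30, 0.20, 0.25, 0.28]
    print(early_stopping_epoch(history, patience=3))
```

Under this reading, `best_epoch` in the metrics is the epoch whose weights end up in model.tar.gz, and raising the patience value in the training config would let training run longer before stopping.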

[Question] Training Time

Hi Ben,

How long did it take you to train the model, and on what GPU? I ran training on a K80 and it took ~13 hours and 55 epochs until convergence, and was wondering if you experienced the same.

Thanks,
Ben

predicting 'value' in where and having clauses instead of the actual entity value

Hi
When I try to predict SQL queries for different questions and the query involves a WHERE or HAVING clause, the model generates the text 'value' in the predicate instead of the original entity value.
e.g.:
question: "What is the email id for the person with first name as 'asp'"
predicted : select leads.email from leads where leads.firstname = ' value '
expected : select leads.email from leads where leads.firstname = ' asp'

I also see similar queries for the examples you mention in the paper, and I didn't find any information about this in the paper. Is this expected with this model? What is the reason, and is there any way to get the entity value instead of the "value" text?

Your response is much appreciated. Thank you

question about the output SQL

I ran the predict command and found that the output SQL is masked with "value" (as shown in the picture). Can I configure the code to make it generate the complete SQL? Thank you!
image

This is the script I used to get the prediction

import json
import sys

from allennlp.commands import main

# Override the dataset reader config so that unparsable examples are kept.
overrides = json.dumps({"dataset_reader": {"keep_if_unparsable": True}})

name_of_experiment = "try20200516"
experiment_dir = f"experiments/{name_of_experiment}"
dataset_path = "dataset/dev.json"
output_file = f"{experiment_dir}/predictionQA.json"

# Assemble the command into sys.argv
sys.argv = [
    "allennlp",  # command name, not used by main
    "predict",
    experiment_dir,  # trained model location
    dataset_path,  # input file to predict on
    "--predictor", "spider",
    "--use-dataset-reader",
    "--cuda-device=-1",  # -1 runs on CPU
    "--output-file", output_file,
    "--silent",
    "--include-package", "models.semantic_parsing.spider_parser",
    "--include-package", "dataset_readers.spider",
    "--include-package", "predictors.spider_predictor",
    "--weights-file", f"{experiment_dir}/best.th",
    "-o", overrides,
]

main()

AllenNLP "dataset_readers" module

Hello,
My setup: CUDA 10, Ubuntu 18.04, Python 3.6.8, Docker container

I followed all the instructions, but when I tried running the training command I got the error below.
Traceback (most recent call last):
File "/opt/conda/bin/allennlp", line 10, in <module>
sys.exit(run())
File "/opt/conda/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/opt/conda/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 100, in main
import_submodules(package_name)
File "/opt/conda/lib/python3.6/site-packages/allennlp/common/util.py", line 314, in import_submodules
module = importlib.import_module(package_name)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'dataset_readers'

Any idea whether this is a version compatibility issue?
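
The traceback shows `--include-package` importing the package by name, so `dataset_readers` has to be findable on Python's import path. One plausible cause (an assumption, not confirmed in this thread) is running the command from a directory other than the repository root. The sketch below checks which top-level packages Python can locate from the current working directory; `missing_packages` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that Python cannot locate on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level package names taken from the --include-package arguments in this thread.
    print(missing_packages(["dataset_readers", "models", "predictors"]))
```

An empty list means the packages are importable from where the script runs; otherwise, changing into the repository root or adding it to `PYTHONPATH` is worth trying.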

How to predict for a single question

Hi ,

It seems that the allennlp predict command only accepts a test file in .json format. Is there any way to provide a single question as an argument and predict the SQL query for just that question?

Thank You
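
One way to approximate single-question prediction while the command still expects a file is to write a one-entry JSON file and pass that as the input. The sketch below does only the file-writing part. `write_single_question` is a hypothetical helper, the `db_id` and `question` field names follow the Spider-style inputs discussed in these issues, and the dataset reader may require further fields, as other reports here suggest.

```python
import json
import os
import tempfile
from typing import Optional


def write_single_question(question: str, db_id: str, path: Optional[str] = None) -> str:
    """Write a one-example JSON list and return the path of the file."""
    if path is None:
        # Create a temporary .json file when no explicit path is given.
        fd, path = tempfile.mkstemp(suffix=".json")
        os.close(fd)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump([{"db_id": db_id, "question": question}], handle)
    return path


if __name__ == "__main__":
    # The database id and question are example values only.
    print(write_single_question("How many singers are there?", "concert_singer"))
```

The returned path can then be used as the input file argument of `allennlp predict`.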

creating the ".sqlite" file for new data

Hi,

I would like to ask for help generating the ".sqlite" file for a new database/table.

For my own data, I created the directory "mytable", put the schema.sql file in it, and used ".read" to load and query it in sqlite. It works fine.

But I do not know how to generate the ".sqlite" file, such as "mytable.sqlite".

I appreciate any piece of advice.

thank you
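
A SQLite database is a single file, so running the schema script against a file-backed connection produces the ".sqlite" file. The sketch below uses Python's standard `sqlite3` module; `build_sqlite_from_schema` is a hypothetical helper, and the paths in the usage example are assumptions based on the names in this issue.

```python
import sqlite3


def build_sqlite_from_schema(schema_path: str, sqlite_path: str) -> None:
    """Execute every statement in a .sql file against a SQLite database file."""
    with open(schema_path, encoding="utf-8") as handle:
        script = handle.read()
    # Connecting to a path that does not exist yet creates the database file.
    connection = sqlite3.connect(sqlite_path)
    try:
        connection.executescript(script)
        connection.commit()
    finally:
        connection.close()


if __name__ == "__main__":
    # Example paths only; adjust to the actual directory layout.
    build_sqlite_from_schema("mytable/schema.sql", "mytable/mytable.sqlite")
```

The same result can be had from the sqlite3 shell by opening the target file first (`sqlite3 mytable.sqlite`) and then using `.read schema.sql`.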

Continue training after "Ran out of patience"

Hi! Training stopped after it "ran out of patience". Is it possible to continue training from the last checkpoint?

The following is the output when the training was stopped:
2019-06-13 20:13:05,860 - INFO - allennlp.training.trainer - Ran out of patience. Stopping training.
2019-06-13 20:13:05,861 - INFO - allennlp.training.checkpointer - loading best weights
2019-06-13 20:13:05,883 - INFO - allennlp.models.archival - archiving weights and vocabulary to experiments/name_of_experiement/model.tar.gz
2019-06-13 20:13:07,237 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 12,
  "peak_cpu_memory_MB": 6957.568,
  "training_duration": "04:18:55",
  "training_start_epoch": 0,
  "training_epochs": 31,
  "epoch": 31,
  "training__match/exact_match": 0,
  "training_sql_match": 0,
  "training__others/action_similarity": 0,
  "training__match/match_single": 0,
  "training__match/match_hard": 0,
  "training_beam_hit": 0,
  "training_loss": 1.6497428404544694,
  "training_cpu_memory_MB": 6957.568,
  "validation__match/exact_match": 0.32108317214700194,
  "validation_sql_match": 0.41682785299806574,
  "validation__others/action_similarity": 0.5264176853154189,
  "validation__match/match_single": 0.5413669064748201,
  "validation__match/match_hard": 0.2719665271966527,
  "validation_beam_hit": 0.5411992263056092,
  "validation_loss": 10.870468139648438,
  "best_validation__match/exact_match": 0.327852998065764,
  "best_validation_sql_match": 0.43036750483558994,
  "best_validation__others/action_similarity": 0.5266099921420543,
  "best_validation__match/match_single": 0.552158273381295,
  "best_validation__match/match_hard": 0.28870292887029286,
  "best_validation_beam_hit": 0.5674081237911025,
  "best_validation_loss": 7.97007417678833
}
error with SELECT count(*) FROM follows GROUP BY f1
Error col: "value"
error with SELECT T1.company_name FROM Third_Party_Companies AS T1 JOIN Maintenance_Contracts AS T2 ON T1.company_id = T2.maintenance_contract_company_id JOIN Ref_Company_Types AS T3 ON T1.company_type_code = T3.company_type_code ORDER BY T2.contract_end_date DESC LIMIT 1
'ref_company_types'

Unexpected outputs from custom data

Hi,
I used some custom data and my own statements to check the performance. Below are some of the predicted outputs, and most of them do not make sense.

select count ( * ) from 'mock'
select count ( * ) from 'mock' where ' value '
select * from 'mock' where count ( * ) > ' value '
select count ( * ) from 'mock' group by count ( * ) having ' value '
select count ( * ) , count ( * ) from 'mock' group by count ( * ) having ' value '
NO PREDICTION
NO PREDICTION
NO PREDICTION
select count ( * ) from 'mock' group by count ( * ) having count ( * ) > ' value '
NO PREDICTION
select * from 'mock' where count ( * ) > ' value '
select count ( * ) from 'mock' group by count ( * ) having count ( * ) > ' value '
NO PREDICTION
select count ( * ) , mock.gender from ( select count ( * ) from 'mock' where count ( * ) = ( select count ( * ) from 'mock' ) ) group by count ( * )

Not being accurate is one thing, but getting grammatically incorrect SQL was unexpected. Do you think the SQL grammar is not being applied properly during decoding (some issue with my pipeline), or that the schema tokens are not being applied to the embedding vectors during encoding? Any help is much appreciated.

Thanks & Regards,
Abhi

Prediction without validation

Hi,
I am just trying to predict SQL queries for English statements. I am using the same command as mentioned under "Inference". In my input .json file, I removed all the fields except "db_id" and "question".

At first I had to change load_cache = False to run the program.

Then the "spider.text_to_instance" function had some issues. It returned None for the instance because it expected a SQL query as one of the parameters. I set sql = [] to make it return a non-null instance, but I think action_sequence is now empty.

Can you please give me a brief explanation on what changes to make to just predict without any evaluation and what exactly are spiderworld and action_sequence doing?
Thank you!
Best regards,
Abhi
