brunnurs / valuenet

ValueNet: A Neural Text-to-SQL Architecture Incorporating Values

License: Apache License 2.0

Python 99.33% Dockerfile 0.46% Shell 0.16% Makefile 0.06%

valuenet's People

Contributors

brunnurs, ckosten, delixfe, groovytron, kurtstockinger, ruizcrp


valuenet's Issues

Exception: 'NoneType' object is not subscriptable

ERROR!!! HTTP: 400. for request 'what is the avearge price of all five star hotel'
{
  "error": {
    "code": 400,
    "message": "API key not valid. Please pass a valid API key.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developers console",
            "url": "https://console.developers.google.com"
          }
        ]
      }
    ]
  }
}

Exception: 'NoneType' object is not subscriptable
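A plausible reading of this log (an assumption, not confirmed against the repository's code): the NER request fails with HTTP 400 because of the invalid API key, the client code returns None instead of a response, and a later subscript on that None raises this exception. A minimal defensive sketch, with hypothetical names:

    def extract_entities(question, ner_response):
        # 'ner_response' is whatever the HTTP layer returned; after the 400 above
        # it is presumably None, and subscripting None raises exactly this error.
        if ner_response is None:
            raise RuntimeError(
                f"NER request failed for '{question}'; "
                "check that a valid Google API key is configured."
            )
        return ner_response["entities"]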

Missing values when replicating train.json file

Hi,

Thanks for your work.

I'm working on replicating the train.json file, so I followed the scripts in the .run folder (prefixed #1-#4). However, the generated train.json file has some differences, including some missing values, compared to the original train.json file you released. Could you please clarify why I'm not able to replicate the exact same train.json file used during ValueNet training? Maybe I'm missing a step in this process (see differences below). Thanks for your time.

Differences between the original train.json (for ValueNet training) and the generated train.json file (created with the scripts prefixed #1-#4 in the .run folder):

  1. Missing values: a diff between the original and generated files shows some missing values in the generated train.json. E.g., for the "question": "List the name, born state and age of the heads of departments ordered by age.", the values include "State" (line 622 in the original train.json), but this value is not found in the generated train.json ("values":[]).
  2. The order of the values found differs for many queries, so the rule in some of these cases differs too (maybe not so relevant if the rule is valid).
  3. The generated train.json has a new field in all question entries, "all_values_found"; this field does not exist in the original train.json.

@brunnurs again thanks for any clarification on this matter.

Facing an error if input is more than 512 tokens

If I want to use more than 512 tokens, BERT does not allow it and I get an error. Can anyone help me with how to use a bigger database that needs 2000+ tokens?
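BERT-base has position embeddings for at most 512 tokens, so the encoder cannot simply be fed a 2000-token schema serialization; the input has to be truncated, the schema filtered down, or the encoder swapped for a long-context model (which then needs retraining). A minimal truncation sketch, assuming the HuggingFace transformers tokenizer behind the project's bert-base-uncased encoder:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    long_input = "question text ... plus a serialized 2000+ token schema"  # placeholder
    encoded = tokenizer(
        long_input,
        max_length=512,
        truncation=True,       # silently drops everything beyond position 512
        return_tensors="pt",
    )
    print(encoded["input_ids"].shape)   # at most [1, 512]

Note that truncation drops schema columns, so accuracy on large schemas may suffer either way.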

Error while executing some natural language queries in manual_inference.py

I'm using the school_finance database for manual inference and a model that I have trained.
Here is a snapshot of the table in the database which I'm querying.
[screenshot: table contents]

Question: what is the amount of donation given by Monte plata?

Error:

[screenshot: error output]
After carefully analyzing the code step by step, I found that the results variable in _inference_semql() [manual_inference.py] has no elements. However, in the same function, we try to access results[0].actions. Can you help me out here? What is wrong, and how can I solve it?

Thanks in advance!
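A hedged guess at what is happening: the beam-search decoder can finish without any complete hypothesis (for instance when no value candidate is found for the question), leaving results empty, so results[0].actions fails. A minimal guard sketch (the parse call below is a hypothetical stand-in, not the repository's exact signature):

    results = parser.parse(example, beam_size=5)   # hypothetical call
    if not results:
        # Surface a readable error instead of failing on results[0].actions.
        raise RuntimeError(
            "Beam search returned no complete hypothesis for this question; "
            "try a larger beam size or check that value candidates were found."
        )
    best_actions = results[0].actions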

TypeError: string indices must be integers

I'm getting this error while trying to run the ValueNet project (https://github.com/brunnurs/valuenet) on my laptop, and I don't understand what exactly it means.

I get the error when I try to train the model using: python src/main.py

Configuration: Windows 10 Home, Nvidia GTX 1650 with CUDA enabled.

Output of the wandb log file:

** parsed configuration from command line and combine with constants **
argument: exp_name=exp
argument: seed=90
argument: toy=False
argument: data_set=spider
argument: batch_size=1
argument: cuda=False
argument: encoder_pretrained_model=bert-base-uncased
argument: max_seq_length=512
argument: num_epochs=5.0
argument: lr_base=0.001
argument: lr_connection=0.0001
argument: lr_transformer=2e-05
argument: scheduler_gamma=0.5
argument: max_grad_norm=1.0
argument: clip_grad=5.0
argument: loss_epoch_threshold=50
argument: sketch_loss_weight=1.0
argument: column_pointer=True
argument: embed_size=300
argument: hidden_size=300
argument: action_embed_size=128
argument: att_vec_size=300
argument: type_embed_size=128
argument: col_embed_size=300
argument: readout=identity
argument: column_att=affine
argument: dropout=0.3
argument: beam_size=5
argument: decode_max_time_step=40
argument: data_dir=data\spider
argument: model_output_dir=experiments
Run experiment 'exp__20210517_150422'
We use the device: 'cpu' and 0 gpu's.
Loading from datasets…
Load data from data\spider\original\tables.json. N=166
Load data from data\spider\train.json. N=7000
Load data from data\spider\dev.json. N=1032
Successfully loaded pre-trained transformer 'bert-base-uncased'
Use Column Pointer: True
Build optimizer and scheduler. Total training steps: 35000.0
Start training with 5.0 epochs
0%| | 0/5 [00:00<?, ?it/s]
Training: 0%| | 0/7000 [00:00<?, ?it/s]
Training: 0%| | 0/7000 [00:00<?, ?it/s]
0%| | 0/5 [00:00
    sketch_loss_weight=sketch_loss_weight)
  File "E:\Project\Implementation\valuenet1\valuenet\src\training.py", line 32, in train
    sketch_loss, lf_loss = model.forward(examples)
  File "E:\Project\Implementation\valuenet1\valuenet\src\model\model.py", line 117, in forward
    batch.values)
  File "D:\Python\Anaconda3\envs\valuenet\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "E:\Project\Implementation\valuenet1\valuenet\src\model\encoder\encoder.py", line 65, in forward
    averaged_hidden_states_question, pointers_after_question = self._average_hidden_states_question(last_hidden_states, all_question_span_lengths)
  File "E:\Project\Implementation\valuenet1\valuenet\src\model\encoder\encoder.py", line 141, in _average_hidden_states_question
    averaged_span = torch.mean(last_hidden_states[batch_itr_idx, pointer: pointer + span_length, :],
TypeError: string indices must be integers
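A likely cause (an assumption based on the error, not verified against this setup): with transformers 4.x a model call returns a ModelOutput object rather than a tuple, and unpacking it like a tuple yields its keys, which are strings. last_hidden_states then literally is the string 'last_hidden_state', and slicing it with a tuple raises exactly this TypeError. Two standard remedies, sketched here:

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer("how many parks are there", return_tensors="pt")

    # Buggy pattern with transformers >= 4.0 (unpacks the dict KEYS, i.e. strings):
    #   last_hidden_states, pooled = model(**inputs)

    outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state   # fix 1: explicit attribute access
    # last_hidden_states, pooled = model(**inputs, return_dict=False)  # fix 2: legacy tuple

Pinning the transformers version the repository was developed against would have the same effect.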

When I import the tables.json file it's not working

When I tried to import the tables.json file into SQLite it does not come out as expected and I do not get the proper output. Kindly let me know how to import the JSON file into SQLite. Also, inside the original folder the database folder is missing (cre_Theme_park.sqlite); is there anywhere I can download the DB file and run the testing?
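For clarity (based on the Spider data format this project uses): tables.json holds schema metadata (database ids, table and column names, types, keys), not row data, so it cannot be imported into SQLite as a database. The .sqlite files themselves, including cre_Theme_park.sqlite, ship with the Spider dataset download. A minimal sketch for inspecting tables.json:

    import json

    # tables.json is metadata, so it is read with the json module,
    # not imported into SQLite.
    with open("data/spider/original/tables.json") as f:
        schemas = json.load(f)

    for schema in schemas:
        if schema["db_id"] == "cre_Theme_park":
            print(schema["table_names_original"])      # tables of that database
            print(schema["column_names_original"][:5]) # first few (table_idx, column) pairs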

Question on how to reproduce the result

Hello,

I am trying to reproduce the result using the parameters specified in your paper. However, I don't have as much memory on my GPU, so I set the batch size to 10 instead of 20. I can achieve only 45% accuracy so far after about 40 epochs; I tried multiple times. I also tried setting two of the learning rates to half the original value, with no effect. Can you give me some advice on how I can reproduce your result (fine-tune the model) with a batch size of 10? Thank you so much.

Best
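One generic way to keep the paper's effective batch size of 20 on a smaller GPU is gradient accumulation: run two micro-batches of 10 and step the optimizer once. The sketch below is a plain PyTorch loop, not the repository's src/training.py; it borrows only the sketch_loss, lf_loss = model.forward(examples) signature visible in the traceback of another issue on this page, and assumes model, optimizer, and data_loader already exist.

    import torch

    accumulation_steps = 2          # 2 micro-batches of 10 == effective batch size 20
    optimizer.zero_grad()
    for step, examples in enumerate(data_loader):
        sketch_loss, lf_loss = model.forward(examples)
        loss = (sketch_loss + lf_loss) / accumulation_steps   # scale so gradients match
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()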

Questions about code understanding in example_builder.py

Hi! Thanks for your contribution; it's really meaningful work. Please excuse my English.

I tried the project in Chinese with the DuSQL dataset (I didn't participate in that competition, I just used the data for some experiments).

However, I found some code I can't understand: example_builder.py, line 39:

process_dict['col_set_iter'][0] = ['count', 'number', 'many']

At first I thought this line of code has no effect on training, so I commented it out. But after training, when I ran evaluate.py on the dev data, I found that this line does affect the final result, like this:

  1. Comment out the line:
                     easy                 medium               hard                 extra                all                 
count                156                  678                  421                  421                  1676                
=====================   EXECUTION ACCURACY     =====================
execution            0.596                0.360                0.287                0.223                0.329               
====================== EXACT MATCHING ACCURACY =====================
exact match          0.455                0.447                0.430                0.406                0.433  
  2. Don't comment it out; use: process_dict['col_set_iter'][0] = ['count', 'number', 'many']
                     easy                 medium               hard                 extra                all                 
count                156                  678                  421                  421                  1676                
=====================   EXECUTION ACCURACY     =====================
execution            0.660                0.401                0.302                0.247                0.362               
====================== EXACT MATCHING ACCURACY =====================
exact match          0.609                0.503                0.542                0.575                0.541 
  3. Don't comment it out; use: process_dict['col_set_iter'][0] = ['列名1', '列名2', '列名3'] (i.e. "column name 1", "column name 2", "column name 3")
                     easy                 medium               hard                 extra                all                 
count                156                  678                  421                  421                  1676                
=====================   EXECUTION ACCURACY     =====================
execution            0.628                0.364                0.285                0.230                0.335               
====================== EXACT MATCHING ACCURACY =====================
exact match          0.500                0.450                0.442                0.442                0.450 

I am not sure if I have carelessly misunderstood something; could you give me some advice?
Thank you!
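A possible explanation (an assumption based on the IRNet lineage of this architecture, not confirmed by the authors): the first entry of col_set_iter is the special '*' column, and giving it count-like surface forms ('count', 'number', 'many') lets the column pointer align questions such as "how many ..." with '*'. That would also explain why arbitrary placeholders like '列名1' ("column name 1") recover almost nothing: they carry no such signal. For a Chinese dataset like DuSQL, the analogous move would be Chinese count words:

    # Hypothetical adaptation for a Chinese dataset, mirroring the English line:
    # give the '*' column Chinese surface forms that counting questions use.
    process_dict['col_set_iter'][0] = ['多少', '数量', '几个']   # "how many", "count", "how many items"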

Error for number columns while looking for potential candidates

I found an issue with an example question on the hack_zurich database: "What is the share of electric cars in 2016 for Wetzikon?". The program has a problem querying the Postgres database, both on the deployed instance and when running locally.

The response has HTTP code 500:
[screenshot: HTTP 500 response]

According to the logs from a local run (visible in the last 5 lines of the log dump below), there seems to be a problem translating the value for the integer column inside the database. I looked into the database provided by the scripts in this repository, and it looks like the passenger_cars_per_1000_inhabitants column is varchar while it probably should be a numeric column.

127.0.0.1 - - [27/Sep/2022 17:48:31] "PUT /question/hack_zurich HTTP/1.1" 200 -
provided API-Key is 1234
{'question': 'What is the share of electric cars in 2016 for Wetzikon?', 'beam_size': 2}
question has been tokenized to : ['What', 'is', 'the', 'share', 'of', 'electric', 'cars', 'in', '2016', 'for', 'Wetzikon', '?']
HTTP: 200. for request 'What is the share of electric cars in 2016 for Wetzikon?'

Process example idx: 0
Question: What is the share of electric cars in 2016 for Wetzikon?
SQL: DUMMY
Look for potential candidates "[('Wetzikon', 0.7), ('share', 0.7), ('cars', 0.7), ('2016', 1.0)]" in database hack_zurich (include primary keys: True)
[2022-09-27 17:49:03,612] ERROR in app: Exception on /question/hack_zurich [PUT]
Traceback (most recent call last):
  File "C:\Users\tomas\anaconda3\envs\cuda\lib\site-packages\flask\app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\tomas\anaconda3\envs\cuda\lib\site-packages\flask\app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Users\tomas\anaconda3\envs\cuda\lib\site-packages\flask_cors\extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "C:\Users\tomas\anaconda3\envs\cuda\lib\site-packages\flask\app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\tomas\anaconda3\envs\cuda\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "D:\Badawczy\git\valuenet\src\manual_inference\manual_inference_api.py", line 137, in pose_question
    example_pre_processed = _pre_processing(example_merged[0], db_value_finder, args.ner_api_secret)
  File "D:\Badawczy\git\valuenet\src\manual_inference\helper.py", line 50, in _pre_processing
    token_grouped, token_types, column_matches, value_candidates, _ = pre_process(0, example, ner_information, db_value_finder, is_training=False)
  File "D:\Badawczy\git\valuenet\src\preprocessing\pre_process.py", line 256, in pre_process
    value_candidates, all_values_found, column_matches = lookup_database(example, ner_information, columns,
  File "D:\Badawczy\git\valuenet\src\preprocessing\pre_process.py", line 113, in lookup_database
    database_matches = match_values_in_database(database_value_finder, potential_value_candidates,
  File "D:\Badawczy\git\valuenet\src\named_entity_recognition\pre_process_ner_values.py", line 92, in match_values_in_database
    matching_db_values = db_value_finder.find_similar_values_in_database(candidates, include_primary_keys)
  File "D:\Badawczy\git\valuenet\src\named_entity_recognition\database_value_finder\database_value_finder_postgresql.py", line 71, in find_similar_values_in_database
    matches = self._find_matches_in_column_by_exact_matching(table, column, potential_numeric_values, conn)
  File "D:\Badawczy\git\valuenet\src\named_entity_recognition\database_value_finder\database_value_finder_postgresql.py", line 112, in _find_matches_in_column_by_exact_matching
    cursor.execute(f"""
psycopg2.errors.UndefinedFunction: operator does not exist: character varying = integer
LINE 3: ...           WHERE passenger_cars_per_1000_inhabitants = 2016;
                                                                ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

127.0.0.1 - - [27/Sep/2022 17:49:03] "PUT /question/hack_zurich HTTP/1.1" 500 -
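The failing statement compares a varchar column with an integer literal, which Postgres refuses without an explicit cast (unlike SQLite, which compares loosely). Two possible fixes: alter the column to a numeric type in the database, or cast to text inside the lookup query so mixed-type columns never raise. A minimal psycopg2 sketch of the second option (connection string and table name are placeholders, not the repository's exact code):

    import psycopg2

    conn = psycopg2.connect("dbname=hack_zurich")        # placeholder DSN
    with conn, conn.cursor() as cursor:
        cursor.execute(
            """
            SELECT DISTINCT passenger_cars_per_1000_inhabitants
            FROM some_table                              -- placeholder table name
            WHERE passenger_cars_per_1000_inhabitants::text = %s
            """,
            (str(2016),),    # compare as text, so varchar never meets an integer
        )
        print(cursor.fetchall())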

"message": "API key not valid. Please pass a valid API key.",

Hi,

I am trying to run the code in its basic "play around" mode:

python src/manual_inference/manual_inference.py --model_to_load=pretrained_models/trained_model.pt --database=cre_Theme_park

and I am getting the following error:

ERROR!!! HTTP: 400. for request 'how many parks are there'
{
  "error": {
    "code": 400,
    "message": "API key not valid. Please pass a valid API key.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "API_KEY_INVALID",
        "domain": "googleapis.com",
        "metadata": {
          "service": "language.googleapis.com"
        }
      }
    ]
  }
}

Could you kindly advise on what needs to be done?

Thank you
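The service field in the error metadata (language.googleapis.com) shows this request goes to the Google Natural Language API, which the project uses for named-entity recognition; judging by the traceback in another issue on this page, the key is handed in through the ner_api_secret argument. So a valid Google Cloud API key with the Natural Language API enabled is needed. A minimal sketch of the underlying REST call to verify a key independently, assuming plain requests against Google's documented documents:analyzeEntities endpoint:

    import requests

    API_KEY = "your-google-cloud-api-key"   # create/enable in the Google Cloud console
    resp = requests.post(
        f"https://language.googleapis.com/v1/documents:analyzeEntities?key={API_KEY}",
        json={"document": {"type": "PLAIN_TEXT", "content": "how many parks are there"}},
    )
    resp.raise_for_status()   # a 400 here reproduces the 'API key not valid' error
    print(resp.json().get("entities", []))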

Problem with preprocessing the data

Hello,

I am having some problems preprocessing the data from scratch.

I've tried to create the ner_train.json file with the script src/named_entity_recognition/api_ner/extract_values.py.
However, I found that the key "values" is not generated by the script.

Thus, I am not able to run src/named_entity_recognition/pre_process_ner_values.py, which I believe is the next step in preprocessing.

Am I missing something here? How can I generate preprocessed data?

Best
