expressgnn's Introduction

ExpressGNN

This is an implementation of ExpressGNN, proposed in the paper "Efficient Probabilistic Logic Reasoning with Graph Neural Networks" (ICLR 2020).

Requirements

  • python 3.7
  • pytorch 1.1
  • scikit-learn
  • networkx
  • tqdm
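
For reference, a minimal environment setup might look like this (the pinned versions and environment name here are assumptions beyond the list above):

    conda create -n expressgnn python=3.7
    conda activate expressgnn
    pip install torch==1.1.0 scikit-learn networkx tqdm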

Quick Start

The following command starts inference on the Kinship-S1 dataset on a GPU:

python -m main.train -data_root data/kinship/S1 -slice_dim 8 -batchsize 16 -use_gcn 1 -embedding_size 64 -gcn_free_size 32 -load_method 0 -exp_folder exp -exp_name kinship -device cuda

To run ExpressGNN on the FB15K-237 dataset on a GPU, use the following command:

python -m main.train -data_root data/fb15k-237 -rule_filename cleaned_rules_weight_larger_than_0.9.txt -slice_dim 16 -batchsize 16 -use_gcn 1 -num_hops 1 -embedding_size 128 -gcn_free_size 127 -patience 20 -lr_decay_patience 100 -entropy_temp 1 -load_method 1 -exp_folder exp -exp_name freebase -device cuda

expressgnn's Issues

How are rules generated

I see that the rules for the FB dataset are given in a TXT file in the code. I would like to know how each rule is generated. This is very important for my own dataset, and I hope to get your reply.
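
As a side note on format: the FB15k-237 rule filename (cleaned_rules_weight_larger_than_0.9.txt) suggests automatically mined rules filtered by a confidence threshold. Purely for illustration, such a weighted first-order rule line might be parsed like this (the predicate names and line format here are assumptions, not the repository's actual format):

    # Hypothetical rule line: a confidence weight, then a disjunctive clause.
    line = "0.95\t!spouse(a,b) v spouse(b,a)"
    weight, clause = line.split("\t")
    atoms = [atom.strip() for atom in clause.split(" v ")]
    print(float(weight), atoms)  # 0.95 ['!spouse(a,b)', 'spouse(b,a)']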

Full EM algorithm for other datasets

I tried to use the full EM algorithm on the Cora dataset and ran into errors. I noticed that in the "preparation data for M-step" part of the code, rules are required to have at most two negated atoms, and rules with more than one positive atom seem to cause errors later in the code. Is this a restriction on the datasets for which the full EM algorithm can be used? I notice that Kinship, Cora, and UW-CSE cannot be used with full EM for these reasons. Thanks!
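
For concreteness, the restriction I am describing could be expressed as a small check like this (a sketch based on my reading of the errors, not the repository's code; the atom representation is an assumption):

    # Sketch: per the behavior above, a rule seems usable for full EM only
    # if it has at most two negated atoms and at most one positive atom.
    def usable_for_full_em(atom_is_negated):
        negated = sum(atom_is_negated)
        positive = len(atom_is_negated) - negated
        return negated <= 2 and positive <= 1

    print(usable_for_full_em([True, True, False]))   # True
    print(usable_for_full_em([True, False, False]))  # False: two positive atoms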

test_data in get_batch_by_q ?

In get_batch_by_q of dataset.py:

if validation:
    fact_ls = self.valid_fact_ls
else:
    fact_ls = self.test_fact_ls

So fact_ls is either valid_data or test_data, and in the E-step of the training loop, test_data is used for training.
Is that test data leakage? Looking forward to your reply.

@expressGNN @yuyuz

ExpressGNN-EM for Kinship / UW-CSE / Cora data

Hi,

If I understand correctly, train.py implements ExpressGNN-EM for FB15k-237 and ExpressGNN-E for the other datasets. If we'd like to run ExpressGNN-E for FB15k-237, can we simply comment out lines 212 to 230?

If I'd like to run ExpressGNN-EM for the other datasets, how can I do that? Thanks!

Another question is about the results on FB15k-237. When I ran this command:

python -m main.train -data_root data/fb15k-237 -rule_filename cleaned_rules_weight_larger_than_0.9.txt -slice_dim 16 -batchsize 16 -use_gcn 1 -num_hops 1 -embedding_size 128 -gcn_free_size 127 -patience 20 -lr_decay_patience 100 -entropy_temp 1 -load_method 1 -exp_folder exp -exp_name freebase -device cuda

I think it runs ExpressGNN-EM, and the paper reports 0.49 MRR and 0.608 Hits@10. However, I got MRR 0.4501 and Hits@10 0.5706, which look like the results of ExpressGNN-E, not ExpressGNN-EM. Could you tell me what is going wrong here? Thanks!
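
For reference, here is how I compute the metrics from per-query ranks (a generic sketch, not the repository's evaluation code):

    # Generic MRR / Hits@k over the ranks assigned to the true entities.
    def mrr_and_hits(ranks, k=10):
        mrr = sum(1.0 / r for r in ranks) / len(ranks)
        hits = sum(1 for r in ranks if r <= k) / len(ranks)
        return mrr, hits

    print(mrr_and_hits([1, 2, 5, 20]))  # (0.4375, 0.75)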

Test data leakage?

I've had a look at your recent ICLR20 paper; the results for FB15k-237 are outright amazing! I browsed the source code in this repository to better understand what you do. I stumbled across the following lines in dataset.py:

        for fact in query_ls:
            self.test_fact_ls.append((fact.val, fact.pred_name, tuple(fact.const_ls)))
            self.test_fact_dict[fact.pred_name].add((fact.val, tuple(fact.const_ls)))
            add_ht(fact.pred_name, fact.const_ls, self.ht_dict)

Here query_ls contains the test set facts, and add_ht registers the fact.

If I interpret this correctly, the MLN is constructed as follows. It first adds a variable for each fact r(e1,e2) in the training, validation, and test data. Afterwards, for each such fact, additional variables are (conceptually) added by perturbing e1 or e2: i.e., variables for all facts of the form r(e1,?) and r(?,e2) are added as well.

Each of the so-obtained variables is marked as observed (if it appears in the training data) or latent (otherwise).
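
To make my reading concrete, here is a minimal sketch of that construction on toy data (my interpretation, not the repository's actual code; "MaryMajor" is an invented constant):

    # Toy facts; "married_to" mirrors the example below.
    train_facts = {("married_to", "JohnDoe", "MaryMajor")}
    eval_facts = {("married_to", "JohnDoe", "JaneDoe")}   # valid/test
    constants = {"JohnDoe", "JaneDoe", "MaryMajor"}

    variables = {}
    for rel, e1, e2 in train_facts | eval_facts:
        for c in constants:
            # Perturb one argument at a time: r(e1, ?) and r(?, e2).
            for var in ((rel, e1, c), (rel, c, e2)):
                variables[var] = var in train_facts  # observed iff in train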

Is this understanding correct?

The reason I am asking is that such an approach seems to leak validation and test data into training. Why? It's true that the truth values of the validation and test data are not used during training. But the choice of variables in the MLN already tells the MLN that r(e1,?) and r(?,e2) are sensible queries, and consequently provides information about e1 and e2. That's fine for the training-data facts. For validation and test facts, however, it's problematic.

For example, consider a test-set fact married_to(JohnDoe, JaneDoe). The mere existence of the variable married_to(JohnDoe, ?) informs the (tunable) embedding of JohnDoe: it must be a person. Likewise for married_to(?, JaneDoe). That's the first reason for potential leakage. Another reason is that, without any inference or learning, one may "look" at the set of created variables and reduce the set of potential wives for JohnDoe to the set of persons that have been seen as wives in the validation or test data. (All facts from the training data are observed, so the corresponding wives are ruled out.) If so, this would significantly simplify the task.

I'd appreciate it if you could clarify whether the above description is accurate and, in particular, where I have misunderstood the approach.
