
gae's People

Contributors

gokceneraslan, philipjackson, shoniko, tkipf

gae's Issues

Error in Line#127 train.py

Shouldn't the preds in np.zeros be preds_neg? Unless the lengths of preds and preds_neg are the same.

Current code:

    preds_all = np.hstack([preds, preds_neg])
    labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds))])

Suggested fix:

    preds_all = np.hstack([preds, preds_neg])
    labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds_neg))])

Thanks
Hritvik

what is the scipy version used for creating the input data

Hi tkipf,
I am trying to run your code, but something is wrong with the loaded data. This may be caused by a mismatch between the scipy versions used for creating and loading the data. Could you please share the scipy version you used, or the code you used to generate the data? Thanks~

IOError: bad message length

I am feeding in a new undirected graph dataset (V=18059, E=286535). It fails with:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: bad message length

GAE for directed graphs + finding common patterns within a set of graphs

Hi Thomas,

First off: great work :)

Second: I would like to apply your GAE but for weighted directed graphs, without features.
I have a set of such graphs which (I hypothesize) contain similar sub-graphs and connection patterns between the nodes.
I would like to seek out those patterns, to then verify if they are indeed present (regardless of how I would do that).

My questions are:

  1. I think I can adapt your approach by using a directed graph Laplacian:
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.laplacian.html
    I would use a sigmoid activation function, and minimize a logloss (although it won't really have anything to do with likelihood).
    But am I missing something? Do you see a glaring hole in this kind of approach?

  2. I wanted to look for the common patterns in my graphs -- let's call them {G_1, G_2, G_3, ...} -- by embedding the nodes into a common space, in which (I hope) the patterns would be easier to find. In your other repo (https://github.com/tkipf/gcn) you're suggesting that when dealing with multiple graphs, one might create a big adjacency matrix by concatenating the adjacency matrices of these graphs {G_1, G_2, G_3, ...}. How large can this concatenated adjacency matrix be? Have you reached a limit? And if so, what was it?

  3. I was also wondering if you think it would be possible to look for patterns in the matrices W (self.vars['weights']) in the GraphConvolutionSparse layer (as an analogy to what you might do with filters in "classical" convolutional neural networks). Now that I think about it, it seems unlikely because the ordering of the rows/columns in the adjacency matrix is pretty much random. But maybe you've given this idea some thought and you'd be willing to point me to an answer: "yes, it can be done" or: "no, no way".

Please, let me know what you think, I would really appreciate it!
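
One way to picture the "big adjacency matrix" from question 2 is a block-diagonal layout, where each graph keeps its own block and no edges cross between graphs. The sketch below is only an illustration of that idea, not code from the repository: `block_diagonal` is a hypothetical helper and the two toy graphs are made up.

```python
import numpy as np

def block_diagonal(adjs):
    """Place square adjacency matrices along the diagonal of one larger matrix."""
    total = sum(a.shape[0] for a in adjs)
    big = np.zeros((total, total), dtype=adjs[0].dtype)
    offset = 0
    for a in adjs:
        n = a.shape[0]
        big[offset:offset + n, offset:offset + n] = a
        offset += n
    return big

g1 = np.array([[0, 1], [1, 0]])                   # toy graph with 2 nodes
g2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # toy graph with 3 nodes
big_adj = block_diagonal([g1, g2])
print(big_adj.shape)  # (5, 5)
```

For many or large graphs, a sparse representation (for example `scipy.sparse.block_diag`) avoids storing the zero blocks explicitly.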

Why I got the 0 feature data for all nodes?

Hi, thank you again for your great work. Now I have another problem that I need your help with.
I process many graphs one by one. Because the value of features_nonzero differs between graphs and I can only initialize one model, I did the following:
The embedding of each node looks like [0.0, 0.3024, 2.034 ... 0.0, 0.005, 0.03, 1.03]; its length is 2048 and it contains many 0 values. To ensure features_nonzero is 2048, I replaced each 0 value with 1e-15. I then tried to get all node embeddings from a GAE model, but I got an all-zero matrix:

[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

Why is that? If I don't do this, how do I process many graphs with one model?
What's more, I can't do it like this:
[image]

About input data

In order to use your own data, you have to provide

an N by N adjacency matrix (N is the number of nodes), and
an N by D feature matrix (D is the number of features per node) -- optional

Can you show me the format of these two matrices, with an example? Thanks in advance.
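
As a rough illustration of the two inputs described above, here is a minimal sketch that builds an N by N adjacency matrix and an N by D feature matrix for a made-up 4-node graph. The edge list and the choice of identity features are assumptions for the example, not something the repository prescribes.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]  # made-up undirected edge list
num_nodes = 4

# N x N adjacency matrix: entry (u, v) is 1 if nodes u and v are connected.
adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
for u, v in edges:
    adj[u, v] = 1.0
    adj[v, u] = 1.0  # undirected graph, so the matrix is symmetric

# N x D feature matrix: one row per node. With no real features available,
# an identity matrix (one-hot node IDs, so D == N) is a common placeholder.
features = np.eye(num_nodes, dtype=np.float32)

print(adj.shape, features.shape)  # (4, 4) (4, 4)
```

Other issues on this page pass SciPy sparse matrices (for example the output of `nx.adjacency_matrix`), so wrapping the dense array with `scipy.sparse.csr_matrix(adj)` is probably needed before handing it to the training script.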

features format

I have some questions about the feature format of x, tx, and allx in the Cora dataset.
Are these sparse matrices?
Also, what do the node features mean? One-hot? The edges the node connects to?
How can I feed my own features into this model?
Thanks!

Input file and command interface

Hi,
Please let me ask for an improvement.
Can you improve the input file format? It's a little complicated for machine learning beginners like me.
For example, split the input data into training and test sets using a folder structure, and use JSON for graph structures and features.
I tried to work out the details of the current input data format using a debugger, but it is not yet clear to me.
Please consider this, or give me your advice on making a pull request for it.

a bug in ismember function?

What is the function ismember in preprocessing.py used for?
To check whether an edge is in an edge list?
For example:
a = [1, 2]
b = [[1,2], [2,3]]
ismember(a, b) will return False, but edge [1,2] is already in [[1,2], [2,3]].
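
To make the behaviour easier to check, here is a small sketch of a broadcasting-based membership test. It assumes the function body matches the `rows_close = ...` line quoted in the tracebacks elsewhere on this page, followed by `np.any`; the `return` line is an assumption. It also assumes the inputs are NumPy arrays, since plain Python lists do not support the `b[:, None]` indexing.

```python
import numpy as np

def ismember(a, b, tol=5):
    # Compare every row of `a` with every row of `b` via broadcasting.
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
    # Assumed return: True if any row of `a` matches any row of `b`.
    return np.any(rows_close)

a = np.array([1, 2])
b = np.array([[1, 2], [2, 3]])
print(ismember(a, b))                 # edge [1, 2], which is in b
print(ismember(np.array([3, 1]), b))  # edge [3, 1], which is not in b
```

Under these assumptions the first call reports a match and the second does not, so a `False` result for `[1, 2]` may point to the inputs being lists or having a different shape than expected.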

Can I get the same feature in this case?

Hi,
I got the features of the whole graph by extracting the features of each node with gae and adding them together, but now I have a question and hope to get your help.
Now I have two graphs that share 60% of their nodes (the same presentation features, the same adjacency relationships). But GCN processes one graph at a time, so I can only input the two graphs into the network individually. In that case, will the node features for the shared 60% of nodes be the same in both graphs? What can I do to ensure that the shared 60% of nodes get the same features?

Can I use this model for node classification too?

Hi,

I just made a little modification on vgae to perform node classification in an unsupervised fashion.

In detail, I just used full adjacency matrix since there is no need for link prediction. Everything else is the same as the original implementation.

After training the model, I randomly chose 40% of all nodes as the training set for the default logistic regression module of Sklearn (as practiced by many authors, including those of 'A graph autoencoder for attributed network embedding'). I used 20% as validation and 40% as test set.

On the Cora dataset, I found that accuracy is just 31% on the validation set over several runs (random splits of the dataset). I don't see what the problem with my approach is and I would like your advice. Or is it just that an unsupervised approach is not appropriate for node classification?

Thanks a lot.

Run GAE with my own graph

Hi Thomas!
Thanks for your really great work! I try to use my self-designed graph to train this model. I provide the adjacency matrix like this:
[[0 1 0 1 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 0 0 0 0 0 0 0 0]
[0 1 0 1 0 0 0 0 0 0 0 0 0]
[1 0 1 0 1 0 0 0 0 0 0 0 0]
[0 0 0 1 0 1 0 0 1 0 0 0 0]
[0 0 0 0 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 1 0 1 0 0 0 0 0]
[0 0 0 0 0 0 1 0 1 0 0 0 0]
[0 0 0 0 1 0 0 1 0 1 0 0 0]
[0 0 0 0 0 0 0 0 1 0 1 0 1]
[0 0 0 0 0 0 0 0 0 1 0 1 0]
[0 0 0 0 0 0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 0 0 0 1 0 1 0]]

and this is my input_data code:
def creat_my_graph():
    G = nx.Graph()

    G.add_nodes_from(range(1,5))
    G.add_edges_from([[1,2],[2,3],[3,4],[4,1]])

    G.add_nodes_from(range(5,10))
    G.add_edges_from([(5,6),[6,7],[7,8],[8,9],[9,5]])

    G.add_nodes_from(range(10,13))
    G.add_edges_from([(10,11),[11,12],[12,13],[13,10]])

    G.add_edges_from([(4,5),(9,10)])
    nx.draw(G)
    adj = nx.adjacency_matrix(G)
    features = np.identity(adj.shape[0])
    print(adj.todense())
    plt.show()
    return adj, features

But there is an error:
  File "/Users/xieximing/android/gae-master_embedding/gae/preprocessing.py", line 59, in ismember
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
ValueError: operands could not be broadcast together with shapes (0,) (30,1,2)

If I change the adjacency matrix like this:
[[0 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
[1 0 1 1 1 1 0 0 0 0 0 0 0 0 0]
[1 1 0 1 1 1 0 0 0 0 0 0 0 0 0]
[1 1 1 0 1 1 0 0 0 0 0 0 0 0 0]
[1 1 1 1 0 1 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 1 0 1 1 1 1]
[0 0 0 0 0 0 0 0 0 1 1 0 1 1 1]
[0 0 0 0 0 0 0 0 0 1 1 1 0 1 1]
[0 0 0 0 0 0 0 0 0 1 1 1 1 0 1]
[0 0 0 0 0 0 0 0 0 1 1 1 1 1 0]]
this error doesn't show up.
I'm quite confused about it.
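
One possible reading of the shapes in the error: the first graph has 15 undirected edges (30 directed entries, matching the `(30,1,2)` operand), and the `(0,)` operand looks like an empty array of sampled edges. The sketch below assumes the validation split takes roughly 5% of the edges, rounded down. That split rule is an assumption about `preprocessing.py`, not something shown on this page. Under it, 15 edges give zero validation edges, while the second, denser matrix has 34 edges and gives one.

```python
import numpy as np

num_edges_small = 15   # undirected edges in the 13-node graph above
num_edges_dense = 34   # undirected edges in the 15-node graph above

# Assumed split rule: 5% of the edges, rounded down.
num_val_small = num_edges_small // 20   # 0
num_val_dense = num_edges_dense // 20   # 1

# An empty sample has shape (0,), which cannot broadcast against (30, 1, 2).
val_edges_false = np.array([])
edges_all = np.zeros((2 * num_edges_small, 2))
try:
    val_edges_false - edges_all[:, None]
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```

If this is what happens, a graph with at least 20 edges, or a guard that skips the membership check for empty arrays, would avoid the error.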

Generation of new graph from trained GVAE

Dear @tkipf,
I really appreciate your work and I would like to adapt your code to my own use cases. In particular, I would like to generate new graphs by sampling from a learned latent space, as is usually done with images in Variational Autoencoder models. In such a case, new data (images) can be generated by sampling from a latent space which is constrained to be normally distributed.
However, in your implementation, it is not really clear to me whether this could be done.
As far as I understand, the reconstruction of the original adjacency matrix is performed by an inner product of the embedded input z_mean. This implies that in order to generate new graphs, I cannot sample from a standard Normal distribution, since there would be no trained layers to use. Did I understand correctly?
Is there any other way to train your model so that I can sample from a Normal distribution after training?

Thanks in advance for your precious help.
Bests,

MemoryError

I use my own data: an 8424 x 8424 adjacency matrix and an 8424 x 768 feature matrix. When I ran the model, the following error occurred:
Traceback (most recent call last):
  File "/home/yangjingying/PycharmProjects/gae-master/gae/train.py", line 53, in <module>
    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
  File "/home/yangjingying/PycharmProjects/gae-master/gae/preprocessing.py", line 101, in mask_test_edges
    assert ~ismember(test_edges_false, edges_all)
  File "/home/yangjingying/PycharmProjects/gae-master/gae/preprocessing.py", line 62, in ismember
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
MemoryError
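
The expression in the traceback broadcasts `a` against `b[:, None]`, so the intermediate array has shape `(len(b), len(a), 2)` and its size grows with the product of the two edge counts. A set-based membership test is one alternative that avoids that intermediate. The sketch below is a swapped-in technique rather than the repository's code, and `to_edge_set` and `any_member` are hypothetical helper names.

```python
def to_edge_set(edges):
    """Turn an iterable of (u, v) pairs into a set of integer tuples."""
    return {(int(u), int(v)) for u, v in edges}

def any_member(candidates, edge_set):
    """Return True if at least one candidate edge is contained in edge_set."""
    return any((int(u), int(v)) in edge_set for u, v in candidates)

edges_all = to_edge_set([(0, 1), (1, 0), (1, 2), (2, 1)])
print(any_member([(0, 2), (2, 0)], edges_all))  # False
print(any_member([(0, 2), (1, 2)], edges_all))  # True
```

Memory use here is proportional to the number of edges rather than to the product of the two list lengths; the helpers also accept rows of a NumPy edge array.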

Sigmoid not used?

Hi @tkipf
It seems like a simple linear activation is used for the decoder?
If yes, why isn't sigmoid used (as mentioned in the paper)?

Thanks in advance!

Large graphs

Hey
I'm trying to use node embeddings (in order to find a clustering),
using 71607 nodes with 75662 edges.

I got:
2019-09-04 12:21:14.928295: W tensorflow/core/framework/allocator.cc:124] Allocation of 20510249796 exceeds 10% of system memory.
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

It looks like this is due to a memory error. Is there any way to overcome it?

I have about 64GB of RAM
Thanks!
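
For reference, the allocation size in the log equals the size of one dense N by N array of 4-byte floats for N = 71607. Which tensor is actually being allocated is an assumption (a dense reconstruction of the adjacency matrix would fit), but the arithmetic itself can be checked:

```python
num_nodes = 71607
bytes_per_float32 = 4

# Size of a single dense N x N float32 array.
dense_bytes = num_nodes * num_nodes * bytes_per_float32
print(dense_bytes)             # 20510249796, the figure in the log
print(dense_bytes / 1024**3)   # the same size expressed in GiB
```

A few such arrays (activations, gradients, temporaries) would be enough to exhaust 64 GB, which is consistent with the process being killed.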

How to deconstruct from decoder?

Hello,

I was working with the decoder of Autoencoder example, and wonder how to reconstruct a graph with decoder output.
If I understood right, the decoder computes edge probabilities, and I got them like below.

tensor([0.6745, 0.6745, 0.6551, 0.6551, 0.6286, 0.6286, 0.6391, 0.6391, 0.5424,
        0.5424, 0.5542, 0.5542, 0.5694, 0.5694, 0.5375, 0.5375, 0.4936, 0.4936,
        0.6254, 0.6254, 0.5815, 0.5815, 0.6029, 0.6029, 0.5186, 0.5186, 0.4716,
        0.4716, 0.5158, 0.5158, 0.5058, 0.5058, 0.5641, 0.5641, 0.4994, 0.4994,
        0.5030, 0.5030, 0.6071, 0.6071, 0.6064, 0.6064, 0.5250, 0.5250, 0.5184,
        0.5184, 0.5531, 0.5531, 0.5415, 0.5415, 0.5445, 0.5445, 0.5138, 0.5138,
        0.5075, 0.5075, 0.4968, 0.4968, 0.5199, 0.5199, 0.4946, 0.4946, 0.5524,
        0.5524, 0.5587, 0.5587, 0.5585, 0.5585, 0.5088, 0.5088, 0.4806, 0.4806,
        0.5119, 0.5119, 0.5122, 0.5122, 0.5117, 0.5117, 0.5116, 0.5116, 0.5283,
        0.5283, 0.5211, 0.5211, 0.5121, 0.5121, 0.5273, 0.5273, 0.5119, 0.5119,
        0.5117, 0.5117, 0.4990, 0.4990, 0.4986, 0.4986, 0.5036, 0.5036, 0.5067,
        0.5067, 0.4918, 0.4918, 0.4983, 0.4983, 0.5210, 0.5210, 0.5012, 0.5012,
        0.5017, 0.5017, 0.5477, 0.5477, 0.5475, 0.5475, 0.4924, 0.4924, 0.5084,
        0.5084, 0.5098, 0.5098, 0.5256, 0.5256, 0.5719, 0.5719, 0.5012, 0.5012,
        0.5010, 0.5010, 0.5020, 0.5020, 0.5064, 0.5064, 0.5063, 0.5063, 0.5221,
        0.5221, 0.6704, 0.6704, 0.5566, 0.5566, 0.6233, 0.6233, 0.5059, 0.5059,
        0.5069, 0.5069, 0.5085, 0.5085, 0.5048, 0.5048, 0.5051, 0.5051, 0.5887,
        0.5887, 0.6524, 0.6524, 0.5295, 0.5295, 0.5474, 0.5474, 0.5241, 0.5241,
        0.5059, 0.5059, 0.5568, 0.5568, 0.5497, 0.5497, 0.5727, 0.5727, 0.5397,
        0.5397, 0.5805, 0.5805, 0.5577, 0.5577, 0.5569, 0.5569, 0.6282, 0.6282,
        0.6124, 0.6124, 0.6134, 0.6134, 0.5117, 0.5117, 0.4991, 0.4991, 0.5032,
        0.5032, 0.6867, 0.6867, 0.6140, 0.6140, 0.6222, 0.6222, 0.6541, 0.6541,
        0.6641, 0.6641, 0.7468, 0.7468, 0.7686, 0.7686, 0.6775, 0.6775, 0.7056,
        0.7056, 0.7104, 0.7104, 0.5137, 0.5137], grad_fn=<SigmoidBackward>)

The length is 106, but the original data that I encoded has 212 edges and 117 nodes.

Data(edge_index=[2, 212], x=[117, 1])

How do I know which edge each probability represents?
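
A minimal sketch of how the outputs could line up with edges, assuming the decoder is an inner-product decoder evaluated on the columns of the `edge_index` passed to it, so that the i-th probability belongs to the i-th column. That ordering is an assumption about the library in use; the embeddings and edges below are made up, and `decode` is a hypothetical stand-in written with plain lists.

```python
import math

def decode(z, edge_index):
    """Inner-product decoder: one probability per column of edge_index."""
    sources, targets = edge_index
    probs = []
    for u, v in zip(sources, targets):
        score = sum(a * b for a, b in zip(z[u], z[v]))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs

z = [[0.5, 1.0], [1.0, -0.5], [0.2, 0.3]]   # made-up node embeddings
edge_index = [[0, 1, 1, 2],                 # source nodes
              [1, 0, 2, 1]]                 # target nodes (each edge in both directions)
probs = decode(z, edge_index)
for u, v, p in zip(edge_index[0], edge_index[1], probs):
    print((u, v), round(p, 4))
```

Because the inner product is symmetric, the two directions of an undirected edge receive the same value, which would be consistent with the repeated pairs in the tensor above.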

Can N by N adjacency matrix using float?

Hi.
I am new to GAE. I have graphs whose nodes have distances between each other. The adjacency matrix, which is used to denote the distances, will be fully connected with an all-zero diagonal. So I was wondering whether my graph is suitable for GAE, because I haven't found similar information in the paper.

Hope for your reply.
Thanks.

hyper-parameter learning

Hi Kipf,
I would like to know how you did hyper-parameter search.
Would be helpful for applying this code to other datasets.

what is the element of model.reconstructions in model.py?

Sorry, but I can't understand line 114 in model.py: what are the elements of self.reconstruction? Probability values? When I run the program and print the variable reconstructions, I get the result:
reconstruction [ 2.3758938 -0.10602921 0.103589 ... 0.8698976 3.3979175
3.6787224 ]
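
Values outside [0, 1], including negative ones, suggest these are raw scores (logits) rather than probabilities. Assuming that is the case, which would also fit the use of `weighted_cross_entropy_with_logits` mentioned in another issue on this page, a sigmoid maps them to edge probabilities. A minimal sketch using the printed values:

```python
import math

def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.3758938, -0.10602921, 0.103589, 0.8698976, 3.3979175, 3.6787224]
probs = [sigmoid(x) for x in logits]
print([round(p, 3) for p in probs])
```

Negative scores map to probabilities below 0.5 and positive scores to probabilities above 0.5.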

about features and val_edges_false

Hi tkipf, thanks for your code! I have two questions about the data:

  1. In "input_data.py", why did you adjust the order of the test node features in "features"? Does this operation cause the index of a node in "graph" to be inconsistent with its index in "features"?
  2. In "preprocessing.py", why is "val_edges_false" created differently from "test_edges_false"? Does this cause "val_edges_false" to contain test edges that actually exist in the graph?

These two questions have been bothering me for a long time. I am looking forward to your reply. Thank you!

About weighted graph

Hi Thomas!
Thanks for your really great work!
I am trying to use my weighted graph to train this model,
but I don't know whether the GAE model can be applied to weighted graphs. If it can, could you give some guidance?

how to add the edge features?

Hi @tkipf ,

I am trying to use my own data with gae, and the data has edge features (floats from 0.0 to 1.0). Is it suitable for gae? If so, how can I improve my implementation? In addition, I need to look at the data after decoding; where can I find these results?

Looking forward to your reply!
Thanks

Low training accuracy while high ROC score

I'm using your code on a different dataset (SNOMED), so I use my own function to load the data and create all matrices similar to what you generated.

My issue is that training accuracy is decreasing throughout the training and it reaches 5% after 200 epochs while ROC and AP are increasing and are in the range of 88-90% after the training is finished.

Do you think this is an issue? That is, are the values for ROC and AP valid even with low training accuracy?

BTW, thanks for the awesome code!

Implementing RGCNs

Thanks for your really great work! I'd like to know if you plan to extend this GAE implementation for learning embeddings of multi-relational graphs (N x N x M) as in the recent RGCN paper from your group. I'm very excited about that paper and would like to help if there are ongoing efforts to provide an implementation. Do let me know!

batch training

Sorry to keep asking questions! I now have two questions to confirm.

  1. I read your code carefully, and found that model training just finds better weight parameters self.vars['weights']? (That is, only self.vars['weights'] is updated during the training process?)

  2. I found that feed_dict is unchanged during the training process. Now I have 1000 graphs (different adjacency matrices and variable numbers of nodes, but the same node features). So can I input data from a different graph in each epoch? If this is not possible, what is the reason?

Thanks for your patient answers!!!

Model storage

Hello Thomas!
Using the library, I could not figure out how to store the model for subsequent use and how to get embeddings from it. Could you help me out with this?

Thank you!

Training set and accuracy calculation

Hi @tkipf
When I reviewed the code, I saw that the training set you generate is a subgraph of the original graph, and that when calculating the loss in the model, the function weighted_cross_entropy_with_logits compares the pred_scores with the subgraph adjacency matrix.

The ones in the subgraph adjacency matrix represent the train_edges, while the zeros include both val_edges_false and test_edges_false as training examples. Is that true?

If yes, I think the loss should be calculated by sampling the adjacency matrix with train_edges and train_edges_false, which represent the 1s and 0s respectively.

Thus, my suggestion is to split the edges into
E_G = E_train + E_val + E_test
and split the non-existent edges into
\bar{E}_G = \bar{E}_train + \bar{E}_val + \bar{E}_test

Also, the reconstructed matrix is not a matrix with values in [0, 1],
so I think there is a problem when applying the sigmoid function to calculate the accuracy. Most of the entries in the reconstructed matrix are positive, which could lead to a biased calculation. Is that correct?

UnicodeDecodeError

Hi, I have an issue similar to one in 'gcn' (tkipf/gcn#6).

On running the train.py file the following error is observed:

vedang@vedang-HP-Pavilion-Notebook:~/gae/gae$ python train.py --dataset cora
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    adj, features = load_data(dataset_str)
  File "/home/vedang/anaconda3/lib/python3.6/site-packages/gae-0.0.1-py3.6.egg/gae/input_data.py", line 19, in load_data
    objects.append(pkl.load("data/ind.{}.{}".format(dataset, names[i])))
  File "/home/vedang/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I changed the read format to open(filename, 'rb') but now I get the following error:

vedang@vedang-HP-Pavilion-Notebook:~/gae/gae$ python train.py --dataset cora
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    adj, features = load_data(dataset_str)
  File "/home/vedang/anaconda3/lib/python3.6/site-packages/gae-0.0.1-py3.6.egg/gae/input_data.py", line 19, in load_data
    objects.append(pkl.load("data/ind.{}.{}".format(dataset, names[i], 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x87 in position 32: ordinal not in range(128)

Can you please let me know what changes can be done or look into this issue?
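
In the line shown in the second traceback, `'rb'` ends up as an argument to `format` rather than to `open`, so it has no effect on how the file is read. A minimal sketch of loading a pickle in binary mode under Python 3 follows. The file it reads is a stand-in written by the example itself, and `encoding="latin1"` is a commonly suggested setting for pickles written by Python 2; whether the dataset files need it is an assumption.

```python
import os
import pickle
import tempfile

# Write a small protocol-2 pickle so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "ind.example.x")
with open(path, "wb") as f:
    pickle.dump({"rows": [1, 2, 3]}, f, protocol=2)

# The mode belongs to open(), not to str.format(); pickle needs a binary file object.
with open(path, "rb") as f:
    obj = pickle.load(f, encoding="latin1")

print(obj)
```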

InvalidArgumentError for large dataset

Hi Tomas,

I am able to use gae on small datasets (<10k nodes) without any problem, but when I tried it on a large dataset, I got the following error:

2019-03-15 15:36:48.653709: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
  File "train.py", line 187, in <module>
    outs = sess.run([opt.opt_op, opt.cost, opt.accuracy], feed_dict=feed_dict)
  File "/home/gurukar.1/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/gurukar.1/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/gurukar.1/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/gurukar.1/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape output type is 32-bit but dim 0 is 3187957444
[[node optimizer/gradients/optimizer/logistic_loss/mul_1_grad/Shape_1 (defined at /scratch/line_verse/vag/vgae/gae/optimizer.py:15) = ShapeT=DT_FLOAT, out_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

I did some research (tensorflow/tensorflow#23107) and it seems to be related to the int32 datatype. I am thinking about changing all int32 in optimizer.py to int64. Do you think it will work? I would like to consult you first, since it takes quite a long time to run on the large dataset.
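
A quick check of the number in the error message: it is a perfect square and larger than the largest signed 32-bit integer. That would be consistent with a flattened N by N tensor for N = 56462, although whether that is the actual node count of the dataset is an assumption.

```python
import math

dim0 = 3187957444        # value reported in the error message
int32_max = 2**31 - 1    # 2147483647, largest signed 32-bit integer

side = math.isqrt(dim0)
print(side, side * side == dim0)   # 56462 True
print(dim0 > int32_max)            # True
```

If so, the limit is hit by the flattened size of the dense reconstruction itself rather than by any single setting in `optimizer.py`.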

Meaning of 'features' object

Dear @tkipf
first of all, thank you for the excellent work! Your paper and the provided code are helpful to get started with GCN.

I'm currently trying to apply your algorithm to my data.
After looking at the load_data() function, I was able to create an adjacency matrix in the same format as your Cora example.

However, I struggle with the node feature object, because I don't understand the meaning of the content cora.tx and cora.allx.
The shape is (2709x1433) (#41 nodes x #features), so apparently, there are 1433 node features.
The print is as follows

(0, 19) 1.0
(0, 81) 1.0
(0, 146) 1.0
(0, 315) 1.0
(0, 774) 1.0
(0, 877) 1.0
(0, 1194) 1.0
(0, 1247) 1.0
(0, 1274) 1.0
(1, 19) 1.0
(1, 88) 1.0
(1, 149) 1.0
(1, 212) 1.0
(1, 233) 1.0

How can we interpret the rows? I can't make sense of it.
This format is usually an edge list format, but as we have node features, how do the edges come into play?

I looked into your other repositories and issues around this topic and couldn't find anything which helps me understand the structure of the .tx and .allx files.

tkipf/gcn#36
tkipf/gcn#125
tkipf/gcn#114
tkipf/gcn#36
tkipf/gcn#22
#35

I'm planning to use node degree as recommended here tkipf/gcn#22 (and add more features later on)
My current attempt is to do

node_deg = dict(G.degree()).values()
features = sparse.csr_matrix(node_deg).T

As I don't understand the Cora output, I can't really assess if that is correct or not.

Could you provide more guidance and explanation for that?

That would be great :)
Thank you in advance,
Best,
Minh
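
The printed triples look like the standard SciPy sparse-matrix printout, where `(i, j) v` means that row i (a node) has value v in column j (a feature); it is not an edge list. For the node-degree idea, here is a minimal sketch that derives an N by 1 feature matrix directly from an adjacency matrix. The 3-node graph is made up for illustration.

```python
import numpy as np

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=np.float32)   # made-up 3-node graph

degrees = adj.sum(axis=1)           # shape (N,): number of neighbours per node
features = degrees.reshape(-1, 1)   # shape (N, 1): one feature per node
print(features.shape)               # (3, 1)
```

Under Python 3, `dict(G.degree()).values()` is a view object, so it likely needs to be wrapped in `list(...)` before being passed to `sparse.csr_matrix`; the transpose then gives the same N by 1 layout as above.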

Normalisation constant for KL divergence

Hey @tkipf Great work with the implementation. I was trying to reproduce the results from the workshop paper and noticed that the results were quite different. I double checked with your implementation and noticed that you divide the KL divergence with num_nodes here. This should not be required because we are already considering the mean across all the nodes. It would be great if you could throw some light on this.

Assertion error ismember

When I train gae with my own adjacency matrix, I receive an assertion error in the ismember function.

Traceback (most recent call last):
  File "train.py", line 49, in <module>
    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
  File "build/bdist.linux-x86_64/egg/gae/preprocessing.py", line 99, in mask_test_edges
    assert ~ismember(val_edges_false, edges_all)
AssertionError

I load my matrix as
adj = sp.load_npz('adj.npz')
by modifying train.py

And run gae as python2 train.py --features=0

What could be the reason and how to solve this?

epochs to train:200 vs.1000

Epoch: 0198 train_loss= 0.40438 train_acc= 0.80021 val_roc= 0.91772 val_ap= 0.93526 time= 0.42112
Epoch: 0199 train_loss= 0.40435 train_acc= 0.80037 val_roc= 0.91785 val_ap= 0.93544 time= 0.51987
Epoch: 0200 train_loss= 0.40434 train_acc= 0.80052 val_roc= 0.91780 val_ap= 0.93540 time= 0.49462
Optimization Finished!
Test ROC score: 0.923083293426
Test AP score: 0.929091873277

Epoch: 0999 train_loss= 0.39955 train_acc= 0.89461 val_roc= 0.86105 val_ap= 0.85885 time= 0.43444
Epoch: 1000 train_loss= 0.39954 train_acc= 0.89462 val_roc= 0.86104 val_ap= 0.85884 time= 0.45543
Optimization Finished!
Test ROC score: 0.873462980099
Test AP score: 0.893914158688

Loss function in optimizer.py

Hi @tkipf,

Thank you for sharing the implementation.

In the OptimizerVAE class, when defining the KL divergence, I think there is an extra (1/num_nodes)^2 term.
One num_nodes comes from (0.5 / num_nodes), and the other is introduced by tf.reduce_mean.

This contradicts the results in Auto-Encoding Variational Bayes by Kingma and Welling (Appendix B, Solution of -KL, Gaussian case).

Could you expound on this a bit more?

Thanks!
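
For reference, the closed form from Appendix B of Auto-Encoding Variational Bayes that this issue refers to, for a single data point with posterior $q(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and prior $p(z) = \mathcal{N}(0, I)$ over $J$ latent dimensions:

```latex
-D_{\mathrm{KL}}\bigl(q(z \mid x) \,\|\, p(z)\bigr)
  = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

This expression is per data point (here, per node). How it is then averaged or scaled across nodes is the part the issue questions, and it is not settled by the formula itself.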

Edges in val_edges and test_edges are trained with label '0'?

Hi @tkipf ,

It seems model.reconstructions includes all the positive and negative edges, whether in the training, validation or test set, whereas the labels used during training only have "1" entries on train_edges. This confuses me, since it looks like the model is trained to score "1" for train_edges and "0" for val_edges and test_edges. How come the validation and test accuracy are that good? Would it be better to mask out val_edges and test_edges before feeding model.reconstructions to the optimizer?

Looking forward to your reply.
Thanks.

Questions about pos_weight and norm

Hi @tkipf:
I have questions about 'pos_weight' on line 77 and 'norm' on line 78 in https://github.com/tkipf/gae/blob/master/gae/train.py.

For unbalanced tasks, we generally only need to rebalance the samples for training. As in the general case, you already have the same number of negative and positive samples through negative sampling, so I don't see why we should put extra weight (pos_weight) on the positive samples.

So, can you explain it?

Another question: 'norm' seems to just scale the loss. I suppose that if we removed it, the result wouldn't change.
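
To make the two quantities concrete, here is a small sketch under the assumption that both are derived from the full N by N adjacency matrix: `pos_weight` as the ratio of zero entries to nonzero entries, and `norm` as N*N divided by twice the number of zero entries. These definitions are an assumption about `train.py`, not quoted from this page, and `loss_weights` is a hypothetical helper.

```python
def loss_weights(num_nodes, num_edge_entries):
    """Assumed definitions of pos_weight and norm for an N x N adjacency matrix."""
    total = num_nodes * num_nodes          # all entries of the matrix
    negatives = total - num_edge_entries   # entries equal to 0
    pos_weight = negatives / num_edge_entries
    norm = total / (negatives * 2)
    return pos_weight, norm

# Made-up example: 100 nodes, 400 nonzero adjacency entries.
pos_weight, norm = loss_weights(num_nodes=100, num_edge_entries=400)
print(pos_weight, norm)
```

If the loss is indeed computed over all N*N entries rather than over a balanced negative sample, the positives are heavily outnumbered, which would explain the extra weight; `norm` would then only rescale the loss, as the question suggests.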

No such file or directory: 'data/planetoid/ind.cora.x'

@tkipf thanks

python train.py
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    adj, features = load_data(dataset_str)
  File "build/bdist.macosx-10.7-x86_64/egg/gae/input_data.py", line 19, in load_data
    objects.append(pkl.load(open("data/planetoid/ind.{}.{}".format(dataset, names[i]))))
IOError: [Errno 2] No such file or directory: 'data/planetoid/ind.cora.x'

Graph-level autoencoder

Hi Kipf,

Thanks a lot for your excellent paper. I have two questions on GCNModelAE model.

  1. InnerProductDecoder can be seen as an inverse operation of the encoder operation (GraphConvolutionSparse() followed by GraphConvolution()). It's like encoder/decoder architecture in sequence autoencoder, convolutional autoencoder, or stacked denoising autoencoder. However, InnerProductDecoder() inverses GraphConvolution() only while skipping the inverse of GraphConvolutionSparse() in order to get self.reconstructions. If you look at any typical autoencoder, the reconstruction is the same dimension as the original input (NxD; where N is the number of training samples and D is the feature vector size-1433 in cora dataset for instance). Your example doesn't seem to do this second inverse operation in decoder. Is it because your example code tries to auto-encode against 'node topology' (not against node features)?

  2. Perhaps this question is related to my first question.
    I'd like to cluster a list of graphs based on node features as well as graph topology. I thought this could be done using GCN-AE. If it's possible, how would you suggest implementing the decoder and cost function?

Your answers would be greatly appreciated.

spark

Data question

hi, tkipf:
I have a question about the datasets. As far as I know, all three datasets you use are directed graphs. But in your experiments, they seem to be treated as undirected.

Why is roc computed using model.z_mean even in the VGAE case?

Hi,
first of all, thanks for your work.
I'm looking at the code to understand the training process, but I can't understand why you used

emb = sess.run(model.z_mean, feed_dict=feed_dict)

(line 107 in train.py) to compute the embeddings at test time for both models.
Shouldn't the VGAE use model.z or am I missing something?
Also, why do you recompute the decoding manually, rather than using the model output directly?
I think one question will probably answer the other, but I really can't see it.

Thanks,
Daniele
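
To illustrate the difference the question is about, here is a minimal sketch assuming the usual VAE reparameterization, in which `z` is drawn as `z_mean + eps * exp(z_log_std)` with `eps` standard normal. The name `z_log_std` and the numbers are assumptions for the example; only `z_mean` and `z` appear in the issue.

```python
import math
import random

def sample_z(z_mean, z_log_std, rng):
    """Draw z = z_mean + eps * exp(z_log_std) with eps ~ N(0, 1), per dimension."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(s) for m, s in zip(z_mean, z_log_std)]

rng = random.Random(0)
z_mean = [0.5, -1.0]       # made-up posterior mean for one node
z_log_std = [-0.5, 0.1]    # made-up log standard deviation

draw_a = sample_z(z_mean, z_log_std, rng)
draw_b = sample_z(z_mean, z_log_std, rng)
print(draw_a != draw_b)    # two samples differ; z_mean is the same every time
```

Under this assumption, evaluating with `z_mean` gives a deterministic score, whereas evaluating with `z` would add sampling noise to every ROC/AP computation.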
