thudm / gatne Goto Github PK

View Code? Open in Web Editor NEW

517.0 16.0 144.0 9.54 MB

Source code and dataset for KDD 2019 paper "Representation Learning for Attributed Multiplex Heterogeneous Network"

License: MIT License

Python 99.89% Shell 0.11%

network-embedding heterogeneous-network representation-learning multiplex-networks attributed-networks

gatne's Introduction

GATNE

Project | Arxiv

Representation Learning for Attributed Multiplex Heterogeneous Network.

Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou, Jie Tang

Accepted to KDD 2019 Research Track!

❗ News

Recent Updates (Nov. 2020):

Use multiprocessing to speedup the random walk procedure (by --num-workers)
Support saving/loading walk file (by --walk-file)
The PyTorch version now supports node features (by --features)

Some Tips:

The PyTorch version may not reproduce the results (especially on the Twitter dataset). Please use the original TensorFlow version (src/main.py) for reproducing the paper results.
Running on large-scale datasets needs to set a larger value for batch-size to speedup training (e.g., several hundred or thousand).
If out of memory (OOM) occurs, you may need to decrease the values of dimensions and att-dim.

Our GATNE models have been implemented by many popular graph toolkits:

Deep Graph Library (DGL): see https://github.com/dmlc/dgl/tree/master/examples/pytorch/GATNE-T
Paddle Graph Learning (PGL): see https://github.com/PaddlePaddle/PGL/tree/main/examples/GATNE
CogDL: see https://github.com/THUDM/cogdl/blob/master/cogdl/models/emb/gatne.py

Some recent papers have listed GATNE models as a strong baseline:

Deep Adversarial Completion for Sparse Heterogeneous Information Network Embedding (WWW'20)
Decoupled Graph Convolution Network for Inferring Substitutable and Complementary Items (CIKM'20)
Graph Attention Networks over Edge Content-Based Channels (KDD'20)
Temporal heterogeneous interaction graph embedding for next-item recommendation (PKDD'20)
Link Inference via Heterogeneous Multi-view Graph Neural Networks (DASFAA 2020)
Multi-View Collaborative Network Embedding (Arxiv, May 2020)

Please let me know if your toolkit includes GATNE models or your paper uses GATNE models as baselines.

Prerequisites

Python 3
TensorFlow >= 1.8 or PyTorch

Getting Started

Installation

Clone this repo.

git clone https://github.com/THUDM/GATNE
cd GATNE

Please first install TensorFlow or PyTorch, and then install other dependencies by

pip install -r requirements.txt

Dataset

These datasets are sampled from the original datasets.

Amazon contains 10,166 nodes and 148,865 edges. Source
Twitter contains 10,000 nodes and 331,899 edges. Source
YouTube contains 2,000 nodes and 1,310,617 edges. Source
Alibaba contains 6,163 nodes and 17,865 edges.

Training

Training on the existing datasets

You can use ./scripts/run_example.sh or python src/main.py --input data/example or python src/main_pytorch.py --input data/example to train GATNE-T model on the example data. (If you share the server with others or you want to use the specific GPU(s), you may need to set CUDA_VISIBLE_DEVICES.)

If you want to train on the Amazon dataset, you can run python src/main.py --input data/amazon or python src/main.py --input data/amazon --features data/amazon/feature.txt to train GATNE-T model or GATNE-I model, respectively.

You can use the following commands to train GATNE-T on Twitter and YouTube datasets: python src/main.py --input data/twitter --eval-type 1 or python src/main.py --input data/youtube. We only evaluate the edges of the first edge type on Twitter dataset as the number of edges of other edge types is too small.

As Twitter and YouTube datasets do not have node attributes, you can generate heuristic features for them, such as DeepWalk embeddings. Then you can train GATNE-I model on these two datasets by adding the --features argument.

Training on your own datasets

If you want to train GATNE-T/I on your own dataset, you should prepare the following three(or four) files:

train.txt: Each line represents an edge, which contains three tokens <edge_type> <node1> <node2> where each token can be either a number or a string.
valid.txt: Each line represents an edge or a non-edge, which contains four tokens <edge_type> <node1> <node2> <label>, where <label> is either 1 or 0 denoting an edge or a non-edge
test.txt: the same format with valid.txt
feature.txt (optional): First line contains two number <num> <dim> representing the number of nodes and the feature dimension size. From the second line, each line describes the features of a node, i.e., <node> <f_1> <f_2> ... <f_dim>.

If your dataset contains several node types and you want to use meta-path based random walk, you should also provide an additional file as follows:

node_type.txt: Each line contains two tokens <node> <node_type>, where <node_type> should be consistent with the meta-path schema in the training command, i.e., --schema node_type_1-node_type_2-...-node_type_k-node_type_1. (Note that the first node type in the schema should equals to the last node type.)

If you have ANY difficulties to get things working in the above steps, feel free to open an issue. You can expect a reply within 24 hours.

Cite

Please cite our paper if you find this code useful for your research:

@inproceedings{cen2019representation,
  title = {Representation Learning for Attributed Multiplex Heterogeneous Network},
  author = {Cen, Yukuo and Zou, Xu and Zhang, Jianwei and Yang, Hongxia and Zhou, Jingren and Tang, Jie},
  booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year = {2019},
  pages = {1358--1368},
  publisher = {ACM},
}

gatne's People

Contributors

Stargazers

Watchers

Forkers

awesome-archive codeinging locussam jingmouren zhongbineden leran95 zxxxxxxxin angelasunny wangcheny lileilei66 hhh920406 lukebelieves yoohu skx300 xuetf tmacmilan chengli0327 voidmomo qzshadow songfgh yipeng5 louise-lulin juliadouble xuelun tomleung1996 changfengpolang xudamao2015 zhangkehao whuscity peisenli yuqingsheng zbn123 wangbo-xwfintech yu3401 flyingzhy xrosliang wilson-zhang shinesun130 cslele seeker1943 piaojh97 yea02 debrawang songzhen-neu spirit-dongdong littlebadrobot yiyg510 dsj96 zhangyan715 zwytop lalw young2019 wwwwwelkin maclarin pwforks lshhhhh lai-alien-z lighteningzhang mysqlsc tonygracious hhdo peacegui floatingmaple liuqinyi gaoyz0625 jiangquan8 hanchangzi baobunuo lijian10086 aabbccgithub kanglicheng aliang-cn huazai1992 pinkney03 ltkevin kiminh debuluoyi boomyunjuan takahirocecilhasegawa yiqingxyq amy-deng do-no-evil gl123456 oceansnape sherylhyx zhengyk11 xuptacm tashengjinsheng joneswong yangylin zbingbing-lava qiaojj yueyedeai sxxtyz wanglu2014 zhangjiekui siujohnjai diracmg3 feiyuxuelang datali01

gatne's Issues

关于数据集构建

以Amazon数据集为例，请问您是如何对数据进行预处理的呢？
（从Raw Data 转化为 node_type node_A node_B格式）
我的一个猜想：已知用户评分序列（按照时间先后）为A->B->C->D->E
生成node_type node_A node_B对为
1 A B
1 B C
1 C D
1 D E
请问您是这样处理的吗，请多多指教，谢谢！

when i use several node types,how to write feature.txt

questions about the amazon dataset.

Could you please explain how you preprocess the amazon dataset, like what type you choose, how to sample, and how to generate feature data, etc ..., thank you for your time:)

test

when i set schema,it goes wrong

Take only 300 person-item data for experiments, set the meta path to person-item-person, the number of walks and length are the default values, but this will give an error
line 56 in main.py
iy = vocab [y] .index KeyError: '1006114'
I printed "all_walks" , found that it did not walk to this node '1006114'
Is there something wrong with me?Looking forward to your reply

Is there any way to get the Alibaba-S dataset?

hi, thanks for your excellent work!
I want to do some experiments on the Alibaba-S dataset, I wonder if it can be released.

Regards.

question on negative sampling and equation 13

hi,
It is a very solid paper, but I meet some questions... could you give me some hints?

I see you use NCE loss in the code, however, there are many types of nodes, but NCE loss does not consider the type of negative sampling. Hence, during the negative sampling, you do not consider the type of nodes. Am I correct?

And another thing is that one context may have different words. When doing negative sampling, we only give the sampler one true label, so how the sampler prevent itself from sampling true labels? May this point not related to this work, but could you give me some hints? I can not find any materials on the website regarding this question.

The last question is about the equation 13, I see you use the D_z on the features of the target node.
Since there is a h_z focusing on the features of the target node, so is the D_z necessary? Do you do the experiment without D_z? is the performance not so good?

Thank you very much!

Validation set not used for parameter tuning

Novel idea and great work for generation node embedding for AMHEN graphs.

My doubt is according to the paper the validation set has to be used for parameter tuning, however in the code implementation the validation set has been used just for calculating performance metrics. Could you please clarify?

About attributes

hi , I have some questions about equation (13).
How the attributes of nodes in GATNE-I are applied to calculate bi and ui,r(0)?
How are hz and gz,r defined?
thank you !

增加节点类型

你好，请问增加node_type.txt后除了修改schma，还需要调整代码吗？还有可以用字符串表示节点吗？

generate_walks takes too long

When my number of nodes exceeds 100,000, it takes several days to complete generate_walks，it is too long! How to deal with large heterogeneous graphs？

Can GATNE be applied to weighted graph?

Hi！
Although I have already read through your paper, I just want to check with you whether this approach can be applied to weighted graph?

node embedding 和 k值问题

cold start nodes' embedding

When we have nodes which do not have edges (because of cold start) we currently do not get embeddings for them. Can we add the feature for getting the node embeddings even when they do not have edges which is the novel output from GATNE-I.

What's the meaning of att_head?

What's the meaning of variable 'att_head'(default 1), I don't find any information on it?
Plz, tell me, Thanks

About comparative experiments

Hello,
I want to do comparative experiments on my own dataset, but I didn't find the official python code of metapath2vec, so I'm sorry to ask, what is the metapath2vec code you are using for comparison experiments, can you share it?
Thank you very much!

Can you assign a release to this project please?

Please put a release number in Release.. Thanks and Appreciate

how to output every node embedding vector result to the file,like node2vec,etc.

weighted graph

Is it possible to execute your method for weighted graph?
Is there a theoretical problem why weighted heterogeneous network can't be vectorize?
I found some nice looking implementation of node2vec and matapath2vec, but there is no possible to set weights for edge in case of matapath2vec.
Thank you for answer !

What's the specific meaning of the four datasets

Hello,

I want to use your code and your datasets as a baseline method.

But the four datasets in the project just have the value of attributes, and don't have the specific meaning of them, thus I can't have a better understand of the datasets.

So could you please provide me the specific meaning of the attributes in the four datasets? Thank you.

D_{z}矩阵不是冗余的么

论文中公式13，D矩阵跟h映射，不是冗余的么，这里有什么考虑。代码中有这个D矩阵么

question on meth-path random walking

hi,

I see you first use the meta-path to sample nodes and get the base_walks. Then you use the base_walks to generate the vocabularies. Then, you generate all_walks according to the type of edge, and only remain these nodes of all_walks which are existed in the vocabulary.

I have two questions on it. (1) why you separate it into two steps, I mean why not use the meta-path sampling without obtaining the all_walks? (2) if we use the base_walks to generate the vocabularies, we will miss some nodes, and then we will not get the embedding of these missed nodes, so why need we to use base_walks instead of all_walks to generate vocabularies?

Could you give me some hints?
Thank you very much!

edge prediction

您好不好意思我有几个小问题

请问如果我想预测node1和node2之间哪个edges type的probability最高, 以致于可以预测edge type. 请问您有什么建议吗?
请问paper中的context embedding是怎么算出来的, 不好意思这块不是很明白, 谢谢

程序运行错误

Traceback (most recent call last):
File "src/main.py", line 415, in
average_auc, average_f1, average_pr = train_model(training_data_by_type, feature_dic, log_name + '_' + time.strftime('%Y-%m-%d %H-%M-%S',time.localtime(time.time())))
File "src/main.py", line 363, in train_model
tmp_auc, tmp_f1, tmp_pr = evaluate(final_model[edge_types[i]], valid_true_data_by_edge[edge_types[i]], valid_false_data_by_edge[edge_types[i]])
KeyError: '3'

你好，请问当我用自己的数据运行程序的时候，出现了这样的错误，是我的数据有问题吗？

edge embedding

您好
我可以从代码里的final_model拿到node embedding
请问有什么办法可以拿到edge embedding吗
感谢

feature.txt for graph containing 2 node types(user and item) with diff feature dims

如果社交图有两种节点：用户和图片，边类型可能是：点击，点赞，分享，评论等等
假设图片节点的特征向量维度和用户节点的特征向量维度不一样，feature.txt应该怎么写呢？

比如假设对于用户节点， f_dim = 200, 对于图片节点， f_dim = 100。

可以解决节点分类问题吗？

请问算法可以解决节点分类问题吗？

What method did you used to make link prediction?

When got the embedding of the nodes on every type of edges, I am not clearly with how to used the embedding to make the prediction? I would appreciate it if you could answer this question.

关于表格中网络类型的分类

请问一下，metapath2vec 和 PTE 放到 HEN （多类型节点，单类型边）是如何考虑的呢，这两个困扰我好久了，感谢

How to prepare data with node attributes for GATNE-I?

an example input .txt file for node atrributes?

Excellent work.

Hi @cenyk1230 ,

Could you share why you're so niubi?

Thanks.

KeyError: 'Base' when running the example

KeyError: 'Base' for CUDA_VISIBLE_DEVICES=0 python src/main.py --input data/example

Cosine similarity between generated vectors is close to 1.0?

Hi, thank you for your excellent work!

I have a question about the cosine similarity between the generated vectors of the nodes.

For Amazon dataset and one dataset of my own, threshold(https://github.com/THUDM/GATNE/blob/master/src/main.py#L100) printed during training is always greater than 0.98, and the final similarity score between any two nodes is close to 1, most of which is greater than 0.95, which seems that the generated vectors are distributed in a limited area and makes me very confused.

I change nothing about the experiment setups for Amazon dataset.

Do you have any idea about this result? Thanks ahead!

more than one schema

hi
How to set up if there is more than one schema between two nodes？
How can the model be used to calculate the correlation of nodes？

运行amazon例子的时候报错了

运行时出了这个错误
File "src/main.py", line 363, in train_model
tmp_auc, tmp_f1, tmp_pr = evaluate(final_model[edge_types[i]], valid_true_data_by_edge[edge_types[i]], valid_false_data_by_edge[edge_types[i]])
KeyError: 'Base'
sample运行的时候没出问题的时候edge_type变量是['1', '2', 'Base']
Amazon print出来的edge_type变量是['2', 'Base', '1']
请问这个问题该怎么解决啊……

Is Alibaba or Alibaba-S data set available？

how to use schema?

if graph content item nodes and user nodes[user1 item199，user2 item65，...，user89 item889]，what is each line of the node_type.txt，Can give concrete examples about -- schema and node_type.txt；if dataset contains several node types and do not use schema，i can distinguish node type by feature.txt (feature.txt columns can be: node_id user_age user_job item_size item_price. for user nodes,the item_size and item_price is 0), whether this method works？

Question on vertex embedding

Hi,
I find the paper very interesting and to read through.
However, I have a question regarding the final learned embedding.
In Algorithm 1, line 6, the model learns an embedding of a vertex specific to the relation, which is also evident through eq. 6 and 13. This means that a vertex has different embedding for each view.
I could not find or understand in experimental section which embedding are you using for link prediction, OR are you combining these different embedding, if yes, then how?

Thanks.

questions about the dataset.

hi, very lucky to study your work, I am curious about your dataset, would you please explain the preprocessed data, and share the meaning of each field.

请问支持节点连边权重吗？

您好：
我看了readme.md，发现里头准备数据一节中没有涉及到边的权重，请问是不支持连边权重的设置吗? 谢谢

大规模图如何处理

想请问一下，对于较大规模异构图，是如何处理的？（主要是前期random walk）现在采样了约80万节点，单机跑不动

Embedding for cold start nodes

The paper mentions a great way to tackle the cold start problem by considering the node attributes while generating the embeddings in case of GATNE-I. So ideally, a node that is not connected to any other node, should also have embedding based on it's features, so that for a new user, we can recommend on the basis of his attributes considering no previous connections.
Can you please point out at the part of code that deals with this problem?
Thank you in advance!

新的nodes

您好，请问这个模型是否要求train set里需要包括所有可能出现的nodes？换句话说，valid set和test set里不可以出现新的nodes？谢谢。

How to get the embedding vector for each node?

hi, very lucky to study your work! I found that the task in your code is to predict link relations between nodes, but I want to get the embedding result of each node. How can I get that?
Thank you.

heterogeneous softmax

请问一下目前的代码是不是在训练的时候没有考虑node type

1.inductive setting里的节点的feature转换矩阵h在论文中带有下标z ，是不是意味着h应该和node type相关，但是代码实现好像对于所有的节点都使用同一个转换矩阵。
2.目前的softmax是不是还是普通的softmax，没有考虑node type，代码中tf.nn.nce_loss函数里num_class对应的是所有的节点数目，对于heterogeneous softmax是不是应该要根据不同的node type，给出不同的num_class值啊。

感谢