awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.

Home Page: https://dglke.dgl.ai/doc/

License: Apache License 2.0

Python 88.01% Shell 3.68% Jupyter Notebook 8.31%

machine-learning knowledge-graph knowledge-graphs-embeddings graph-learning dgl

dgl-ke's Introduction

Documentation

Knowledge graphs (KGs) are data structures that store information about different entities (nodes) and their relations (edges). A common approach to using KGs in machine learning tasks is to compute knowledge graph embeddings. DGL-KE is a high-performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings. The package is implemented on top of the Deep Graph Library (DGL), and developers can run DGL-KE on CPU machines, GPU machines, and clusters with a set of popular models, including TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.

Figure: DGL-KE Overall Architecture

Currently DGL-KE supports three tasks:

Training: trains KG embeddings using dglke_train (single machine) or dglke_dist_train (distributed environment).
Evaluation: reads the pre-trained embeddings and evaluates them with a link prediction task on the test set using dglke_eval.
Inference: reads the pre-trained embeddings and either predicts entity/relation links using dglke_predict or computes embedding similarity using dglke_emb_sim.

A Quick Start

To install the latest version of DGL-KE run:

sudo pip3 install dgl
sudo pip3 install dglke

Train a TransE model on the FB15k dataset by running the following command:

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 500 --log_interval 100 \
--batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --num_thread 1 --num_proc 8

This command downloads the FB15k dataset, trains the TransE model, and saves the trained embeddings to a file.

Performance and Scalability

DGL-KE is designed for learning at scale. It introduces various novel optimizations that accelerate training on knowledge graphs with millions of nodes and billions of edges. Our benchmark on knowledge graphs consisting of over 86M nodes and 338M edges shows that DGL-KE can compute embeddings in 100 minutes on an EC2 instance with 8 GPUs and 30 minutes on an EC2 cluster with 4 machines (48 cores/machine). These results represent a 2×∼5× speedup over the best competing approaches.

Figure: DGL-KE vs GraphVite on FB15k

Figure: DGL-KE vs Pytorch-BigGraph on Freebase

Learn more details with our documentation! If you are interested in the optimizations in DGL-KE, please check out our paper for more details.

Cite

If you use DGL-KE in a scientific publication, we would appreciate citations to the following paper:

@inproceedings{DGL-KE,
author = {Zheng, Da and Song, Xiang and Ma, Chao and Tan, Zeyuan and Ye, Zihao and Dong, Jin and Xiong, Hao and Zhang, Zheng and Karypis, George},
title = {DGL-KE: Training Knowledge Graph Embeddings at Scale},
year = {2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {739–748},
numpages = {10},
series = {SIGIR '20}
}

License

This project is licensed under the Apache-2.0 License.

dgl-ke's Issues

How to install from source

I noticed that the dglke_predict command is not available because pip installing gives me the stable version that doesn't include infer_score and the dglke_predict command.

How can I get around this?

Add Python API

The Python API is convenient for many use cases. It allows more customization and is very friendly for Jupyter Notebook users.

The KGE models cannot achieve good performance for small and dense datasets

It's likely that the default value of exclude_positive for the EdgeSampler is True. This design does not compromise performance on large, sparse graphs. But for small, dense graphs the results can be affected, because the sampled negative edges are more likely to be positive edges.
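
The effect described above can be illustrated with a small, library-independent sketch. It does not use DGL's EdgeSampler; the function name `sample_negative_tails` and its `exclude_positive` argument are hypothetical and only mimic the idea of rejecting sampled negatives that are actually known positive edges.

```python
import random


def sample_negative_tails(head, rel, num_entities, positives, k, rng,
                          exclude_positive=True):
    """Corrupt the tail of (head, rel, ?) until k negatives are collected.

    positives is a set of known (head, rel, tail) triples. When
    exclude_positive is True, candidates that form a known positive
    are rejected and resampled.
    """
    negatives = []
    max_attempts = 100 * k  # guard against graphs with no valid negatives
    attempts = 0
    while len(negatives) < k and attempts < max_attempts:
        attempts += 1
        tail = rng.randrange(num_entities)
        if exclude_positive and (head, rel, tail) in positives:
            continue
        negatives.append((head, rel, tail))
    return negatives


# Tiny dense graph: entity 0 is linked to 3 of the 4 entities.
positives = {(0, 0, 1), (0, 0, 2), (0, 0, 3)}
rng = random.Random(0)
filtered = sample_negative_tails(0, 0, 4, positives, 5, rng, exclude_positive=True)
unfiltered = sample_negative_tails(0, 0, 4, positives, 5, rng, exclude_positive=False)
false_negatives = [t for t in unfiltered if t in positives]
print(len(filtered), len(false_negatives))
```

On a dense graph like this one, most unfiltered samples collide with real edges, while on a large sparse graph such collisions are rare, which is why the filtering choice matters mainly for small datasets.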

Neither dglke_emb_sim nor dglke_predict are appearing in pip3 installed version of dglke

Neither dglke_emb_sim nor dglke_predict are appearing in the pip3 installed version of dglke, even after I try installing in a new environment and after I uninstall all older versions of dglke. Are these methods only available through the github version? If so, when will these additions be pushed to pip3?

(dglke) amruch@wit:~/Projects/AmazonScience/graphs$ sudo pip3 install dglke
The directory '/home/amruch/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/amruch/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting dglke
  Downloading https://files.pythonhosted.org/packages/9a/59/d9571eac71ef5e63784bbf4efa75bbe6803653e04057b774ce043a1b65e3/dglke-0.1.0-py3-none-any.whl (59kB)
    100% |████████████████████████████████| 61kB 885kB/s
Requirement already satisfied: setuptools in /home/amruch/.local/lib/python3.6/site-packages (from dglke)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from dglke)
Installing collected packages: dglke
Successfully installed dglke-0.1.0
(dglke) amruch@wit:~/Projects/AmazonScience/graphs$ dglke_
dglke_client      dglke_convert     dglke_dist_train  dglke_eval        dglke_partition   dglke_server      dglke_train

[Question] Implementing my own embedding in DGL-KE

Hello DGL-KE people!

I read the arxiv 2020 paper and I found it very interesting. It is great that the project is open source.

I am wondering if I can implement my own graph embedding algorithm on top of your system, which focuses only on knowledge graph embeddings. For example, can I implement DeepWalk in DGL-KE?

If yes, do you provide abstractions for defining random-walk operations in DGL-KE?
If no, is it possible to build such abstractions in your system?

Thanks in advance.

Best,
Makis

'std::bad_alloc' error when evaluating a dataset with large number of entities.

Hello! I'm running into this 'std::bad_alloc' error:

|test|: 17248443
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

I split my dataset into training, validation, and test datasets. I first used dglke_train and passed these three files to --data_files. It finished training successfully. But when I ran dglke_eval with these three files, it yielded this error.

I'm pretty sure I have enough space on the machine. Do you know what could be the possible problem? Also, I'm confused by the command line arguments --dataset and --data_files of dglke_eval. What's the usage of --dataset when running my own dataset? Should I pass the same files to --data_files for evaluation as those for training?

udd_* input format needs a more specific description

Add more info: entity IDs should start from 0 and be contiguous.
Also document the other limitations.
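
A check for the ID requirement above could look like the following sketch. The helper name `check_contiguous_ids` is hypothetical and not part of DGL-KE; it only verifies that a list of integer IDs is exactly 0..n-1 with no duplicates.

```python
def check_contiguous_ids(ids):
    """Return a list of problems; an empty list means ids are exactly 0..n-1."""
    problems = []
    seen = set()
    for i in ids:
        if i in seen:
            problems.append(f"duplicate id {i}")
        seen.add(i)
    if seen:
        if min(seen) != 0:
            problems.append(f"ids start at {min(seen)}, expected 0")
        missing = sorted(set(range(max(seen) + 1)) - seen)
        if missing:
            problems.append(f"missing ids, first few: {missing[:5]}")
    return problems


print(check_contiguous_ids([0, 1, 2]))  # []
print(check_contiguous_ids([0, 2]))     # ['missing ids, first few: [1]']
```

Running such a check before training gives a readable message instead of a failure deep inside the data loader.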

model.predict_score(dt.g) error

When calling model.predict_score(dt.g), I receive the following error:

/opt/conda/lib/python3.7/site-packages/dglke/models/pytorch/score_fun.py in <lambda>(edges)
    306
    307     def forward(self, g):
--> 308         g.apply_edges(lambda edges: self.edge_func(edges))
    309
    310     def create_neg(self, neg_head):

/opt/conda/lib/python3.7/site-packages/dglke/models/pytorch/score_fun.py in edge_func(self, edges)
    274
    275     def edge_func(self, edges):
--> 276         real_head, img_head = th.chunk(edges.src['emb'], 2, dim=-1)
    277         real_tail, img_tail = th.chunk(edges.dst['emb'], 2, dim=-1)
    278         real_rel, img_rel = th.chunk(edges.data['emb'], 2, dim=-1)

/opt/conda/lib/python3.7/site-packages/dgl/utils.py in __getitem__(self, key)
    282     def __getitem__(self, key):
    283         if key not in self._keys:
--> 284             raise KeyError(key)
    285         return self._fn(key)
    286

KeyError: 'emb'

Is dgl-ke able to resume training?

Say I trained a graph but found it didn't reach the peak MRR. Can I take the embeddings I have now and continue training on them or do I have to start over?

Ranking measure issue in general_models.py forward_test()

Hi,
I found that in python/dglke/models/general_models.py, as referenced below:

dgl-ke/python/dglke/models/general_models.py

Line 455 in ff4da0b

rankings = F.sum(neg_scores >= pos_scores, dim=1) + 1

The line rankings = F.sum(neg_scores >= pos_scores, dim=1) + 1 seems to exclude the positive sample itself from the top-1 hits when it is compared with itself, which causes HITS@1 to be almost always near 0%.
I suppose that ">=" should be changed to ">", like below:
rankings = F.sum(neg_scores > pos_scores, dim=1) + 1

Is that right?
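
The difference between the two comparisons can be checked with a small worked example in plain Python. This is only an illustration of the arithmetic; whether the candidate list in forward_test() actually contains the positive triple is not established here, and the helper `rank` is hypothetical.

```python
def rank(pos_score, neg_scores, strict):
    """1-based rank of the positive among the negatives (higher is better)."""
    if strict:
        beaten_by = sum(1 for s in neg_scores if s > pos_score)
    else:
        beaten_by = sum(1 for s in neg_scores if s >= pos_score)
    return beaten_by + 1


pos = 0.9
negs_without_pos = [0.5, 0.7, 0.2]
negs_with_pos = [0.5, 0.9, 0.7, 0.2]  # candidate list that still contains the positive

print(rank(pos, negs_without_pos, strict=False))  # 1
print(rank(pos, negs_with_pos, strict=False))     # 2
print(rank(pos, negs_with_pos, strict=True))      # 1
```

With ">=", a positive that appears among its own candidates can never reach rank 1. With ">", ties are resolved in favour of the positive, which is optimistic when many negatives share the same score.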

Literal embedding and Graph-BERT support?

Hi guys, this looks like a great start to a powerful knowledge graph embedding library! Thanks for sharing it!
My question is: many practical applications involve knowledge graphs with a mixture of structured data (the knowledge graph itself) and unstructured literal data (attributes like description, name, date, etc.). Do you have any plan to support literal-enhanced embeddings like LiteralE as well? Thanks.
The other interesting development is Graph-BERT, where attention on a local subgraph is used instead of GCN. It seems to be more scalable. Will you consider supporting it?
P.S. Both algorithms already have their source code available.

Training process gets killed if --valid is turned on

When --valid / --test is turned on, the training process will get killed during evaluation. To solve this, --batch_size_eval and --neg_sample_size_eval should be the same value.

Something wrong with RotatE score function

Thanks for contributing such wonderful work! There is an error when I run RotatE using command "DGLBAKEND=pytorch dglke_train --model_name RotatE", which shows:
File ".../dglke/models/pytorch/score_fun.py", line 423, in edge_func
re_score = re_head * re_rel - im_head * im_rel
RuntimeError: The size of tensor a (200) must match the size of tensor b (400) at non-singleton dimension 1
The RotatE paper only rotates the head, and this implementation seems to differ, so I don't know how to fix it. I wonder if you could help me.
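
The 200 vs. 400 mismatch in the error is consistent with RotatE's layout, where each entity is a complex vector (real and imaginary halves, so twice the width) and each relation is a vector of phase angles. The sketch below is a plain-Python rendering of that score under those assumptions, not DGL-KE's implementation; `rotate_score` is a hypothetical name. The project's RotatE example commands pass a `-de` flag, so it may be worth checking whether that option applies here.

```python
import cmath
import math


def rotate_score(head, rel_phase, tail, gamma):
    """RotatE-style score: gamma - sum_i |h_i * exp(i * theta_i) - t_i|.

    head and tail are lists of 2*d floats (d real parts followed by d
    imaginary parts); rel_phase is a list of d phase angles.
    """
    d = len(rel_phase)
    if len(head) != 2 * d or len(tail) != 2 * d:
        raise ValueError(
            f"entity dim must be twice the relation dim, got "
            f"{len(head)}/{len(tail)} vs {d}"
        )
    dist = 0.0
    for i in range(d):
        h = complex(head[i], head[d + i])
        t = complex(tail[i], tail[d + i])
        r = cmath.exp(1j * rel_phase[i])  # unit-modulus rotation
        dist += abs(h * r - t)
    return gamma - dist


# Rotating (1, 0) by 90 degrees gives (0, 1), so the distance is close to 0.
score = rotate_score([1.0, 0.0], [math.pi / 2], [0.0, 1.0], gamma=12.0)
print(round(score, 6))
```

If the entity and relation tensors are allocated with the same width, a shape error like the one quoted is what this layout would produce.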

Could DGL-KE be applied to undirected graph?

Hi,

This is a great project, and our team has recently tried running some demo programs.
Now I have a question. TransE, for example, is mostly used for heterogeneous graphs that contain different relations, but could TransE be applied to an undirected graph? An undirected graph contains only one kind of relation, and I think it is a special case of a heterogeneous graph, so I believe TransE could be used here. But how does its performance compare with embedding algorithms designed only for undirected graphs (node2vec, for instance)? TransE would create an embedding vector for the relation even though there is only one kind of relation, and this relation embedding is useless for an undirected graph.
Thanks.

Check the correctness of node/relation mappings if users provide such mappings.

DGL-KE needs to verify the mappings and report more informative errors if the input data fails the checks.

Why is DGL-KE much faster than GraphVite and PBG?

According to the Performance and Scalability section of the README, DGL-KE is much faster than GraphVite and PBG, even with a single GPU. Since GraphVite also uses GPUs to speed up training, what are the main reasons for these improvements?

[Question] --batch_size_eval

Is it possible to set this flag to all edges? I mean not by specifying the exact number, but some flags like "all_edges."

Question about TransE HITS@10 on FB15K dataset

Hi, I found that your HITS@10 result for TransE is 0.8x, but the original paper reports 0.4x. I am puzzled by this.

Option to manually set random seed globally

Hi!

Thanks for this awesome package! I'm wondering if there is any option to fix the random seed so I can reproduce the same results across different training runs. Currently I manually set the random seeds for PyTorch and NumPy in train_pytorch.py and dataloader/sampler.py, but the final output embeddings of multiple training attempts are still different. Is there any workaround for this?

Thanks for any help in advance.
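
For reference, a common seeding helper looks like the sketch below. It is not a DGL-KE option, and `seed_everything` is a hypothetical name. It seeds Python's `random`, and NumPy and PyTorch when they are importable. Even with all of these seeded, multi-process training with asynchronous updates and samplers implemented outside Python may remain nondeterministic; that part is an assumption about the cause, not a confirmed diagnosis.

```python
import os
import random


def seed_everything(seed):
    """Seed the RNGs that are importable in this environment."""
    seeded = ["random"]
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        seeded.append("torch")
    except ImportError:
        pass
    return seeded


seed_everything(7)
first = random.random()
seed_everything(7)
second = random.random()
print(first == second)  # True
```

Each worker process would need to call such a helper itself, ideally with a seed derived from its rank, since seeds set in the parent are not automatically meaningful in spawned workers.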

Add visualization

It's useful to plot the entities after training their embeddings. This helps us verify the training results.

Add the ranking loss

Currently, the KGE models are trained with the logistic loss. However, it's desirable to train the models with a pair-wise ranking loss because in a lot of cases negative edges can potentially be missing edges. The logistic loss cannot handle this case.
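
As a point of comparison, a pair-wise margin ranking loss can be sketched in a few lines of plain Python. This is the generic hinge formulation, not code from DGL-KE; the name `pairwise_margin_loss` and the default margin are illustrative assumptions.

```python
def pairwise_margin_loss(pos_score, neg_scores, margin=1.0):
    """Mean hinge loss max(0, margin - pos + neg) over the negatives."""
    losses = [max(0.0, margin - pos_score + s) for s in neg_scores]
    return sum(losses) / len(losses)


# The first negative is already separated by more than the margin and
# contributes nothing; the second one is too close and is penalised.
print(pairwise_margin_loss(2.0, [0.0, 1.5], margin=1.0))  # 0.25
```

Unlike a logistic loss, this only asks that the positive outrank each negative by the margin, so a negative that is really a missing edge is not pushed toward an absolute "false" label.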

support few-shot link prediction

Few-shot link prediction proposed by this paper seems a useful technique to support in DGL-KE.

Doubt in score function

Thanks for this awesome library!
I have a confusion in the score function for the TransE model which the library uses. The edge score is given in https://github.com/awslabs/dgl-ke/blob/master/python/dglke/models/pytorch/score_fun.py#L59 as (gamma - norm(h+r-t)). In the forward() method of the KE model https://github.com/awslabs/dgl-ke/blob/master/python/dglke/models/general_models.py#L346, the final score is the sum of logsigmoid of positive score and mean of logsigmoid of negative score. Could you please point me to a paper that uses this particular loss function. It seems to be a combination of margin based loss and logistic loss.
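
The two pieces described in this question can be written out as a plain-Python sketch, which may make the combination easier to see. It is not the library's code: the function names are hypothetical, and averaging the positive and negative terms is an assumption (it does agree with the training logs quoted on this page, where the average loss is the mean of pos_loss and neg_loss). A loss of this shape, a margin inside a log-sigmoid with negative sampling, also appears in the RotatE paper.

```python
import math


def transe_score(h, r, t, gamma, p=2):
    """gamma - ||h + r - t||_p for equal-length lists of floats."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    if p == 1:
        norm = sum(diffs)
    else:
        norm = math.sqrt(sum(d * d for d in diffs))
    return gamma - norm


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def logistic_loss(pos_score, neg_scores):
    """Average of -log sigmoid(pos) and the mean of -log sigmoid(-neg)."""
    pos_loss = -log_sigmoid(pos_score)
    neg_loss = -sum(log_sigmoid(-s) for s in neg_scores) / len(neg_scores)
    return (pos_loss + neg_loss) / 2


# A perfect translation h + r = t scores exactly gamma.
print(transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], gamma=19.9))  # 19.9
```

Here gamma shifts the distance so that well-fitting triples get positive scores and poorly fitting ones get negative scores before the sigmoid is applied.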

Error happened when changing built-in dataset into my own

First, I ran the following command, which completed successfully:
dglke_train --model_name TransE_l1 --dataset FB15k ......

Since FB15k has already been downloaded, I want to train directly without downloading it again. So I changed --dataset FB15k into:
dglke_train --model_name TransE_l1 --data_path ./data/FB15k/ --data_files entities.dict relations.dict train.txt valid.txt test.txt --format udd_hrt .....

However, the terminal gives me an error:

Using backend: pytorch
Traceback (most recent call last):
  File "/home/hjhuang/anaconda3/envs/dgl/bin/dglke_train", line 33, in <module>
    sys.exit(load_entry_point('dglke==0.1.0.dev0', 'console_scripts', 'dglke_train')())
  File "/home/hjhuang/anaconda3/envs/dgl/lib/python3.6/site-packages/dglke-0.1.0.dev0-py3.6.egg/dglke/train.py", line 81, in main
  File "/home/hjhuang/anaconda3/envs/dgl/lib/python3.6/site-packages/dglke-0.1.0.dev0-py3.6.egg/dglke/dataloader/KGDataset.py", line 603, in get_dataset
AssertionError: You should provide the dataset name for raw_udd format.

Could I make it report MRR every epoch?

I would like to see the curve of MRR during the training process so that I can get the peak MRR, instead of the MRR at the end of the training.

support different delimiter in the input data

Users' input data may have different delimiters. We should allow users to specify the delimiter.
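
A delimiter-aware reader is straightforward with the standard library, as in the sketch below. The function `read_triples` is hypothetical and only shows the idea of passing the delimiter through to the parser; it is not the loader used by DGL-KE.

```python
import csv
import io


def read_triples(stream, delimiter="\t"):
    """Yield (head, relation, tail) string triples from a delimited stream."""
    for row in csv.reader(stream, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        yield tuple(row)


sample = io.StringIO("a,likes,b\nb,knows,c\n")
triples = list(read_triples(sample, delimiter=","))
print(triples)  # [('a', 'likes', 'b'), ('b', 'knows', 'c')]
```

Using `csv.reader` rather than `str.split` also handles quoted fields that contain the delimiter.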

no entity 2 id data

When I tried to run predict (not from the command line), the model required entity.dict data that maps an entity name to its ID. However, there is no such data in the package. Can you please provide the entity-to-ID data? Thanks.

justification for choice of --neg_sample_size_eval

I understand that when working with larger KGs, it's convenient to set a neg_sample_size_eval that's smaller than num_entities in order to speed up link prediction in evaluation.

However, having a smaller neg_sample_size_eval also means potentially inflating link prediction results (if I evaluate each positive triple against only 1000 negative triples among 200k potential negative samples, the model will likely get higher evaluation metrics). So there's a tradeoff between faster evaluation and less biased evaluation metrics.

How do you deal with this tradeoff and justify using a certain value for the hyperparameter? Do we just assume that neg_sample_size_eval of 10000 is sufficient to approximate the true metrics?
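
The size of the inflation can be estimated with a short calculation. Assuming negatives are sampled uniformly without replacement, each negative that outranks the positive in the full ranking survives into the sample with probability k/N, so the expected sampled rank is 1 + (r - 1) * k / N. The helper below is a hypothetical illustration of that formula, not part of DGL-KE.

```python
def expected_sampled_rank(full_rank, num_candidates, num_sampled):
    """Expected rank when only num_sampled of num_candidates negatives are scored.

    full_rank - 1 negatives outscore the positive in the full ranking; a
    uniform sample keeps each of them with probability
    num_sampled / num_candidates.
    """
    better = full_rank - 1
    return 1 + better * num_sampled / num_candidates


# A triple ranked 401st among 200,000 negatives, evaluated against 1,000 of them.
print(expected_sampled_rank(401, 200_000, 1_000))  # 3.0
```

In this example a triple that would miss HITS@10 by a wide margin under full evaluation is expected to land in the top 3, so metrics computed with different sample sizes are not directly comparable.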

killed without trace back

Hello!

When using the framework, I encountered the following problem:
This process was automatically killed, and the error report did not provide effective information, so I hope you can help.

Thanks in advance

(dglke_env) YCHABOT@25f55d9cb3bd:~/WikiData$  dglke_train --model_name TransE_l2 --data_path ~/WikiData/ \
> --dataset wikidata_input_nohead.csv --delimiter , \
> --data_files ~/WikiData/wikidata_input_nohead.csv ~/WikiData/wikidata_test.csv ~/WikiData/wikidata_valid.csv \
> --format raw_udd_hrt --batch_size 512 --log_interval 1000 --neg_sample_size 25600 --batch_size_eval 25600 \
> --regularization_coef=1e-9 --hidden_dim 300 --gamma 19.9 --lr 0.25 --batch_size_eval 16 --test -adv \
> --gpu 0 1 2 3 --max_step 6000 --async_update
Reading train triples....
Finished. Read 491065553 train triples.
Reading valid triples....
Finished. Read 70000 valid triples.
Reading test triples....
Finished. Read 70000 test triples.
|Train|: 491065553
random partition 491065553 edges into 4 parts
part 0 has 122766389 edges
part 1 has 122766389 edges
part 2 has 122766389 edges
part 3 has 122766386 edges
/home/YCHABOT/anaconda3/envs/dglke_env/lib/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
|valid|: 70000
|test|: 70000
Killed

Evaluation scheme for larger KBs

In the paper, section 5.3, it is mentioned that for the entire Freebase KB, you use a modified evaluation strategy in which

we use only 2000 negative triplets; 1000 sampled uniformly from the entire set of negative
samples and 1000 sampled proportionally to the degree of the corrupted entities;

Is this evaluation supported in the dgl-ke library? If so, which arguments can be used to enable this?
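
The strategy quoted from the paper can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not DGL-KE's evaluation code: the function name is hypothetical, sampling is with replacement, and known positives are not filtered out.

```python
import random
from collections import Counter


def sample_eval_negatives(triples, num_entities, n_uniform, n_degree, rng):
    """Return candidate entity ids: n_uniform drawn uniformly plus
    n_degree drawn with probability proportional to entity degree."""
    degree = Counter()
    for h, _, t in triples:
        degree[h] += 1
        degree[t] += 1
    entities = list(range(num_entities))
    weights = [degree[e] for e in entities]
    uniform = [rng.randrange(num_entities) for _ in range(n_uniform)]
    by_degree = rng.choices(entities, weights=weights, k=n_degree)
    return uniform + by_degree


# Entity 4 has degree 0, so it can only appear in the uniform part.
triples = [(0, 0, 1), (0, 0, 2), (0, 0, 3)]
candidates = sample_eval_negatives(triples, 5, 10, 50, random.Random(0))
print(len(candidates))  # 60
```

Degree-proportional sampling favours popular entities, which tend to be harder negatives, so mixing both halves gives a tougher test than uniform sampling alone.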

Setup-issue-Windows-10

Environment:
Windows -10 (only CPU)
torch: 1.5.0
dgl: 0.4.3
dgl-ke: built from source code.

Encountered below error while trying to execute a command given in the tutorial.

dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 500 --log_interval 100
--batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --num_thread 1 --num_proc 8

Logs are being recorded at: ckpts\TransE_l2_FB15k_3\train.log
Reading train triples....
Finished. Read 483142 train triples.
Reading valid triples....
Finished. Read 50000 valid triples.
Reading test triples....
Finished. Read 59071 test triples.
|Train|: 483142
random partition 483142 edges into 8 parts
part 0 has 60393 edges
part 1 has 60393 edges
part 2 has 60393 edges
part 3 has 60393 edges
part 4 has 60393 edges
part 5 has 60393 edges
part 6 has 60393 edges
part 7 has 60391 edges

Using backend: pytorch
C:\Users\riz\Anaconda3\lib\site-packages\dgl\base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
Traceback (most recent call last):
  File "C:\Users\riz\Anaconda3\Scripts\dglke_train-script.py", line 11, in <module>
    load_entry_point('dglke==0.1.0.dev0', 'console_scripts', 'dglke_train')()
  File "C:\Users\riz\Anaconda3\lib\site-packages\dglke-0.1.0.dev0-py3.7.egg\dglke\train.py", line 129, in main
  File "C:\Users\riz\Anaconda3\lib\site-packages\dglke-0.1.0.dev0-py3.7.egg\dglke\dataloader\sampler.py", line 368, in create_sampler
  File "C:\Users\riz\Anaconda3\lib\site-packages\dgl\contrib\sampling\sampler.py", line 660, in __init__
    self._seed_edges = utils.toindex(self._seed_edges)
  File "C:\Users\riz\Anaconda3\lib\site-packages\dgl\utils.py", line 242, in toindex
    return data if isinstance(data, Index) else Index(data)
  File "C:\Users\riz\Anaconda3\lib\site-packages\dgl\utils.py", line 15, in __init__
    self._initialize_data(data)
  File "C:\Users\riz\Anaconda3\lib\site-packages\dgl\utils.py", line 22, in _initialize_data
    self._dispatch(data)
  File "C:\Users\riz\Anaconda3\lib\site-packages\dgl\utils.py", line 47, in _dispatch
    raise DGLError('Index data must be an int64 vector, but got: %s' % str(data))
dgl._ffi.base.DGLError: Index data must be an int64 vector, but got: tensor([ 68037, 423679, 381929, ..., 440877, 26144, 464339],
       dtype=torch.int32)

I am not sure whether it's a bug or I missed something in the setup. Can someone please help me to resolve this?

Advice on Limiting Memory

It doesn't seem like there are a lot of options for limiting memory consumption in dgl-ke at the moment, so I was wondering if you have any suggestions for my problem. Presently, my model is running out of ram at

[proc 0][Train](12000/12000) average regularization: 0.00017675260825490114
[proc 0][Train] 1000 steps take 12.623 seconds
[proc 0]sample: 2.133, forward: 5.806, backward: 2.516, update: 2.070
proc 0 takes 161.118 seconds
training takes 162.84567785263062 seconds
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
    raise exception.__class__(trace)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dglke/train_pytorch.py", line 238, in test_mp
    test(args, model, test_samplers, rank, mode, queue)
  File "/usr/local/lib/python3.6/dist-packages/dglke/train_pytorch.py", line 214, in test
    model.forward_test(pos_g, neg_g, logs, gpu_id)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/general_models.py", line 321, in forward_test
    neg_deg_sample=self.args.neg_deg_sample_eval)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/general_models.py", line 243, in predict_neg_score
    neg_head = self.entity_emb(neg_head_ids, gpu_id, trace)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 203, in __call__
    s = self.emb[idx]
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 16751475200 bytes. Error code 12 (Cannot allocate memory)

The above was run with

DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv \
--format udd_hrt \
--model_name ComplEx \
--max_step 50000 --batch_size 1000 --neg_sample_size 200 --batch_size_eval 16 \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --regularization_coef=1e-9 -adv \
--gpu 0 1 --async_update --force_sync_interval 1000 --log_interval 1000 \
--test

And has the following number of canonical tuples

Reading train triples....
Finished. Read 91802780 train triples.
Reading valid triples....
Finished. Read 10200309 valid triples.
Reading test triples....
Finished. Read 11333677 test triples.
|Train|: 91802780
random partition 91802780 edges into 2 parts
part 0 has 45901390 edges
part 1 has 45901390 edges

My machine has two 1080 Ti GPUs and 128GB of RAM. So this pretty much used up all the RAM right away, which is odd because the graphvite run on this knowledge graph finished fine (but took ~8 hours).
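
The size of the failed allocation can be decoded with some arithmetic. Assuming float32 embeddings of width 400 (from --hidden_dim 400), the 16,751,475,200 bytes correspond to exactly 10,469,672 embedding rows gathered in one step. Reading that number as the entity count, with every evaluation batch scoring against all entities, is an assumption; the post does not state the number of entities.

```python
def embedding_bytes(num_rows, dim, bytes_per_value=4):
    """Bytes needed to materialise num_rows embeddings of width dim."""
    return num_rows * dim * bytes_per_value


failed_alloc = 16_751_475_200
rows = failed_alloc // (400 * 4)  # --hidden_dim 400, float32 assumed
print(rows)                                        # 10469672
print(embedding_bytes(rows, 400) == failed_alloc)  # True
print(round(failed_alloc / 2**30, 1))              # 15.6 (GiB)
```

If that reading is right, each evaluation process needs roughly 15.6 GiB just for the negative embeddings, so fewer evaluation processes or fewer negatives per positive (for example via the --neg_sample_size_eval option mentioned in other issues here) would reduce the footprint proportionally.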

Reference RGCN implementation for multi-gpu training

Hi,
Is there any reference implementation of RGCN on knowledge graphs for multi-gpu training in DGL-KE?
Alternatively, in DGL, I can see some reference RGCN implementation for link prediction task, but I think that is not multi-gpu (may I confirm if that is correct?). Do you have some suggestions on how I can run RGCN on a multi-gpu setting?
Thanks for some pointers on this issue!

How to do negative sampling with type constraints?

Hi, thanks for putting this library together. I will put a feature request together in a similar format to the dgl repo:

🚀 Feature

Negative sampling with type constraints in dgl.contrib.sampling.EdgeSampler (via dataloader.sampler.TrainDataset).

Motivation

When using EdgeSampler to sample negative edges in knowledge graph link prediction, it would be useful to incorporate domain-specific type constraints. For example, edges (relations) in a KG are often typed (only specific entity types can slot into the head or tail entities), so an EdgeSampler that only samples negative edges by selecting head/tail nodes from a subset of all possible entities would greatly help.

Alternatives

One idea I had was to create different EdgeSampler objects for relations and then batch the graph based on relations. That way when sampling a mini-batch we are guaranteed that all facts in the batch have the same relation type and can apply the same EdgeSampler object to get negative samples. But it seems doing this requires diving into the C++ sampler code.

Another alternative is a two-step sampling procedure in training where I first a) sample positive edges only without replacement and then b) based on the relation types in the positive edges, sample negative edges from the specific EdgeSampler with replacement. This seems to be cleaner but also somewhat inefficient. Are there other disadvantages to this?

Any guidance and tips on how best to implement this would be great. I'd be happy to contribute it back to the repo.

Pitch

Similar functionality to how type constraints work in OpenKE.

Input parameters?

What's your input argument to get the result shown in /examples/README.md, on the FB15K dataset?

Add advanced hyperparameter tuning

The technique described in the paper "AutoNE: Hyperparameter Optimization for Massive Network Embedding" is interesting. Similar techniques should be incorporated into DGL-KE to tune hyperparameters on large knowledge graphs effectively.

The Bin Path Issue in dist_train.py line 79

Hi,
I noticed that in distributed training, when launching the kvstore server and client, the bin path seems to be fixed to "/usr/local/bin:/bin:/usr/bin:/sbin/" in dglke/dist_train.py, as shown below:

dgl-ke/python/dglke/dist_train.py

Line 79 in ad4be69

os.environ['PATH'] = '/usr/local/bin:/bin:/usr/bin:/sbin/'

However, if the user's development environment is conda, this may cause problems.
I use conda as my Python environment, so the two executables, dglke_server and dglke_client, are automatically located in the "/root/miniconda3/bin/" directory, which means they can't be launched correctly.

I suggest that the path setting include some code to find the right path of dglke's executables, such as dglke_server.

Is that right?
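
One way to avoid the problem described above is to extend the existing PATH instead of overwriting it, and to resolve executables through the caller's environment. The sketch below uses only the standard library; `extended_path` and `resolve_tool` are hypothetical helpers, not a patch to dist_train.py.

```python
import os
import shutil


def extended_path(current_path, extra_dirs):
    """Append extra_dirs to current_path without dropping existing entries."""
    parts = [p for p in current_path.split(os.pathsep) if p]
    for d in extra_dirs:
        if d not in parts:
            parts.append(d)
    return os.pathsep.join(parts)


def resolve_tool(name):
    """Absolute path of an executable on the current PATH, or None."""
    return shutil.which(name)


# A conda bin directory stays first; the system directories are only appended.
conda_first = os.pathsep.join(["/root/miniconda3/bin", "/usr/bin"])
print(extended_path(conda_first, ["/usr/local/bin", "/usr/bin"]))
```

With this approach, executables installed into a conda or virtualenv bin directory remain discoverable, and `resolve_tool("dglke_server")` can be used to fail early with a clear message when the launcher is missing.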

The comparison of speed in the documentation

I would like to know more details about the time cost. Is it the time the model takes to train each epoch?

Option to specify input format using column indices

Allow users to directly specify the relevant column indices of the input files (e.g. triplets_column_indices=[1, 0, 2]).
Currently you have to specify the format (htr, rht, etc.), which is converted internally by _parse_srd_format to [0,1,2], [1,0,2], etc.
The advantage of specifying this directly is that it would also allow input files with unused columns (such as qualifiers or sources).

It would also be great if this is possible for the id mapping files.
The dataset that I want to use has the columns: property_id, en_label, en_description.
This cannot be loaded with the code from this pull request, since the label and id are in the wrong order, and there is an unused column.
Specifying something like relations_map_column_indices=[1,0] would be very convenient.
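
The requested behaviour amounts to indexing each row by a user-supplied list, as in this sketch. `select_columns` is a hypothetical helper, and the sample row is made up to match the column layout described above (property_id, en_label, en_description).

```python
def select_columns(row, indices):
    """Pick fields from row in the order given by indices, ignoring the rest."""
    try:
        return tuple(row[i] for i in indices)
    except IndexError:
        raise ValueError(
            f"row has {len(row)} columns, need index {max(indices)}"
        ) from None


# Relation file with columns: property_id, en_label, en_description.
row = ["P31", "instance of", "that class of which this subject is an example"]
print(select_columns(row, [1, 0]))  # ('instance of', 'P31')
```

The same function covers triple files: indices [1, 0, 2] reorder the columns, and any column not listed (qualifiers, sources, descriptions) is simply skipped.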

eval.log output

What's the output of eval.log supposed to be? When I run this on my user-defined knowledge graph, it just saves an empty text file.

Here's an example of the script I run for eval:

DGLBACKEND=pytorch dglke_eval \
--data_path results_SXSW_2018 --dataset SXSW2018 \
--data_files entities.tsv relations.tsv all_ctups_40.tsv valid.tsv test.tsv --format udd_hrt \
--model_name ComplEx \
--hidden_dim 128 --gamma 128 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 \
--batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
--model_path /home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200511/

The dataset name in conf is wrong when users use their own dataset.

Please refer to this for more details: #84 (comment)

[Roadmap] v0.1.1 release plan

v0.1.1 Release

Although it is a minor release, we will introduce several interesting features:

Feature enhancement

Offline inference support
Support different delimiters for UDD

Fix bugs

Issue #85
Issue #99
Issue #97

Documentation enhancement

Installation Guidelines
Knowledge graphs and embeddings
DGL-KE Toolkit Command Line Interface
DGL-KE Input/output format
Tips of Training and Evaluating Large Dataset

Add documents for command line arguments.

We need to explain the arguments of commands.

Different test results by using dglke_eval

Hi there,

I found I couldn't reproduce the test results after training.

After training, I got:

-------------- Test result --------------
Test average MRR : 0.2607070246959733
Test average MR : 847.37625
Test average HITS@1 : 0.16041666666666668
Test average HITS@3 : 0.2991666666666667
Test average HITS@10 : 0.45958333333333334

But when I use the dglke_eval command to evaluate the saved embeddings, I got:

-------------- Test result --------------
Test average MRR: 0.09140389248502755
Test average MR: 5974.63
Test average HITS@1: 0.034583333333333334
Test average HITS@3: 0.08208333333333333
Test average HITS@10: 0.21875

The command I use is:

dglke_eval --model_name RotatE --dataset Mydata --hidden_dim 200 --gamma 12.0 --batch_size_eval 16
--gpu 0 1 2 3 4 5 6 7 --model_path ./ckpts/Mydata/RotatE_Mydata_0 --data_path ./Mydata/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt

Could you please help me figure out this?

Besides, I also encountered an out-of-memory issue on a larger dataset using the dglke_eval command, but it works fine on the same number of GPUs using dglke_train.

Thanks.

Triplet classification

Hello,

Will triplet classification be implemented as one of the evaluation tasks?
How would you recommend one go about implementing it using dglke?

Training stops without error

I'm trying to find an optimal set of parameters by running dglke_train with various sets of parameters (randomly sampled), and on the first instance it keeps freezing at the same iteration.

I'm running the test through the exclamation (!) mode in a Jupyter notebook, so I can loop through different sampled parameters.

Repeating the test

I'm using a custom dataset but these are the parameters:

model = TransE_l1
LOG_INTERVAL=1000
BATCH_SIZE=1000
BATCH_SIZE_EVAL=16
NEG_SAMPLE_SIZE=200
NEG_SAMPLE_SIZE_EVAL=100000
LR= 0.1 
-adv= True 
hidden_dim= 50
regularization_coef= 2e-08
gamma= 10
neg_deg_sample=False

More info

Here are the last few steps before it freezes (I waited 10 minutes before cancelling it):

[proc 0][Train] 1000 steps take 8.256 seconds
[proc 0]sample: 1.353, forward: 4.006, backward: 1.711, update: 1.175
[proc 0][Train](35000/60000) average pos_loss: 0.19853664480149746
[proc 0][Train](35000/60000) average neg_loss: 0.2785924620358273
[proc 0][Train](35000/60000) average loss: 0.23856455320119857
[proc 0][Train](35000/60000) average regularization: 0.00012192570248589618
[proc 0][Train] 1000 steps take 8.269 seconds
[proc 0]sample: 1.278, forward: 3.986, backward: 1.712, update: 1.283
[proc 0][Train](36000/60000) average pos_loss: 0.19503579252958297
[proc 0][Train](36000/60000) average neg_loss: 0.27933850078843536
[proc 0][Train](36000/60000) average loss: 0.23718714690953493
[proc 0][Train](36000/60000) average regularization: 0.00012245436408556996
[proc 0][Train] 1000 steps take 8.305 seconds
[proc 0]sample: 1.346, forward: 4.012, backward: 1.712, update: 1.224
[proc 0][Train](37000/60000) average pos_loss: 0.19615361012518406
[proc 0][Train](37000/60000) average neg_loss: 0.27748048058338465
[proc 0][Train](37000/60000) average loss: 0.2368170451670885
[proc 0][Train](37000/60000) average regularization: 0.00012362484454206423
[proc 0][Train] 1000 steps take 8.305 seconds
[proc 0]sample: 1.270, forward: 3.999, backward: 1.733, update: 1.293
[proc 0][Train](38000/60000) average pos_loss: 0.19601027159392834
[proc 0][Train](38000/60000) average neg_loss: 0.2794102805918083
[proc 0][Train](38000/60000) average loss: 0.23771027632802724
[proc 0][Train](38000/60000) average regularization: 0.00012375975443137578
[proc 0][Train] 1000 steps take 8.283 seconds
[proc 0]sample: 1.310, forward: 3.903, backward: 1.766, update: 1.294
[proc 0][Train](39000/60000) average pos_loss: 0.19360717238485814
[proc 0][Train](39000/60000) average neg_loss: 0.2766080161612481
[proc 0][Train](39000/60000) average loss: 0.23510759409517049
[proc 0][Train](39000/60000) average regularization: 0.0001251919507922139
[proc 0][Train] 1000 steps take 8.287 seconds
[proc 0]sample: 1.269, forward: 3.998, backward: 1.742, update: 1.268
[proc 0][Train](40000/60000) average pos_loss: 0.19862385678291322
[proc 0][Train](40000/60000) average neg_loss: 0.279490821111016
[proc 0][Train](40000/60000) average loss: 0.2390573388412595
[proc 0][Train](40000/60000) average regularization: 0.00012537073031126055
[proc 0][Train] 1000 steps take 8.190 seconds
[proc 0]sample: 1.236, forward: 3.902, backward: 1.749, update: 1.293
[proc 0][Train](41000/60000) average pos_loss: 0.19015826864540578
[proc 0][Train](41000/60000) average neg_loss: 0.27666417042165997
[proc 0][Train](41000/60000) average loss: 0.23341121918708085
[proc 0][Train](41000/60000) average regularization: 0.00012650544225471094
[proc 0][Train] 1000 steps take 8.237 seconds
[proc 0]sample: 1.311, forward: 3.908, backward: 1.717, update: 1.291
[proc 0][Train](42000/60000) average pos_loss: 0.19738745559751988
[proc 0][Train](42000/60000) average neg_loss: 0.279010270354338
[proc 0][Train](42000/60000) average loss: 0.23819886273890734
[proc 0][Train](42000/60000) average regularization: 0.0001268844535225071
[proc 0][Train] 1000 steps take 8.367 seconds
[proc 0]sample: 1.301, forward: 4.038, backward: 1.755, update: 1.263
[proc 0][Train](43000/60000) average pos_loss: 0.19044273269176484
[proc 0][Train](43000/60000) average neg_loss: 0.2760635534534231
[proc 0][Train](43000/60000) average loss: 0.23325314317643642

Not sure why this is happening.
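When sweeping parameters from a notebook, launching each run through `subprocess` with a timeout instead of `!` makes a stalled run surface as a timeout and lets the loop move on. This is a minimal sketch that uses only the `dglke_train` flags shown in the Quick Start; `sample_params`, `build_command`, `run_with_timeout`, the default dataset name, and the sampling ranges are hypothetical. It works around the stall and does not explain it.

```python
import random
import subprocess

def sample_params(rng):
    """Draw one random configuration (the ranges here are illustrative only)."""
    return {
        "model_name": rng.choice(["TransE_l1", "TransE_l2"]),
        "hidden_dim": rng.choice([50, 100, 200, 400]),
        "gamma": rng.choice([10.0, 12.0, 19.9]),
        "lr": rng.choice([0.01, 0.1, 0.25]),
        "regularization_coef": rng.choice([1e-9, 2e-8, 1e-7]),
    }

def build_command(params, dataset="FB15k", max_step=60000):
    """Build the dglke_train argument list for one configuration."""
    cmd = ["dglke_train", "--dataset", dataset,
           "--batch_size", "1000", "--neg_sample_size", "200",
           "--batch_size_eval", "16", "--log_interval", "1000",
           "--max_step", str(max_step), "-adv", "--test"]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd

def run_with_timeout(cmd, timeout_s):
    """Return the exit code, or None if the process exceeded timeout_s.

    subprocess.run kills the direct child on timeout; worker processes
    spawned by that child may still need separate cleanup.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None

# Example loop (commented out so that nothing is launched on import):
# rng = random.Random(0)
# for _ in range(5):
#     params = sample_params(rng)
#     code = run_with_timeout(build_command(params), timeout_s=3600)
#     print(params, "timed out" if code is None else f"exit code {code}")
```

A run that returns `None` can then be logged together with its parameters, which also helps to check whether the stall depends on a particular configuration.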

DGL assert fails when setting `exclude_positive=True` in Sampler

I needed to make some changes to the train script. In particular, I wanted to change the head and tail samplers in training to exclude_positive=True here and here, but as soon as I do that, this assert in dgl fails:

python3: /opt/dgl/src/graph/sampler.cc:1186: dgl::NegSubgraph dgl::{anonymous}::EdgeSamplerObject::genNegEdgeSubgraph(const dgl::Subgraph&, const string&, int64_t, bool, bool): Assertion `prev_neg_offset + neg_sample_size == neg_vids.size()' failed.

Am I using the sampler incorrectly here?

Why does my dglke_eval not work?

When I train TransE using the given command and then evaluate the model, the evaluation function doesn't work, and no error is reported.

Loss function is not the same as the one described for TransE

Hi,
I have been using the out-of-the-box TransE algorithm that comes with DGL-KE. Thanks for your excellent and kind work!
However, I ran into a question about the loss function while tracing through the source code in
dglke/models/general_models.py, in the forward method, between lines 370 and 399, as shown in the figure below:

It does NOT seem consistent with the loss function described in the paper linked on dgl-ke's official GitHub homepage, as shown in the figure below:

In the paper, the loss is stated to take just the following two forms:

This is not the same as what is implemented in dgl-ke's source code, as mentioned above,
so I'm wondering why the source code of general_models.py changed the form of the loss.
Does it bring any improvement over the original two loss functions in your paper?

Looking forward to your reply.
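For reference, two loss families that are commonly discussed for knowledge graph embeddings are a logistic loss and a pairwise margin ranking loss. The sketch below writes both in plain NumPy on precomputed scores. It is a generic illustration, not a transcription of the paper's equations or of `general_models.py`, and the function names are hypothetical.

```python
import numpy as np

def logistic_loss(pos_scores, neg_scores):
    """Mean of log(1 + exp(-y * score)), y = +1 for positives, -1 for negatives."""
    pos = np.logaddexp(0.0, -np.asarray(pos_scores, dtype=np.float64))
    neg = np.logaddexp(0.0, np.asarray(neg_scores, dtype=np.float64))
    return float(np.mean(np.concatenate([pos.ravel(), neg.ravel()])))

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - pos + neg) over each positive and its negatives."""
    pos = np.asarray(pos_scores, dtype=np.float64)[:, None]  # (num_pos, 1)
    neg = np.asarray(neg_scores, dtype=np.float64)           # (num_pos, num_neg)
    return float(np.mean(np.maximum(0.0, margin - pos + neg)))

# One positive triple with two corrupted (negative) triples.
pos_scores = [2.0]
neg_scores = [[0.0, 1.5]]
print(logistic_loss(pos_scores, neg_scores))
print(margin_ranking_loss(pos_scores, neg_scores, margin=1.0))
```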

GNN Models

Hi,
Thanks for sharing this library for distributed training.
Is there any plan to add GNN models such as RGCN?
What needs to be done to add GNN models and train them in a distributed environment?

Thanks