xgfs / verse

Reference implementation of the paper VERSE: Versatile Graph Embeddings from Similarity Measures

Home Page: http://tsitsul.in/publications/verse/

License: MIT License

Python 18.79% Shell 0.11% Makefile 1.16% C++ 79.93%
Topics: graph, similarity-measures, embeddings, machine-learning, machine-learning-algorithms, graph-algorithms


VERSE: Versatile Graph Embeddings from Similarity Measures

This repository provides a reference implementation of VERSE as well as links to the data.

Installation and usage

We make VERSE available in two forms: the fast, optimized C++ code that was used in the experiments, and a more convenient Python wrapper. Note that the wrapper is still experimental and may not provide optimal performance.

For C++ executables:

cd src && make;

should be enough on most platforms. If you need to change the default compiler (e.g. to the Intel compiler), use:

make CXX=icpc

VERSE is able to encompass diverse similarity measures under its model. For performance reasons, we have implemented three different similarities separately.

Use the command

verse -input data/karate.bcsr -output karate.bin -dim 128 -alpha 0.85 -threads 4 -nsamples 3

to run the default version (which corresponds to PPR similarity) with embedding dimension 128, PPR alpha 0.85, 4 threads, and 3 negative samples.
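For intuition, the personalized PageRank (PPR) similarity that the default binary trains against can be sketched with a simple power iteration. This is an illustration only, not the repository's C++ sampling code; the function name and the toy graph are made up:

```python
import numpy as np

def ppr_row(adj, source, alpha=0.85, iters=100):
    """One row of the |V| x |V| PPR similarity matrix, personalized to `source`."""
    n = adj.shape[0]
    # Row-normalize the adjacency matrix into a transition matrix.
    trans = adj / adj.sum(axis=1, keepdims=True)
    restart = np.zeros(n)
    restart[source] = 1.0
    pi = restart.copy()
    for _ in range(iters):
        # Power iteration: follow an edge with prob. alpha, restart with prob. 1 - alpha.
        pi = alpha * (pi @ trans) + (1 - alpha) * restart
    return pi

# Tiny path graph 0 - 1 - 2: node 2, two hops from the source, gets the smallest score.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
scores = ppr_row(adj, source=0)
```

Computing one such row per node yields the |V| x |V| similarity; the C++ implementation avoids materializing this matrix and instead samples from the PPR distribution directly.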

Graph file format

This implementation uses a custom graph format, binary compressed sparse row (BCSR), for efficiency and reduced memory usage. A converter for three common graph formats (MATLAB sparse matrix, adjacency list, edge list) can be found in the python directory of the project. Usage:

$ convert-bcsr --help
Usage: convert-bcsr [OPTIONS] INPUT OUTPUT

  Converter for three common graph formats (MATLAB sparse matrix, adjacency
  list, edge list) can be found in the root directory of the project.

Options:
  --format [mat|edgelist|weighted_edgelist|adjlist]
                                  File format of input file
  --matfile-variable-name TEXT    variable name of adjacency matrix inside a
                                  .mat file.
  --undirected / --directed       Treat graph as undirected.
  --sep TEXT                      Separator of input file
  --help                          Show this message and exit.
  1. --format adjlist for an adjacency list, e.g.:

     1 2 3 4 5 6 7 8 9 11 12 13 14 18 20 22 32
     2 1 3 4 8 14 18 20 22 31
     3 1 2 4 8 9 10 14 28 29 33
     ...
    
  2. --format edgelist for an edge list, e.g.:

     1 2
     1 3
     1 4
     ...
    
  3. --format weighted_edgelist for a weighted edge list, e.g.:

     1 2 0.1
     1 3 2
     1 4 0.5
     ...
    
  4. --format mat for a MATLAB MAT file containing an adjacency matrix (note: you must also specify the variable name of the adjacency matrix via --matfile-variable-name)

Working with embeddings in Python

Michael Loster provided an example of working with the embedding file from Python. After learning the embeddings, the saved binary file can be used as follows:

# The binary file that is the output of the compiled verse binary.
embedding_file = "/path/to/binary/embeddings.bin"

# An optional csv that should contain the mapping of id to some string key.
# E.g., each line should look like "0,http://dbpedia.org/resource/Audi".
index_file = "/path/to/uri/id/mapping.csv"

# Our embeddings have 128 dimensions.
embeddings = Embedding(embedding_file, 128, index_file)
audi_embedding = embeddings['http://dbpedia.org/resource/Audi']
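The `Embedding` helper used above is not included in the snippet. A minimal sketch of what it might look like, assuming the binary file is a plain row-major float32 matrix of shape |V| x dim (which matches how the C++ code appears to dump its weight matrix) and the optional CSV maps a row id to a string key; treat this as an illustration, not the original class:

```python
import csv
import numpy as np

class Embedding:
    """Minimal reader for a VERSE output file: |V| x dim float32 values, row-major."""

    def __init__(self, embedding_file, dim, index_file=None):
        data = np.fromfile(embedding_file, dtype=np.float32)
        assert data.size % dim == 0, "file size is not a multiple of dim"
        self.vectors = data.reshape(-1, dim)
        self.key_to_row = {}
        if index_file is not None:
            with open(index_file, newline="") as f:
                for row_id, key in csv.reader(f):
                    self.key_to_row[key] = int(row_id)

    def __getitem__(self, key):
        # Accept either an integer node id or a string key from the index CSV.
        row = key if isinstance(key, int) else self.key_to_row[key]
        return self.vectors[row]

# Tiny demo: write two vectors of dim 3 and read them back.
demo = np.arange(6, dtype=np.float32).reshape(2, 3)
demo.tofile("demo_embeddings.bin")
with open("demo_index.csv", "w", newline="") as f:
    f.write("0,node_a\n1,node_b\n")

emb = Embedding("demo_embeddings.bin", 3, "demo_index.csv")
```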

Citation

If you use the code or the datasets, please consider citing the paper:

@inproceedings{Tsitsulin:2018:VVG:3178876.3186120,
    author = {Tsitsulin, Anton and Mottin, Davide and Karras, Panagiotis and M\"{u}ller, Emmanuel},
    title = {VERSE: Versatile Graph Embeddings from Similarity Measures},
    booktitle = {Proceedings of the 2018 World Wide Web Conference},
    series = {WWW '18},
    year = {2018},
    isbn = {978-1-4503-5639-8},
    location = {Lyon, France},
    pages = {539--548},
    numpages = {10},
    url = {https://doi.org/10.1145/3178876.3186120},
    doi = {10.1145/3178876.3186120},
    acmid = {3186120},
    publisher = {International World Wide Web Conferences Steering Committee},
    address = {Republic and Canton of Geneva, Switzerland},
    keywords = {feature learning, graph embedding, graph representations, information networks, node embedding, vertex similarity},
}

Contact

echo "%7=87@=<2=<>5.27" | tr '#-)/->' '_-|'

Contributors

hadisfr, janehmueller, milost, xgfs

verse's Issues

[Question] How is the similarity matrix calculated using PPR?

Hi Authors,

I am trying to understand how the similarity matrix using PPR is calculated in the implementation. Which part of the source code implements the PPR similarity computation (my C++ knowledge is not that good)? Most resources for PPR calculation return a 1 × |V| vector, but as per your paper the similarity function should be V × V → R. So I am trying to understand how a |V| × |V| similarity is obtained using PPR. I am trying to implement the algorithm using PyTorch and DGL. Any help is much appreciated.

How to set the value of the epoch count?

Hello Authors,

I was going through the VERSE paper and code. I am curious about one parameter, 'n_epochs'. The value for this parameter is '100000', as per the experiments in the paper and also in the code. Is there any specific reason for using this value, or can I reduce it when working with a bigger graph (around 1 million nodes)?

It would be very kind if you could please elaborate on this.

Regards,
Sidhant

question on embeddings

I'm trying to find out what I can do with the embeddings.
Would it make sense to build a graph where sentences are nodes and an edge exists between two nodes if their sentences share some words, and then use VERSE to obtain sentence embeddings? Or is this a bad idea?

question on node classification

I had a look at the VERSE paper but can't understand what is done in the case of node classification.
Are the embeddings fed to liblinear to do a binary yes/no classification, or what is being used there?
Many thanks.

why not use C

Hi Authors,

I have a question: why not use C to convert the raw data to the binary CSR matrix?

For large amounts of data, using Python makes me waste too much time.

Hope that helps.

dywlegend

Segmentation fault while trying to run verse

I created a bcsr file using the convert Python script for a graph with 23,506,056 nodes and 77,166,666 edges. I get a segmentation fault when trying to run VERSE training. Any suggestions on how I can enable more logging to identify what's causing this?

Command used:
./verse -input ../python/out.bcsr -output embed.bin -dim 128 -alpha 0.85 -threads 32 -nsamples 3

map the nodes to embeddings

I converted a graph, represented as a weighted edge list, to the bcsr format using your conversion code and got the embeddings back. How can I know which node is associated with which embedding? Is the ordering the same as the order in the edge list? That is, if the first line of the edge list shows that node 10 is connected to node 2 with weight 0.5 (10,2,0.5), is the first embedding associated with node 10 and the second with node 2?

Not-so-good clustering in experiments

Hi!
I tried to use VERSE to visualize a not-so-large (nv: 23463, ne: 35923) well-clustered graph. I used the PPR version with --dim 2 (Total steps (mil): 2346.3), then used the two dimensions as x and y (after normalization) and pre-calculated cluster IDs (Louvain method) as colour to visualize the embedded graph.
I ended up with this: [attached visualization]
I was expecting a visualization in which all clusters are perfectly separated, as in the example shown in your article.
Any idea which config I should use, or what was wrong with my procedure?

Weight type in converter

In the readme example for weighted_edgelist, the weights are floats. I get AssertionError: negative weights are not allowed when converting my graph to bcsr, even though there were only positive weights. In converter.py the weights are converted to np.int32, so the problem is a type overflow when large float64 values are converted to int32. I think you should either change the conversion pipeline, or change the example to integer weights and add an assertion.

How to read the output binary file in Python?

Hi Authors,

Can you please let me know how to read the output binary file as a matrix of |vocab| × |dim| size, or in some other consumable fashion? How do I get the vocabulary?

Pankesh

labels of orkut

Hi, there seems to be only the graph structure in the dataset. I would appreciate it if you could share the labels of Orkut used in the paper.

How to use these .exe files?

Hi Author,
I want to know whether the .exe files with "-weighted" are suitable for weighted graphs, and whether "-neigh" corresponds to the "Adjacency similarity" in the paper. Thanks.

icpc command not found, Installation and usage step

Dear Team,

I have been trying to use this repository from my Ubuntu system. When I run the command "make CXX=icpc", I get the error "make: icpc: command not found \n makefile:25: recipe for target 'verse-library' failed". I have looked up this error and tried multiple ways to resolve it, but somehow nothing seems to work. It would be great if you could provide a way to resolve this issue, or point me to the binary file generated by this step, so that I can use it in the Python wrapper class and execute the code.

Kindly provide any help or feedback on this.

Regards,
Sidhant

Issue while using the make command!!

Hi Author,

I am getting the below error while using the make command.

verse.cpp: In function 'void* aligned_malloc(size_t, size_t)':
verse.cpp:80:5: error: 'posix_memalign' was not declared in this scope
80 | if (posix_memalign(&result, align, size)) result = 0;
| ^~~~~~~~~~~~~~
make: *** [makefile:22: verse] Error 1

How can I fix this issue?

question on input format

I'm looking into building an R wrapper around this. Can you explain the BCSR format in words? I'm trying to understand what is in the offsets and edges arrays in your C++ code, so that I am 100% sure what they should contain if I pass the data from R to C++.

// offsets: nv + 1 CSR row pointers (cumulative edge counts per node)
offsets = static_cast<int *>(aligned_malloc((nv + 1) * sizeof(int32_t), DEFAULT_ALIGN));
// edges: concatenated adjacency lists, ne edge targets in total
edges = static_cast<int *>(aligned_malloc(ne * sizeof(int32_t), DEFAULT_ALIGN));
embFile.read(reinterpret_cast<char *>(offsets), nv * sizeof(int32_t));
offsets[nv] = (int)ne; // sentinel: end of the last node's adjacency list
embFile.read(reinterpret_cast<char *>(edges), sizeof(int32_t) * ne);
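In words: `offsets[i]` is the index into `edges` where node i's adjacency list starts (with a sentinel `offsets[nv] = ne`), and `edges` holds all adjacency lists concatenated, so the neighbours of node i are `edges[offsets[i] .. offsets[i+1]]`. A hedged Python sketch mirroring the C++ reader above; the real file also carries a header with nv and ne, which this sketch assumes were already parsed:

```python
import io
import struct

def read_bcsr_body(f, nv, ne):
    """Read the CSR arrays: nv int32 offsets, then ne int32 edge targets."""
    offsets = list(struct.unpack(f"{nv}i", f.read(4 * nv)))
    offsets.append(ne)  # sentinel, like offsets[nv] = ne in the C++ code
    edges = struct.unpack(f"{ne}i", f.read(4 * ne))
    # Neighbours of node i are the slice between consecutive offsets.
    neigh = lambda i: edges[offsets[i]:offsets[i + 1]]
    return offsets, edges, neigh

# Demo: 3-node graph with node 0 -> {1, 2}, node 1 -> {0}, node 2 -> {0}.
buf = io.BytesIO(struct.pack("3i", 0, 2, 3) + struct.pack("4i", 1, 2, 0, 0))
offsets, edges, neigh = read_bcsr_body(buf, nv=3, ne=4)
```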

Differences between different versions of the code

Hi!
In the article it is said that multiple similarity functions, specifically PageRank, Adj., and SimRank, can be used.
In the codebase, apart from the recently-added weighted versions, there are three versions of the code with slight differences: verse, verse-neigh, and verse-simrank.
I can't tell which is which. Can you please explain a little? 🤔 Do they correspond to PageRank, Adj., and SimRank, respectively?
