pathpy / pathpyg

GPU-accelerated Next-Generation Network Analytics and Graph Learning for Time Series Data on Complex Networks.

Home Page: https://www.pathpy.net

License: GNU Affero General Public License v3.0

graph-learning network-analysis deep-learning

pathpyg's Introduction

pathpyG

GPU-accelerated Next-Generation Network Analytics and Graph Learning for Time Series Data on Dynamic Networks.

Documentation

Online documentation is available at pathpy.net.

The docs include a tutorial, an API reference, and other useful information.

Dependencies

pathpyG supports Python 3.7+.

Installation requires numpy, scipy, torch, and torch-geometric.

Installation

The development version can be installed from GitHub as follows:

pip install git+https://github.com/pathpy/pathpyg.git

Testing

To test pathpyG, run pytest in the root directory.

This will exercise both the unit tests and the docstring examples.

Development

pathpyG development takes place on GitHub: https://github.com/pathpy/pathpyG

Please submit any reproducible bugs you encounter to the issue tracker.

pathpyg's Issues

Make GitHub tests work

Although the tests are now running, 13 of them fail because CUDA is not available. As far as I know, the default GitHub runners do not have a GPU. We should consider

setting up self-hosted runners to test on GPU (that is what PyG does: https://github.com/pyg-team/pytorch_geometric/blob/master/.github/workflows/full_gpu_testing.yml)
setting up the tests so that they can run on either CPU or GPU (@IngoScholtes)

Update continuous integration/continuous deployment (CI/CD) pipeline

It seems the microsoft/[email protected] CI action is not working with flit.

/opt/hostedtoolcache/Python/3.11.3/x64/bin/python: No module named flit
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
Error: Process completed with exit code 1.

Implement HotVis higher-order layout

Unintended Boolean Conversions and Ordering Changes in `IndexMap`

Due to the new internal representation of node_ids in IndexMap as numpy.array, there are sometimes some unintended conversions from integers to booleans happening. I also noticed that the ordering of strings is sometimes different than what was inserted at the beginning.

This was first encountered by @lisiq and @chrisbloecker. Maybe they have some more details on this.

Write Documentation

With #35 now merged, the first steps to having a complete documentation are done.
In the next steps, everybody can help to write some parts so that we can complete the documentation soon. Open ToDos:

Check content (I suggest having at least one person check what another person wrote (Four Eyes Principle))
- Check home screen (Especially check the "hero" image and see if someone has a better idea of visualizing PathPyG?)
- Check Getting started (Maybe someone who hasn't done the setup can try if it works)
- Check Contributing page (and try the setup here as well)
Complete Code reference via docstrings
Complete tutorials via Jupyter notebooks
Check legal stuff (Copyright? Impressum? ...)

Setter for node/edge attributes not working

Add to_undirected() for TemporalGraph class

Visualization temporal attributes for nodes

Reintroduce the function from pathpy3 with time-varying node attributes (e.g., colors).

Bug in `Graph.is_edge`

The current implementation raises an IndexError when testing for the presence of a directed edge, where the source node has outdegree zero in the underlying network.

Example:

g = pp.Graph.from_edge_list([['a','b'], ['a','c'], ['a','a']])
g.is_edge('b', 'a')
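A minimal sketch of an edge test that cannot raise an IndexError for nodes with outdegree zero, because it compares against the whole edge index instead of indexing into a per-node neighbor list. The function signature, the `node_ids` list, and the use of NumPy in place of torch tensors are assumptions made for illustration; this is not the pathpyG implementation.

```python
import numpy as np


def is_edge(edge_index, node_ids, v, w):
    """Return True if the directed edge (v, w) is present.

    edge_index: 2 x E array-like of (source, destination) node indices.
    node_ids:   list of node ids; position in the list is the node index.
    Hypothetical stand-alone helper, not the pathpyG method.
    """
    idx = {n: i for i, n in enumerate(node_ids)}
    if v not in idx or w not in idx:
        return False
    src, dst = np.asarray(edge_index)
    # Vectorised comparison: works even if v never appears as a source.
    return bool(np.any((src == idx[v]) & (dst == idx[w])))


# Edges from the example above: a->b, a->c, a->a
edge_index = [[0, 0, 0], [1, 2, 0]]
print(is_edge(edge_index, ["a", "b", "c"], "b", "a"))  # False
print(is_edge(edge_index, ["a", "b", "c"], "a", "b"))  # True
```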

Refactor PathData to split implementations for Walks and DAGs

Currently, the PathData object can simultaneously hold DAGs and Walks, which is handy for certain temporal graphs but also complicates the code. It also comes at a performance penalty, as we (i) have to store the type of each path (walk or DAG) and (ii) treat each path differently when calculating the k-th order edge index. This prevents vector operations and is likely to eat up the performance gain that we get from the joint storage.

I propose to split those implementations into two classes WalkData and DAGData that both inherit from a common PathData base class.
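One way the proposed hierarchy could look, as a rough sketch. Only the class names PathData, WalkData and DAGData come from the proposal; the `append` and `edges` methods and the plain-list storage are hypothetical and chosen only to show how each subclass can treat all of its paths uniformly.

```python
from abc import ABC, abstractmethod


class PathData(ABC):
    """Common base class: storage and bookkeeping shared by both path types."""

    def __init__(self):
        self.paths = []

    def append(self, path):
        self.paths.append(path)

    @property
    def num_paths(self):
        return len(self.paths)

    @abstractmethod
    def edges(self):
        """Return all first-order edges as a list of (source, target) pairs."""


class WalkData(PathData):
    """Each path is a node sequence, so edges are consecutive node pairs."""

    def edges(self):
        return [(u, v) for walk in self.paths for u, v in zip(walk, walk[1:])]


class DAGData(PathData):
    """Each path is already stored as a list of (source, target) edges."""

    def edges(self):
        return [edge for dag in self.paths for edge in dag]
```

Because neither subclass has to store or branch on a per-path type, each `edges` implementation is a single uniform pass over its data.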

Improve CUDA performance of `PathData` class

A major performance bottleneck in the current implementation is the use of a dictionary to store path tensors in the class PathData. This creates a huge performance overhead when using CUDA compared to the CPU, as each individual tensor in the dictionary must be copied to the GPU separately.

A better solution could possibly be implemented based on the new nested tensor class available in PyTorch 2.1. This would allow us to store all paths with varying lengths in a single nested tensor, i.e. we would only need a single copy operation when moving the data to the GPU.

Extend `IndexMap.to_idxs(...)` to any shape

The method IndexMap.to_idxs(...) is meant as a shortcut for converting an iterable of ids to the corresponding indices. In some cases one might want to pass a numpy.array or a torch.Tensor with a specific shape. Currently we only allow conversions for one-dimensional iterables, but it would be great if the input could have any shape and the output had the same shape and potentially also the same type. The same would make sense for the other direction (i.e. to_ids(...)).
Example for the desired behaviour:

>>> mapping = IndexMap(["a", "b", "c"])
>>> print(mapping.to_idxs(np.array([
...         ["a", "a", "b"],
...         ["b", "c", "c"]])))
np.array([
   [0, 0, 1],
   [1, 2, 2]])
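The desired behaviour can be sketched with NumPy by flattening the input, mapping each id through a dictionary, and restoring the original shape. The free function `to_idxs` and its `node_ids` argument are hypothetical stand-ins for the IndexMap method; torch tensors and type preservation are not handled here.

```python
import numpy as np


def to_idxs(node_ids, ids):
    """Map an array-like of ids (any shape) to indices of the same shape.

    node_ids: list of ids; position in the list is the index.
    Hypothetical helper, not the IndexMap implementation.
    """
    id_to_idx = {node_id: i for i, node_id in enumerate(node_ids)}
    arr = np.asarray(ids)
    # tolist() converts NumPy scalars back to plain Python objects
    # so that they hash like the dictionary keys.
    flat = [id_to_idx[x] for x in arr.ravel().tolist()]
    return np.array(flat, dtype=np.int64).reshape(arr.shape)


result = to_idxs(["a", "b", "c"], [["a", "a", "b"], ["b", "c", "c"]])
print(result.tolist())  # [[0, 0, 1], [1, 2, 2]]
```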

Add option to calculate forces between arbitrary node pairs

Publish stable version for CEE520

Use IndexMapping as node labels by default

I think it would be a good idea if the plot function used the ID-Index mapping stored in Graph.mapping as node labels by default (if such a mapping is defined). If no such mapping is defined, the plot function should use the node indices as labels by default.

This would allow users to omit passing the plot argument node_label=g.mapping.node_id.

Specify pathpyG design principles

Specify design principles in readme.md

Conversion of notebooks of ML4Nets course

To test whether the functionality needed for the ML4Nets course has been implemented in pathpyG, we should start converting the practice notebooks, which are available here.

Add RollingTimeWindow iterator class

Edge Weight Representation

In the previous versions, we saved the number of times each walk or DAG appeared as a count/weight. This would translate to a graph feature for each PyG Data object. For the path statistics, it is probably more convenient to have the count/weight as edge weight. At first we added it in both ways for all representations where it is possible, but that leads to an error when using the PyG DataLoader in MultiOrderModel.from_DAGs. This is fixed for now (#154), but we should think about which representation is best, also keeping the order detection (#107) in mind.

Idea: Time-Resolved Line Graph Transformations

Continuing on the discussion I had with @IngoScholtes today, where we were thinking about ways to speed up algorithms.temporal.temporal_graph_to_event_dag that creates a DAG from a temporal graph given a specific time window delta. Since I got this idea on the bike ride home today, I thought that I will write it down before I forget it again. We can discuss it in more detail in the upcoming sprint:

I know that we need the event DAG to construct the higher-order graphs for the temporal interactions. Do we need this anywhere else?
Because if not, then I may have an idea that would skip this step and directly construct the higher-order graph from the temporal graph. The procedure would be as follows:

Construct a graph where each timestamped edge corresponds to one edge in the graph where the time stamp is the edge weight. This might lead to duplicate edges but with different edge weights (time stamps).
Use the indexing-based lift order function inspired by PyG's Line Graph transformation (#132). This will create a 2nd order graph with all possible walks of length 2 without respecting the time. Note that since the edges now correspond to nodes, the edge weights now correspond to node features.
Subtract the node feature (edge timestamp) of the source node from the destination node in the higher-order edge index. This can be done either with message passing, but a simple diff = edge_weight[edge_index[1]] - edge_weight[edge_index[0]] should also do the trick.
Create a mask that filters the higher-order edges 0 < diff < delta and filter the higher-order edge index correspondingly. (Potentially remove isolated nodes.)
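The four steps can be illustrated on a toy example. This sketch uses NumPy instead of torch and finds candidate second-order edges with a dense all-pairs comparison, which is quadratic in the number of events and only stands in for the indexing-based lift-order function mentioned above. The function name and argument layout are assumptions.

```python
import numpy as np


def temporal_line_graph(src, dst, t, delta):
    """Return a 2 x M index array of time-respecting second-order edges.

    Each column (i, j) says: event i is followed by event j, i.e.
    dst[i] == src[j] and 0 < t[j] - t[i] < delta.
    Hypothetical helper for illustration only.
    """
    src, dst, t = (np.asarray(a) for a in (src, dst, t))
    # Step 2: all event pairs that share the intermediate node (time ignored).
    i, j = np.nonzero(dst[:, None] == src[None, :])
    # Step 3: time difference between the two events of each pair.
    diff = t[j] - t[i]
    # Step 4: keep only pairs that respect the time window.
    mask = (diff > 0) & (diff < delta)
    return np.stack([i[mask], j[mask]])


# Events: 0->1 @1.0, 1->2 @2.5, 1->2 @10.0, 2->3 @3.0
ho_edges = temporal_line_graph(
    src=[0, 1, 1, 2], dst=[1, 2, 2, 3], t=[1.0, 2.5, 10.0, 3.0], delta=3.0
)
print(ho_edges.tolist())  # [[0, 1], [1, 3]]
```

Note that the time stamps here are continuous values, and the duplicate edge 1->2 is kept as two separate events.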

Benefits:

This will enable the use of continuous time stamps and does not require discretization.
Can be done on GPU

Potential Problems:

By using the time stamps as edge weights, we need to allow multiple edges between a pair of nodes. We need a sorted edge index (by source) for the line graph transformation to work. This might cause problems because coalesce in PyG removes all duplicate edges and aggregates the edge weights. I am not sure how or if there is sorting implemented for TemporalData but there might already be solutions there.
I am almost, but not 100%, certain that this will work for any $k$ -> $k+1$. It should work because each node in the second-order graph corresponds to a time-stamped edge in the first-order graph. So we should only need to filter once from order $1$ -> $2$, and then all further transformations will respect the time automatically. But this leads to the next problem:
The higher-order representations will have multiple nodes for the same edge (at different time steps). Is this the higher-order representation that we need? I think it might be the same representation that we would get from the event DAG. If not, can we maybe map easily to the representation we need?
This might be very memory intensive since this will at first create a new graph with as many nodes as there are time-stamped edges and even more edges.

Refactor `lift_order`-logic from `MultiOrderModel` to `algorithms`

The methods

lift_order_edge_index
lift_order_edge_index_weighted
aggregate_edge_index
potentially have uses outside of MultiOrderModel, e.g. for the centrality calculations. Since they are static methods anyway, it would make sense to separate them from the class altogether and put them into utils or algorithms.

`HigherOrderGraph.predecessors` returns unexpected values

I suspect that the predecessors method returns wrong values for HigherOrderGraphs.
Right now, it inherits the implementation from Graph.

A self-contained example:

import pathpyG as pp
paths = pp.PathData.from_csv('../data/tube_paths_train.ngram')

k = 2
higher_order = pp.HigherOrderGraph(paths, order=k, node_id=paths.node_id)

nodes = list(higher_order.nodes)

node = nodes[2]

print(node)
print(list(higher_order.successors(node)))      
# -> seems to be okay
print(list(higher_order.predecessors(node)))    
# -> wrong, expected [('Embankment', 'Waterloo') , ('Westminster', 'Waterloo'), ('Kennington', 'Waterloo'), ...]

results in

('Waterloo', 'Southwark')
[('Southwark', 'London Bridge')]
[('Southwark', 'Waterloo'), ('Southwark', 'Waterloo'), ('Southwark', 'Waterloo'), ('Southwark', 'Waterloo')]

You can copy the example to a notebook in the tutorials folder.

Implement Random Walk simulations

Publish stable version for NeurIPS 2023

Refactor TemporalGraph constructor to utilize TemporalData

Refactor core classes `Graph` and `PathData`

check use of pyG.Data (consider use of HeteroData)
implement more convenient uid-index mapping
implement support for string-type node/edge/graph features
simplify interface to access to attributes

Switch from Global `device`-`config` to `tensor`-wise configuration

As we now have almost all of the core functions implemented in torch operations that can utilize the GPU, we have fixed most runtime issues and now run into the next bottleneck, namely memory (GPU RAM). It thus might become necessary to give the user more control over which parts of a Graph object are stored on CPU or GPU. Although it is more convenient if this is controlled via a global configuration, it might become necessary to add to(device) methods to enable batch-wise computations.

Implement plot functions for pathpyG

notebook crashing when plotting weighted graph

Plotting weighted (higher-order) graphs in the dbgnn tutorial currently crashes the notebook. The resulting HTML output has a lot of NaN errors, which are likely due to the layout algorithm.

Minimal example to reproduce the issue:

import pathpyG as pp
pp.config['torch']['device'] = 'cpu'

# read paths
paths = pp.PathData.from_csv('../data/temporal_clusters.ngram')

ho = pp.HigherOrderGraph(paths, order=1)

pp.plot(ho, filename='test.html')

The resulting HTML file does not render and the console logs a large number of errors.

Remove node_id from pyG.Data and pyG.TemporalData in Graph and TemporalGraph classes

Plot to tikz/pdf currently not working

I get the following error:

No such file or directory: '/opt/conda/lib/python3.10/site-packages/pathpyG/visualisations/templates/tikz-network.sty'

Change behavior of TemporalGraph.to_static_graph()

returned graph should be weighted simple graph
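A small sketch of the intended aggregation, assuming that the weight of a static edge is the number of time-stamped occurrences of that edge (the issue does not specify the weight definition). The function name and the tuple-based input are hypothetical.

```python
from collections import Counter


def to_static_weighted(temporal_edges):
    """Collapse (source, target, time) triples into a weighted simple graph.

    Returns a dict mapping (source, target) to the number of occurrences.
    """
    return dict(Counter((u, v) for u, v, _t in temporal_edges))


print(to_static_weighted([(0, 1, 1), (0, 1, 2), (1, 2, 2)]))
# {(0, 1): 2, (1, 2): 1}
```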

Add proper logging functionality

use .toml config to change logging
Create a custom pathpyG logger
allow logging to file and console
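A possible starting point using only the standard library `logging` module. The function name `get_logger`, the logger name and the format string are assumptions; the level and log file would presumably be read from the .toml config, which is left out here.

```python
import logging
import sys


def get_logger(name="pathpyG", level=logging.INFO, logfile=None):
    """Create a named logger that writes to the console and optionally to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Avoid duplicate handlers when the function is called more than once.
    logger.handlers.clear()

    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")

    console = logging.StreamHandler(sys.stderr)
    console.setFormatter(fmt)
    logger.addHandler(console)

    if logfile is not None:
        file_handler = logging.FileHandler(logfile)
        file_handler.setFormatter(fmt)
        logger.addHandler(file_handler)

    return logger
```

Calling `get_logger(logfile="pathpyG.log").info("message")` would then write the same record to both targets.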

Fix file dependencies in Jupyter Tutorials

With the recent merges #110 and #109, we now provide links to check out our tutorials directly via Google Colab. But there are some errors on Colab due to the file dependencies in some of the notebooks.
Open Tasks:

Fix FileNotFoundError in temporal_graphs.ipynb
Come up with a solution to missing files when using Google Colab (temporal_graphs.ipynb, dbgnn.ipynb and paths_higher_order.ipynb)
Since only the Notebook is opened via Colab, the accompanying files from the repository are not copied over and, thus, an exception is thrown. This could be either fixed by using datasets from Netzschleuder instead or by manually downloading the files from our GitHub Repo in the notebook if they are not available.
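The second option (download on demand) could look roughly like this at the top of a notebook. The helper name `ensure_file` is hypothetical, and the URL in the usage comment is a placeholder, not the real location of the data files.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_file(path, url):
    """Return `path`, downloading it from `url` first if it does not exist."""
    path = Path(path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))
    return path


# Usage in a notebook (placeholder URL, replace with the raw file location):
# data_file = ensure_file("../data/temporal_clusters.ngram",
#                         "https://example.org/temporal_clusters.ngram")
```

When the notebook runs inside a full checkout of the repository, the file already exists and no download happens.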

Cleanup pyproject.toml

Remove references to Microsoft
~~use poetry for dependency management~~
Add linting rules
Add formatting rules

CUDA vs CPU

Add code to detect if CUDA is available; otherwise, use CPU as default.
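A minimal sketch of such a check, based on `torch.cuda.is_available()`. The function name `default_device` is an assumption, and the ImportError fallback is only there so that the snippet also runs in an environment without torch.

```python
def default_device():
    """Return "cuda" if a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is a hard dependency of pathpyG; this branch only keeps
        # the snippet runnable on its own.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(default_device())
```

The result could then be used as the initial value of the global device setting, e.g. `pp.config['torch']['device']`.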

Performance issue with HTML-based plots

In the latest version of the plot function, interactive HTML-based plots of larger networks are very slow.

For example, the plots of second-order networks for the example in the dbgnn.ipynb tutorial do not work anymore and render the notebook unusable. Changing the network to an undirected network does not help either. Plotting of the same networks worked fine with the previous version.

Image visualization in nodes

Display (selected) nodes using small images instead of simple circles.

Implement temporal Cartesian layout

Time-varying node positions with an interpolation between the specified points in time, i.e. a "temporal layout" with time-stamped positions.
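As an illustration of the idea, the position of a single node at time t can be obtained from its time-stamped positions with `numpy.interp`. Linear interpolation is an assumption here (the issue only says "an interpolation"), and the function name is hypothetical.

```python
import numpy as np


def interpolate_position(times, positions, t):
    """Linearly interpolate a node's (x, y) position at time t.

    times:     increasing time stamps at which the position is specified.
    positions: one (x, y) pair per time stamp.
    Outside the given time range, the first/last position is used.
    """
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    x = np.interp(t, times, positions[:, 0])
    y = np.interp(t, times, positions[:, 1])
    return float(x), float(y)


print(interpolate_position([0, 10], [[0, 0], [10, 20]], 5))  # (5.0, 10.0)
```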

Refactor HigherOrderGraph and PathData classes

Change filenames

Currently, files containing a class are named in CamelCase. We should follow the PEP 8 convention and name them in lowercase, because the current naming can lead to confusion during imports.
Example:
You would think that by importing the following, you would have the Graph class.

from pathpyG.core import Graph

Instead, you now have the module Graph. To get the class Graph, you would need to either qualify it or import it from the module explicitly:

g = Graph.Graph(...)

from pathpyG.core.Graph import Graph

I think we should rename all files containing classes and import the classes in each module's __init__.py so that we could import e.g. Graph in all of the following ways:

from pathpyG import Graph
from pathpyG.core import Graph
from pathpyG.core.graph import Graph

I hope this will not lead to any circular imports! XD

`TemporalGraph.get_window(...)` and `TemporalGraph.get_snapshot` behaviour not as expected

As I understand both methods, get_snapshot should return a TemporalGraph that contains all edges from the original graph that occur in the time window from start to end. This means that the returned edge index can potentially have very different sizes. See the following example for the expected behaviour:

>>> t = TemporalGraph(TemporalData(
...    src=[0,0,1,0],
...    dst=[1,2,2,1],
...    t=[1,1,1,2]))
>>> t.get_snapshot(start=1, end=2)
TemporalGraph containing the first 3 edges

>>> t.get_snapshot(start=2, end=3)
TemporalGraph containing only the last edge

get_window(...) instead should return a window of fixed size end-start that contains the edges starting at index start and ending (non-inclusive) at index end in an edge index sorted by time. Thus, e.g. get_window(0,5) would return the first 5 events. The current implementation of get_window does exactly this but get_snapshot(...) also does this.

ALSO: For get_window to consistently work, we have to ensure to keep an edge_index sorted by time. This is not the case after applying shuffle_time in its current implementation.
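The difference between the two expected behaviours can be sketched on plain arrays. Both functions are hypothetical stand-ins that operate on NumPy arrays rather than on a TemporalGraph; the half-open interval [start, end) for the snapshot is inferred from the example above.

```python
import numpy as np


def get_snapshot(src, dst, t, start, end):
    """Select all events with start <= time < end (variable number of events)."""
    src, dst, t = (np.asarray(a) for a in (src, dst, t))
    mask = (t >= start) & (t < end)
    return src[mask], dst[mask], t[mask]


def get_window(src, dst, t, start, end):
    """Select events start..end-1 by position in time-sorted order (fixed size)."""
    src, dst, t = (np.asarray(a) for a in (src, dst, t))
    # Sorting here makes the result independent of the storage order,
    # e.g. after time stamps have been shuffled.
    order = np.argsort(t, kind="stable")[start:end]
    return src[order], dst[order], t[order]


src, dst, t = [0, 0, 1, 0], [1, 2, 2, 1], [1, 1, 1, 2]
print(len(get_snapshot(src, dst, t, 1, 2)[0]))  # 3
print(len(get_snapshot(src, dst, t, 2, 3)[0]))  # 1
print(len(get_window(src, dst, t, 0, 2)[0]))    # 2
```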

Python version 3.9 and lower not supported

In hist_plots.py, we are using a Python functionality added in version 3.10. We should think about if and how we ensure backwards compatibility.
https://github.com/pathpy/pathpyG/blob/5c82779f608848584ecc38dd34aea4afc5d7ecfc/src/pathpyG/visualisations/hist_plots.py#L45C1-L57C31

Implement order detection

Add convenience method to `MultiOrderModel` to create a `PyG`-`Data` object that can be used by DBGNN

The title should say it all...

PyG backwards compatibility 2.3.1 <- 2.4.0

This already came up in one of our meetings: self.data.keys works with or without (), depending on the installed version. The cause is that PyG turned keys() into a method in version 2.4.0, while it was a property in version 2.3.1.

Version 2.4.0: https://pytorch-geometric.readthedocs.io/en/2.4.0/generated/torch_geometric.data.Data.html#torch_geometric.data.Data.keys
Version 2.3.1: https://pytorch-geometric.readthedocs.io/en/2.3.1/generated/torch_geometric.data.Data.html#torch_geometric.data.Data.keys
We should somehow ensure backwards compatibility.
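One version-agnostic option is a small accessor that checks whether `keys` is callable. The helper name `data_keys` is hypothetical, and the two dummy classes below only mimic the property and method variants so that the snippet runs without PyG installed.

```python
def data_keys(data):
    """Return the attribute keys of a data object as a list.

    Works whether `data.keys` is a property (as in PyG 2.3.1)
    or a method (as in PyG 2.4.0).
    """
    keys = data.keys
    return list(keys() if callable(keys) else keys)


class OldStyleData:
    """Dummy stand-in: `keys` is a property."""

    @property
    def keys(self):
        return ["edge_index", "node_id"]


class NewStyleData:
    """Dummy stand-in: `keys` is a method."""

    def keys(self):
        return ["edge_index", "node_id"]


print(data_keys(OldStyleData()))  # ['edge_index', 'node_id']
print(data_keys(NewStyleData()))  # ['edge_index', 'node_id']
```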

Extend `append_DAG` to work with nodes appearing at multiple points in time

@VincenzoPerri raised the issue that we currently cannot add a DAG that should contain the same node at multiple time stamps. We already fixed this issue for append_walk by reindexing all nodes contained in the walk. It is not as trivial for DAGs since a node can have multiple incoming and outgoing edges so we cannot use a new index for each occurrence in the edge index.
One solution could be that we use the combination of node index and time stamp to reindex the nodes.
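A sketch of that reindexing idea: every distinct (node, time stamp) pair gets its own index, so the same node can appear at several points in time while all edges touching one occurrence still share one index. The function name and the input format are assumptions.

```python
def reindex_by_time(edges):
    """Reindex DAG edges given as ((u, t_u), (v, t_v)) pairs.

    Returns the edges over new integer indices and the mapping from
    (node, time) pairs to those indices.
    """
    index = {}
    new_edges = []
    for source, target in edges:
        # setdefault assigns the next free index on first occurrence only.
        i = index.setdefault(source, len(index))
        j = index.setdefault(target, len(index))
        new_edges.append((i, j))
    return new_edges, index


edges = [((0, 1), (1, 2)), ((1, 2), (0, 3)), ((0, 1), (2, 2))]
new_edges, index = reindex_by_time(edges)
print(new_edges)  # [(0, 1), (1, 2), (0, 3)]
```

Here node 0 occurs at times 1 and 3 and receives two different indices (0 and 2), while both outgoing edges of its first occurrence share index 0.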

Add node labels and edge directions to plots

Fix switch between versions 0.0.1 and 0.0.1-dev on web page

Currently, the version shown for the code reference is preset to 0.0.1, while the latest version from the main branch is under 0.0.1-dev. Moreover, switching from 0.0.1 to 0.0.1-dev fails for most of the pages, ending in a 404 error (e.g. from the Tutorial page).

I would suggest removing the version tags for now and keeping only the dev version, which always refers to the latest version in the main branch. We can later add support for version tags defined in GitHub.

Code reference pages return 404 error

This holds for all classes in the core module, e.g. this one, as well as for the io and nn modules.